Basic installation, configuration, and use of Sqoop and Flume

1. Sqoop

1.1 Introduction to Sqoop

  • Sqoop is a tool for batch data migration between structured data stores and Hadoop

    • The structured data stores are relational databases such as MySQL and Oracle.

      • Import data from relational databases into Hadoop and its related systems

      • Extract data from Hadoop systems and export it to relational databases

    • Under the hood, the data transfer is implemented with MapReduce

      • While a command runs, the job progress goes from map 0%, reduce 0% to map 100%, reduce 100%
      • Figure: 01 MR execution process (sample record: id=1, name=zhangsan, age=18)

Figure: 02 The relationship between Sqoop and data transfer
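
A minimal sketch of what such a transfer looks like on the command line (assuming a MySQL database named sqoop that contains a student table, as prepared later in this chapter; full examples appear in the Sqoop command section below):

#Import one MySQL table into HDFS; Sqoop runs this as a map-only MapReduce job
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --table student --target-dir /user/student -m 1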

1.2 Sqoop installation

  • Download, upload, unzip, rename and authorize

    • https://mirrors.tuna.tsinghua.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
  • Upload it to the /home/hadoop directory (you can drag and drop the file into the home directory directly in Xshell)

  • #Sqoop installation: unzip into /usr/local
    sudo tar -xvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /usr/local
    #Rename the directory
    sudo mv /usr/local/sqoop-1.4.7.bin__hadoop-2.6.0/ /usr/local/sqoop
    #Give the hadoop user ownership
    sudo chown -R hadoop /usr/local/sqoop
    

1.3 Sqoop configuration and verification

1.3.1 sqoop configuration

#1. Modify the configuration file
mv /usr/local/sqoop/conf/sqoop-env-template.sh /usr/local/sqoop/conf/sqoop-env.sh

# Enter the configuration directory and write the paths of each component
cd /usr/local/sqoop/conf/
sudo vim sqoop-env.sh

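A minimal sketch of sqoop-env.sh after editing, assuming the other components are installed under /usr/local as elsewhere in this guide (adjust the paths to your layout and omit entries for components you have not installed):

#Tell Sqoop where the other components live
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HBASE_HOME=/usr/local/hbase
export HIVE_HOME=/usr/local/hive
export ZOOCFGDIR=/usr/local/zookeeper/conf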

#2. Upload jar files
cd /usr/local/sqoop/lib/
#2.1 Upload the MySQL JDBC driver jar (drag and drop it into the Xshell window)

#2.2 Copy the Hive driver file
cp /usr/local/hive/lib/hive-common-2.3.7.jar /usr/local/sqoop/lib/

#3. Configure environment variables
#Edit environment variables
vim /home/hadoop/.bashrc

#Add the following content at the end of the environment variable
export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin

#Refresh environment variables
source /home/hadoop/.bashrc

#Verify whether the installation is successful
sqoop version

1.3.2 Create sqoop user in MySQL

#View administrator account and password
sudo cat /etc/mysql/debian.cnf

#Log in with the account and password you viewed
mysql -u debian-sys-maint -p

#Execute the following commands after logging in successfully (see the example below)
#Create the sqoop user
create user 'sqoop'@'%' identified by '123456';
#Grant the user privileges on the sqoop database
grant all privileges on sqoop.* to 'sqoop'@'%';
#Refresh to make authorization effective
flush privileges;

#quit
exit;

Figure: executing the user creation and authorization statements in MySQL
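
To confirm that the new account works before continuing, you can log in with it once (a quick check using the password 123456 set above):

#Log in as the sqoop user and list the databases visible to it
mysql -u sqoop -p123456 -e 'show databases;'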

1.3.3 Verify that Sqoop runs successfully, and common errors

  • #Test whether the database connection succeeds
    sqoop list-databases --connect jdbc:mysql://master:3306/?useSSL=false --username sqoop --password 123456

  1. Running this command may report an error.

  2. **The error message is as follows:** ERROR manager.CatalogQueryManager: Failed to list databases

com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

Full output below:

  • Warning: /usr/local/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
    Please set $HCAT_HOME to the root of your HCatalog installation.
    Warning: /usr/local/sqoop/../accumulo does not exist! Accumulo imports will fail.
    Please set $ACCUMULO_HOME to the root of your Accumulo installation.
    . . .
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
    Caused by: java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at com.mysql.jdbc.StandardSocketFactory.connect(StandardSocketFactory.java:211)
    at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:301)
    ... 24 more
    

Reason: remote login to MySQL is not enabled; the configuration needs to be modified

**Solution:**
#Check the firewall status (it should be inactive)
sudo ufw status
#Check which address port 3306 is bound to
netstat -an | grep 3306

Figure: 07-5 Sqoop connection failure troubleshooting

#Edit the MySQL configuration file
#Comment out the bind-address line (line 43): #bind-address = 127.0.0.1
sudo vim /etc/mysql/mysql.conf.d/mysqld.cnf

#For the change to take effect, restart the virtual machine
sudo reboot
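
Rebooting the whole virtual machine is the simplest option; if you prefer not to reboot, restarting only the MySQL service should also pick up the change (assuming an Ubuntu system managed by systemd, which the /etc/mysql paths above suggest):

#Restart just the MySQL service, then confirm it now listens on all addresses
sudo systemctl restart mysql
netstat -an | grep 3306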

Figure: 07-6 Sqoop connection failure resolved

Success status:

Figure: 07-7 Successful connection to MySQL

**If Sqoop fails to transfer data to Hive, note:** the earlier Hive initialization was done in single-machine pseudo-distributed mode, which is inconsistent with the current cluster state. You need to re-initialize Hive and delete the hive metadata database in MySQL.

#If you need to reconfigure Hive
#After configuring Hive on the master, do not initialize it yet; follow one of the scenarios below.
#Scenario 1: the stand-alone node was never configured properly
    #Follow the Hive stand-alone installation steps. After configuring Hive on the master, synchronize it to slave1 and slave2.
    scp -r /usr/local/hive hadoop@slave1:/usr/local/
    scp -r /usr/local/hive hadoop@slave2:/usr/local/

    #Synchronize system environment variables
    scp /home/hadoop/.bashrc hadoop@slave1:/home/hadoop
    scp /home/hadoop/.bashrc hadoop@slave2:/home/hadoop

    #Refresh on three nodes
    source /home/hadoop/.bashrc

#Scenario 2: the stand-alone setup succeeded and the other nodes were cloned from it
    #Check the Hive metadata in MySQL
    #View the default administrator account and password with the following commands:
    sudo cat /etc/mysql/debian.cnf
    mysql -u debian-sys-maint -p
    #Enter the password shown by the cat command

    #List the databases on the current node
    show databases;

    #On all three nodes, log in as the administrator, check whether a hive database exists, and delete it if it does.
    drop database hive;

    #Three node initialization operations
    schematool -dbType mysql -initSchema

Verification status (Hive initialization successful):

Figure: 07-4 Hive re-initialization

Execute the query database command again:

#Test whether the database can be successfully connected
sqoop list-databases --connect jdbc:mysql://master:3306/?useSSL=false --username sqoop --password 123456

1.3.4 Data preparation before use

1.3.4.1 MySQL data preparation (the following operations can be performed in DBeaver)

Figure: 07-8 MySQL preparation

#(1) Log in to MySQL as the sqoop user, use the following command:
#If you connect to MySQL through DBeaver, you do not need this command.
mysql -u sqoop -p

#(2) Create and switch to the sqoop database, use the following commands:
create database sqoop;
use sqoop;

#(3) Create the student table to demonstrate importing MySQL data into HDFS, use the following command:
CREATE TABLE IF NOT EXISTS `student`(
`id` int PRIMARY KEY COMMENT 'number',
`name` varchar(20) COMMENT 'name',
`age` int COMMENT 'age'
)COMMENT 'Student table';

#(4) Insert some data into the student table, use the following command:
INSERT INTO student VALUES(1, 'zhangsan', 20);
INSERT INTO student VALUES(2, 'lisi', 24);
INSERT INTO student VALUES(3, 'wangwu', 18);
INSERT INTO student VALUES(4, 'zhaoliui', 22);

#(5) Create the student2 table to load the data exported by Hive, use the following command:
CREATE TABLE IF NOT EXISTS `student2`(
`id` int PRIMARY KEY COMMENT 'number',
`name` varchar(20) COMMENT 'name',
`age` int COMMENT 'age'
)COMMENT 'Student table';
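
A quick way to check the preparation (in DBeaver or the mysql client) before moving on:

-- student should contain the four inserted rows; student2 should exist and be empty
SELECT * FROM student;
SELECT * FROM student2;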

If DBeaver fails to connect to MySQL remotely:


#Check the firewall status
sudo ufw status
#Check which address port 3306 is bound to
netstat -an | grep 3306

#Edit the MySQL configuration file
#Comment out the bind-address line (line 43)
sudo vim /etc/mysql/mysql.conf.d/mysqld.cnf
#bind-address = 127.0.0.1

#For the change to take effect, restart the virtual machine
sudo reboot

After data preparation is complete:

  • student table:

Figure: 07-9 Connect to MySQL

  • student2 table:

Figure: 07-10 Prepare Hive table

1.3.4.2 Hive preparation (also executed in DBeaver):
--(1) Start Hive, use the following command (run hiveserver2 in the shell):
hiveserver2
--(2) Open DBeaver to connect to Hive

--(3) Create sqoop database, use the following command:
CREATE DATABASE sqoop;

--(4) Using sqoop database, use the following command:
USE sqoop;

--(5) Create a student table to load the data imported by MySQL, use the following command:
CREATE TABLE IF NOT EXISTS student(
id INT COMMENT 'number',
name STRING COMMENT 'name',
age INT COMMENT 'age'
) COMMENT 'Student table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' ';
  • If you get an error when starting Hive after building Hadoop HA
    When Hive starts, it reports java.net.UnknownHostException: ns

Figure: 04 Hive error exception

Make sure to copy Hadoop's core-site.xml and hdfs-site.xml into Hive's conf directory:

cp /usr/local/hadoop/etc/hadoop/hdfs-site.xml /usr/local/hive/conf
cp /usr/local/hadoop/etc/hadoop/core-site.xml /usr/local/hive/conf

1.4 Sqoop commands

Common commands for Sqoop

| Command | Description |
| --- | --- |
| list-databases | List all database names |
| list-tables | List all tables in a given database |
| import | Import data into the Hadoop cluster (HDFS itself, Hive, HBase, etc.) |
| export | Export data from the HDFS cluster |
| help | Print Sqoop help information |
| version | Print Sqoop version information |

Common parameters of Sqoop

| Parameter | Description |
| --- | --- |
| --connect | JDBC URL of the relational database to connect to |
| --username | Username for connecting to the database |
| --password | Password for connecting to the database |

Sqoop's import command parameters

| Parameter | Description |
| --- | --- |
| --fields-terminated-by | Column separator used in Hive; the default is the comma |
| --lines-terminated-by | Line separator used in Hive; the default is \n |
| --append | Append data to an existing dataset in HDFS. With this parameter, Sqoop first imports the data into a temporary directory and then merges it. |
| --columns | Specify the columns to import |
| -m or --num-mappers | Start N map tasks to import data in parallel; the default is 4 |
| --query or -e | Import the result of a query; must be used together with --target-dir (or --hive-table when importing into Hive), and if the query has a where clause the $CONDITIONS token must be appended to the condition |
| --table | Table name in the relational database |
| --target-dir | HDFS path where the imported data is stored |
| --null-string | If a string-type column is NULL, replace it with the specified string |
| --null-non-string | If a non-string-type column is NULL, replace it with the specified string |
| --check-column | Column used to judge what to import incrementally |
| --incremental | Incremental mode: append or lastmodified |
| --last-value | Value marking the position from which the incremental import starts |

Sqoop's export command parameters

| Parameter | Description |
| --- | --- |
| --input-fields-terminated-by | Column separator used in Hive; the default is the comma |
| --input-lines-terminated-by | Line separator used in Hive; the default is \n |
| --export-dir | Source directory on HDFS where the data to export is stored |
| -m or --num-mappers | Start N map tasks to export data in parallel; the default is 4 |
| --table | The RDBMS table to export into |
| --update-key | Column(s) used to match rows to update |
| --update-mode | updateonly (default) or allowinsert |
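
A sketch of how --update-key and --update-mode work together (using the same connection, table, and warehouse path as the Hive-to-MySQL export example further below; with allowinsert, rows whose id already exists in MySQL are updated and new rows are inserted):

sqoop export --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --table student2 --export-dir /usr/local/hive/warehouse/sqoop.db/student --input-fields-terminated-by " " --update-key id --update-mode allowinsert -m 1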

Sqoop command examples

  • Import to HDFS
#View existing database names in MySQL
sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username sqoop --password 123456

#To view the tables in the Sqoop database in MySQL, use the following command
sqoop list-tables --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456

#Import all MySQL data to HDFS and execute the following command
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --table student --target-dir /user/student --delete-target-dir -m 1

#After the execution is completed, go to HDFS to check whether the data was imported successfully, use the following command:
hdfs dfs -cat /user/student/part-m-00000

#Import part of the MySQL data to HDFS (filter during import)
#--query is not used together with --table
#Must add $CONDITIONS after where
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --target-dir /user/student --delete-target-dir -m 1 --query 'select * from student where age <20 and $CONDITIONS'

View the path specified by --target-dir

Figure: 05 Viewing the contents on the HDFS web management interface

  • Import into Hive

    #Import MySQL data into hive
    sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --table student --target-dir /user/student2 --delete-target-dir --hive-import --fields-terminated-by " " --columns id,name,age --hive-overwrite --hive-table sqoop.student -m 1
    
    #Import some MySQL data into hive (overwrite import)
    sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --target-dir /user/student2 --delete-target-dir -m 1 --query 'select * from student where age <20 and $CONDITIONS' --hive-import --fields-terminated-by " " --columns id,name,age --hive-overwrite --hive-table sqoop.student2
    
    #Incrementally import some MySQL data into hive
    #--incremental append cannot be used with --delete-target-dir
    sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --table student --target-dir /user/student2 --hive-import --fields-terminated-by " " --columns id,name,age --hive-table sqoop.student2 --check-column id --incremental append --last-value 3 -m 1
    
    • Problem: insufficient permissions when importing data, so the import into Hive fails

      • Figure: 06 web management error

        • Operating on files in the HDFS web page reports: Permission denied: user=dr.who

        • #Fix the permissions in Xshell
          hdfs dfs -chmod -R 755 /
          
  • Import to HBase

#To import data into HBase, you need to create the corresponding student table in advance
#Before importing the data:
hbase shell
create 'student','info'

#Then run the import command
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --table student --hbase-table "student" --hbase-row-key "id" --column-family "info" --columns "id,name,age" --split-by id -m 1
  • Hive export to MySQL
#Sqoop export command example
sqoop export --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --table student2 --export-dir /usr/local/hive/warehouse/sqoop.db/student --input-fields-terminated-by " " -m 1
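
After the export finishes, you can check the result on the MySQL side (in DBeaver or the mysql client):

-- The rows exported from Hive should now appear in student2
SELECT * FROM sqoop.student2;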


2. Flume

2.1 Flume introduction

  • Flume is a distributed, highly reliable, and highly available platform for collecting, aggregating, and moving large amounts of log data from many different sources into a central store (such as HDFS)

    • 1. Data can be collected at high speed, and the collected data can be stored on HDFS in the desired file format and with the desired compression;

    • 2. Transactions guarantee that no data is lost during collection;

      • atomicity
    • 3. Some sources guarantee that, after Flume crashes and restarts, collection resumes from the last collection point, truly achieving zero data loss.

2.2 Flume architecture

  • The agent is the smallest independently running unit in Flume; each agent is a JVM (Java virtual machine) process
    • It contains three core components: source, channel, and sink
    • Figure: 08 Flume principle

2.3 Flume installation

#Download, upload, unzip, rename and authorize
https://mirrors.tuna.tsinghua.edu.cn/apache/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz


#Upload to the /home/hadoop directory
sudo tar -xvf apache-flume-1.9.0-bin.tar.gz -C /usr/local

sudo mv /usr/local/apache-flume-1.9.0-bin/ /usr/local/flume

sudo chown -R hadoop /usr/local/flume

2.4 Flume configuration

Configuring environment variables

#Edit environment variables
vim /home/hadoop/.bashrc

#Add the following content at the end of the environment variable
export FLUME_HOME=/usr/local/flume
export PATH=$PATH:$FLUME_HOME/bin

#Refresh environment variables
source /home/hadoop/.bashrc
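
To verify the installation (this mirrors the sqoop version check done earlier):

#Check that Flume is on the PATH and print its version
flume-ng version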

Configure Agent

# Give the agent a name a1
# Set the sources of a1 to be called r1
a1.sources = r1

# Set the sinks of a1 to be called k1
a1.sinks = k1

# Set the channels of a1 to be called c1
a1.channels = c1

Configure Source

# Set the type of r1 to exec, which is used to collect data generated by the command
a1.sources.r1.type = exec

# Set r1 to collect data generated by the tail -F command
a1.sources.r1.command = tail -F /home/hadoop/tail-test.txt

Configuring Channel

  • Two common types: MemoryChannel and FileChannel
# Set the type of c1 to memory
a1.channels.c1.type = memory

#Set the buffer capacity of c1
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

Configuring Sink

# Set the type of k1 to hdfs
a1.sinks.k1.type = hdfs
# Set the k1 output path and create the corresponding directory on hdfs according to time
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
#Set the data output by k1 to be saved as text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Put the timestamp in the header of the data
a1.sinks.k1.hdfs.useLocalTimeStamp = true

Assembling Source, Channel and Sink

# Set r1 to connect to c1
a1.sources.r1.channels = c1

# Set k1 to connect to c1
a1.sinks.k1.channel = c1

2.5 Flume usage

Understand the tail -F command

#(1) Enter the /home/hadoop directory and use the following command:
cd /home/hadoop/
#(2) Create the tail-test.txt file, use the following command:
touch tail-test.txt
#(3) Append some content to the tail-test.txt file and use the following command:
echo 'hello 11111' >> tail-test.txt
echo 'hello 22222'>> tail-test.txt
echo 'hello 33333'>> tail-test.txt
#(4) To view the contents of the tail-test.txt file, use the following command:
cat tail-test.txt
#(5) Open a new Xshell window to monitor changes to the contents of the tail-test.txt file, use the following command:
tail -F tail-test.txt
#(6) Return to the previous xshell window, continue to append some content to the tail-test.txt file, use the following command:
echo 'hello 44444' >> tail-test.txt
echo 'hello 55555'>> tail-test.txt
echo 'hello 66666'>> tail-test.txt
#Check whether the tail -F command monitors content changes

Use flume

Goal: Collect the new content in the tail-test.txt file to HDFS

# Use Flume to collect the new content in the tail-test.txt file to HDFS.
#(1) Open a new xshell window, create the exec-memory-hdfs.properties file, and use the following command:
touch exec-memory-hdfs.properties
#(2) Edit the exec-memory-hdfs.properties file and fill in the following content:
sudo vim exec-memory-hdfs.properties
# Single node flume configuration file
# Give the agent a name a1
# Set the sources of a1 to be called r1
a1.sources = r1

# Set the sinks of a1 to be called k1
a1.sinks = k1

# Set the channels of a1 to be called c1
a1.channels = c1

#Set the type of r1 to exec, which is used to collect data generated by the command
a1.sources.r1.type = exec

# Set r1 to collect data generated by the tail -F command
a1.sources.r1.command = tail -F /home/hadoop/tail-test.txt

# Set the type of c1 to memory
a1.channels.c1.type = memory

#Set the buffer capacity of c1
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Set the type of k1 to hdfs
a1.sinks.k1.type = hdfs

# Set the k1 output path and create the corresponding directory on hdfs according to time
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-

#Set the data output by k1 to be saved as text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

# Put the timestamp in the header of the data
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Set r1 to connect to c1
a1.sources.r1.channels = c1

# Set k1 to connect to c1
a1.sinks.k1.channel = c1

Start Flume

  • Start ZooKeeper on all three nodes
zkServer.sh start
  • Start HDFS and YARN first
start-dfs.sh
start-yarn.sh
  • Start Flume
#Start Flume
flume-ng agent -n a1 -c conf -f /home/hadoop/exec-memory-hdfs.properties
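
If you want to watch the agent's activity while testing, it can also be started with the log printed to the console (a commonly used flume-ng option):

#Start Flume with console logging for easier debugging
flume-ng agent -n a1 -c conf -f /home/hadoop/exec-memory-hdfs.properties -Dflume.root.logger=INFO,console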

Verify flume

#(4) Append a large amount of data to the tail-test.txt file in the first xshell window
echo 'hello 44444' >> tail-test.txt
echo 'hello 55555' >> tail-test.txt
echo 'hello 6666' >> tail-test.txt

  • Go to the HDFS web monitoring page to check whether data has been collected.

  1. You can see the newly generated directory

Figure: 09 Flume verification

  2. In Xshell, view the collected data with the following command (replace the directory and file names with the ones actually created):

    hdfs dfs -cat /flume/events/<directory name>/<file name>

Figure: 10 Flume verification
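
If you are not sure which time-stamped directory Flume created under /flume/events, listing the output tree first makes the cat step easier:

#Recursively list everything Flume has written, then cat the file found there
hdfs dfs -ls -R /flume/events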