1. Sqoop
1.1 Introduction to Sqoop
- Sqoop is a tool for batch data migration between structured data stores and Hadoop.
- Structured data stores include relational databases such as MySQL and Oracle.
- Import: bring data from relational databases into Hadoop and its related systems.
- Export: extract data from Hadoop systems and export it to relational databases.
- The underlying implementation uses MapReduce to transfer the data.
  - During command execution you will see progress such as map 0%, reduce 0% -> map 100%, reduce 100%.
id | name | age |
---|---|---|
1 | zhangsan | 18 |
1.2 Sqoop installation
- Download, upload, unzip, rename, and authorize
  - https://mirrors.tuna.tsinghua.edu.cn/apache/sqoop/1.3.7/sqoop-1.3.7.bin__hadoop-2.6.0.tar.gz
- Upload the archive to the /home/hadoop directory (drag and drop it into the home directory in Xshell)

~~~shell
# Sqoop installation: extract the archive
sudo tar -xvf sqoop-1.3.7.bin__hadoop-2.6.0.tar.gz -C /usr/local
# Rename the directory
sudo mv /usr/local/sqoop-1.3.7.bin__hadoop-2.6.0/ /usr/local/sqoop
# Authorize: give ownership to the hadoop user
sudo chown -R hadoop /usr/local/sqoop
~~~
1.3 Sqoop configuration and verification
1.3.1 Sqoop configuration
~~~shell
#1. Modify the configuration file
mv /usr/local/sqoop/conf/sqoop-env-template.sh /usr/local/sqoop/conf/sqoop-env.sh
# Enter the configuration directory and write in the paths of each component
cd /usr/local/sqoop/conf/
sudo vim sqoop-env.sh
~~~
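The entries in sqoop-env.sh depend on where each component is installed. A minimal sketch, assuming the /usr/local/hadoop and /usr/local/hive paths used elsewhere in this guide (adjust them to your own layout):

~~~shell
# sqoop-env.sh (sketch): point Sqoop at the local Hadoop and Hive installations
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
~~~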
~~~shell
#2. Upload jar files
cd /usr/local/sqoop/lib/
#2.1 Upload the MySQL driver file (drag and drop it into Xshell)
#2.2 Copy the Hive driver file
cp /usr/local/hive/lib/hive-common-2.3.7.jar /usr/local/sqoop/lib/
#3. Configure environment variables
# Edit the environment variables
vim /home/hadoop/.bashrc
# Add the following content at the end of the file
export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
# Refresh the environment variables
source /home/hadoop/.bashrc
# Verify whether the installation succeeded
sqoop version
~~~
1.3.2 Create sqoop user in MySQL
~~~shell
# View the administrator account and password
sudo cat /etc/mysql/debian.cnf
# Log in with the account and password you just viewed
mysql -u debian-sys-maint -p
# Execute the following statements after logging in successfully (refer to the picture below)
# Create the sqoop user
create user 'sqoop'@'%' identified by '123456';
# Grant privileges to the user
grant all privileges on sqoop.* to 'sqoop'@'%';
# Refresh to make the grants take effect
flush privileges;
# Quit
exit;
~~~
Example of executing the commands in MySQL:
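As an optional check (not part of the original steps), you can log in as the newly created user to confirm the account and grants work; the password is the '123456' set above:

~~~shell
# Log in as the sqoop user and list the visible databases
mysql -u sqoop -p -e "show databases;"
~~~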
1.3.3 Verify that Sqoop runs successfully, and common errors
- Test whether the database connection succeeds:

~~~shell
# Test whether the database can be connected successfully
sqoop list-databases --connect jdbc:mysql://master:3306/?useSSL=false --username sqoop --password 123456
~~~
If the command reports an error:
~~~shell
# Test whether the database can be connected successfully
sqoop list-databases --connect jdbc:mysql://master:3306/?useSSL=false --username sqoop --password 123456
~~~

**The error message is as follows:**

~~~
ERROR manager.CatalogQueryManager: Failed to list databases
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
~~~
The full output is below:
~~~
Warning: /usr/local/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /usr/local/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
...
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at com.mysql.jdbc.StandardSocketFactory.connect(StandardSocketFactory.java:211)
    at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:301)
    ... 24 more
~~~

Reason: remote login to MySQL is not enabled, so the configuration needs to be modified.
**Solution:**

~~~shell
# Check the firewall status (the firewall should be inactive)
sudo ufw status
# Check the port
netstat -an | grep 3306
# Edit the configuration: comment out the bind-address line (line 43 of the file)
sudo vim /etc/mysql/mysql.conf.d/mysqld.cnf
# After modifying the configuration, restart the virtual machine
sudo reboot
~~~

Success status:
**If Sqoop fails to transfer data to Hive, please note:** the earlier Hive initialization was done in single-machine pseudo-distributed mode, which is inconsistent with the current cluster state. You need to re-initialize Hive and delete the hive metadata database in MySQL.
~~~shell
# If you need to reconfigure Hive:
# After configuring Hive on the master, do NOT initialize it yet; proceed according to your situation.

# Scenario 1: the stand-alone node was not configured properly.
# Follow the Hive stand-alone installation steps. After the master has Hive configured,
# synchronize it to slave1 and slave2:
scp -r /usr/local/hive hadoop@slave1:/usr/local/
scp -r /usr/local/hive hadoop@slave2:/usr/local/
# Synchronize the system environment variables
scp /home/hadoop/.bashrc hadoop@slave1:/home/hadoop
scp /home/hadoop/.bashrc hadoop@slave2:/home/hadoop
# Refresh on all three nodes
source /home/hadoop/.bashrc

# Scenario 2: the single machine succeeded and the other nodes were cloned from it.
# Check the Hive metadata information in MySQL.
# View the default account and password:
sudo cat /etc/mysql/debian.cnf
mysql -u debian-sys-maint -p   # enter the password shown by the cat command
# List the databases on the current node:
show databases;
# After logging in as the super user on all three nodes, check whether a hive database exists and delete it if it does:
drop database hive;
# Initialization on the three nodes:
schematool -dbType mysql -initSchema
~~~

Verification status – Hive initialization successful:
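As an optional sanity check (my addition, not part of the original steps), Hive's schematool can report the metastore it finds after initialization:

~~~shell
# Show the metastore connection details and schema version that schematool detects
schematool -dbType mysql -info
~~~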
Execute the list-databases command again:
~~~shell
# Test whether the database can be connected successfully
sqoop list-databases --connect jdbc:mysql://master:3306/?useSSL=false --username sqoop --password 123456
~~~
1.3.4 Data preparation before use
1.3.4.1 MySQL data preparation (the following operations can be performed in DBeaver)
~~~sql
-- (1) Log in to MySQL as the sqoop user (skip this step if you connect with DBeaver):
--     mysql -u sqoop -p
-- (2) Create and use the sqoop database:
create database sqoop;
use sqoop;
-- (3) Create the student table to demonstrate importing MySQL data into HDFS:
CREATE TABLE IF NOT EXISTS `student`(
  `id` int PRIMARY KEY COMMENT 'number',
  `name` varchar(20) COMMENT 'name',
  `age` int COMMENT 'age'
) COMMENT 'Student table';
-- (4) Insert some data into the student table:
INSERT INTO student VALUES(1, 'zhangsan', 20);
INSERT INTO student VALUES(2, 'lisi', 24);
INSERT INTO student VALUES(3, 'wangwu', 18);
INSERT INTO student VALUES(4, 'zhaoliui', 22);
-- (5) Create the student2 table to hold the data exported from Hive:
CREATE TABLE IF NOT EXISTS `student2`(
  `id` int PRIMARY KEY COMMENT 'number',
  `name` varchar(20) COMMENT 'name',
  `age` int COMMENT 'age'
) COMMENT 'Student table';
~~~
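As an optional check (my addition, assuming you run it on the MySQL host with the sqoop user's password), you can confirm from the shell that the rows were inserted:

~~~shell
# Query the student table through the mysql client; -D selects the sqoop database
mysql -u sqoop -p -D sqoop -e "select * from student;"
~~~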
If DBeaver fails to log in to MySQL remotely:
~~~shell
# Check the firewall status
sudo ufw status
# Check the port
netstat -an | grep 3306
# Edit the configuration: comment out the bind-address on line 43
sudo vim /etc/mysql/mysql.conf.d/mysqld.cnf
#bind-address = 127.0.0.1
# After modifying the configuration, restart the virtual machine
sudo reboot
~~~
After data preparation is complete:
- student table:
- student2 table:
1.3.4.2 Hive preparation (also executed in DBeaver):
~~~sql
-- (1) Start Hive by running the following command in the shell: hiveserver2
-- (2) Open DBeaver and connect to Hive
-- (3) Create the sqoop database:
CREATE DATABASE sqoop;
-- (4) Use the sqoop database:
USE sqoop;
-- (5) Create the student table to hold the data imported from MySQL:
CREATE TABLE IF NOT EXISTS student(
  id INT COMMENT 'number',
  name STRING COMMENT 'name',
  age INT COMMENT 'age'
) COMMENT 'Student table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
~~~
- If you get an error when starting Hive after building Hadoop HA:
  - When Hive starts it reports java.net.UnknownHostException: ns
  - Remember to copy HDFS's core-site.xml and hdfs-site.xml into Hive's conf directory:

~~~shell
cp /usr/local/hadoop/etc/hadoop/hdfs-site.xml /usr/local/hive/conf
cp /usr/local/hadoop/etc/hadoop/core-site.xml /usr/local/hive/conf
~~~
1.5 Sqoop command
Common commands for Sqoop
Command | Description |
---|---|
list-databases | List all database names |
list-tables | List all tables in a given database |
import | Import data into the Hadoop cluster (HDFS itself, Hive, HBase, etc.) |
export | Export data from the HDFS cluster to a relational database |
help | Print Sqoop help information |
version | Print Sqoop version information |
Common parameters of Sqoop
Parameter | Description |
---|---|
--connect | JDBC URL of the relational database to connect to |
--username | Username for connecting to the database |
--password | Password for connecting to the database |
Sqoop’s import command parameters
Parameters | Description |
---|---|
--fields-terminated-by | Column separator used in Hive; the default is a comma |
--lines-terminated-by | Line separator used in Hive; the default is \n |
--append | Append data to an existing dataset in HDFS. With this parameter, Sqoop imports data into a temporary directory first and then merges it |
--columns | Specify the columns to import |
-m or --num-mappers | Start N map tasks to import data in parallel; the default is 4 |
--query or -e | Import the result of a query. Must be used together with --target-dir (or --hive-table); if the query has a WHERE clause, the $CONDITIONS token must be appended to the condition |
--table | Table name in the relational database |
--target-dir | HDFS path where the imported data is stored |
--null-string | If a string-type column is null, replace it with the specified string |
--null-non-string | If a non-string-type column is null, replace it with the specified string |
--check-column | Column used to judge incremental imports |
--incremental | Mode: append or lastmodified |
--last-value | Value marking the position from which the incremental import starts |
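The null-handling parameters above are not exercised by the later examples, so here is a hypothetical sketch of how they combine (the target directory /user/student_nulls is made up for illustration; connection details follow the examples below):

~~~shell
# Import while replacing NULL columns with \N so Hive can recognize them
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 \
  --table student --target-dir /user/student_nulls --delete-target-dir \
  --null-string '\\N' --null-non-string '\\N' -m 1
~~~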
Sqoop’s export command parameters
Parameters | Description |
---|---|
--input-fields-terminated-by | Column separator used in Hive; the default is a comma |
--input-lines-terminated-by | Line separator used in Hive; the default is \n |
--export-dir | HDFS source directory where the data to export is stored |
-m or --num-mappers | Start N map tasks to export data in parallel; the default is 4 |
--table | Specify the RDBMS table to export into |
--update-key | Column(s) used to match rows when updating |
--update-mode | updateonly (default) or allowinsert |
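--update-key and --update-mode also do not appear in the examples below, so here is a hypothetical sketch of an "upsert"-style export built on the same student2 table and export directory used later (an illustration, not a required step):

~~~shell
# Re-export: rows whose id already exists in student2 are updated, new ids are inserted
sqoop export --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 \
  --table student2 --export-dir /usr/local/hive/warehouse/sqoop.db/student \
  --input-fields-terminated-by " " --update-key id --update-mode allowinsert -m 1
~~~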
Sqoop command examples
- Import to HDFS
~~~shell
# View the existing database names in MySQL
sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username sqoop --password 123456
# View the tables in the sqoop database in MySQL
sqoop list-tables --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456
# Import all the MySQL data into HDFS
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --table student --target-dir /user/student --delete-target-dir -m 1
# After the import finishes, check on HDFS whether the data was imported successfully
hdfs dfs -cat /user/student/part-m-00000
# Import part of the MySQL data into HDFS (filter while importing)
# --query cannot be used together with --table
# $CONDITIONS must be appended after the where clause
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --target-dir /user/student --delete-target-dir -m 1 --query 'select * from student where age <20 and $CONDITIONS'
~~~
View the path specified by --target-dir
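If you prefer the command line to the HDFS web page, a quick check (using the /user/student path from the example above) is:

~~~shell
# List the import target directory to confirm the part files exist
hdfs dfs -ls /user/student
~~~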
- Import into Hive
~~~shell
# Import MySQL data into Hive
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --table student --target-dir /user/student2 --delete-target-dir --hive-import --fields-terminated-by " " --columns id,name,age --hive-overwrite --hive-table sqoop.student -m 1
# Import part of the MySQL data into Hive (overwrite import)
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --target-dir /user/student2 --delete-target-dir -m 1 --query 'select * from student where age <20 and $CONDITIONS' --hive-import --fields-terminated-by " " --columns id,name,age --hive-overwrite --hive-table sqoop.student2
# Incrementally import part of the MySQL data into Hive
# --incremental append cannot be used together with --delete-target-dir
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --table student --target-dir /user/student2 --hive-import --fields-terminated-by " " --columns id,name,age --hive-table sqoop.student2 --check-column id --incremental append --last-value 3 -m 1
~~~
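To confirm the import from the shell, you can query Hive through HiveServer2 with beeline; this sketch assumes HiveServer2 is running on the default port 10000 and that the hadoop user may connect (adjust the URL and user to your setup):

~~~shell
# Query the imported table via beeline
beeline -u jdbc:hive2://localhost:10000 -n hadoop -e "select * from sqoop.student;"
~~~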
- Problem: insufficient permissions to import data, so the import into Hive fails
- Operating on files from the HDFS web page shows: Permission denied: user=dr.who

~~~shell
# In Xshell
hdfs dfs -chmod -R 755 /
~~~
- Import to HBase
~~~shell
# To import data into HBase, the corresponding table student needs to be created in advance
# Before importing the data:
hbase shell
create 'student','info'
# Run the import command
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --table student --hbase-table "student" --hbase-row-key "id" --column-family "info" --columns "id,name,age" --split-by id -m 1
~~~
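A quick check (my addition) is to scan the HBase table from the shell; piping the command into hbase shell runs it non-interactively:

~~~shell
# Scan the student table to confirm the imported rows
echo "scan 'student'" | hbase shell
~~~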
- Hive export to MySQL
~~~shell
# Sqoop export command example
sqoop export --connect jdbc:mysql://localhost:3306/sqoop --username sqoop --password 123456 --table student2 --export-dir /usr/local/hive/warehouse/sqoop.db/student --input-fields-terminated-by " " -m 1
~~~
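As before, an optional check (assuming you run it on the MySQL host with the sqoop user's password) confirms that the exported rows landed in MySQL:

~~~shell
# Inspect the student2 table that the export filled
mysql -u sqoop -p -D sqoop -e "select * from student2;"
~~~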
2. Flume
2.1 Flume Introduction
- Flume is a distributed, highly reliable, and highly available platform for collecting, aggregating, and moving large volumes of log data from different data sources into a central (**HDFS**) storage system.
- 1. Data can be collected at high speed, and the collected data can be stored on HDFS in the desired file format and with the desired compression method.
- 2. The transaction mechanism ensures that data is not lost during collection.
  - atomicity
- 3. Some sources allow Flume, after it crashes and restarts, to resume collecting from the last collection point, truly achieving zero data loss.
2.2 Flume Architecture
- The Agent is the smallest independently running unit in Flume; an agent is a JVM (Java virtual machine) process.
- It contains three core components: source, channel, and sink.
2.3 Flume installation
~~~shell
# Download, upload, unzip, rename, and authorize
# https://mirrors.tuna.tsinghua.edu.cn/apache/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
# Upload to the /home/hadoop directory
sudo tar -xvf apache-flume-1.9.0-bin.tar.gz -C /usr/local
sudo mv /usr/local/apache-flume-1.9.0-bin/ /usr/local/flume
sudo chown -R hadoop /usr/local/flume
~~~
2.4 Flume configuration
Configuring environment variables
~~~shell
# Edit the environment variables
vim /home/hadoop/.bashrc
# Add the following content at the end of the file
export FLUME_HOME=/usr/local/flume
export PATH=$PATH:$FLUME_HOME/bin
# Refresh the environment variables
source /home/hadoop/.bashrc
~~~
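A quick sanity check (my addition) is to ask Flume for its version; if the PATH change took effect, the command resolves without a full path:

~~~shell
# Confirm the flume-ng launcher is on the PATH
flume-ng version
~~~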
Configure Agent
~~~properties
# Give the agent the name a1
# Set a1's source to be called r1
a1.sources = r1
# Set a1's sink to be called k1
a1.sinks = k1
# Set a1's channel to be called c1
a1.channels = c1
~~~
Configure Source
~~~properties
# Set r1's type to exec, which collects the data produced by a command
a1.sources.r1.type = exec
# Set r1 to collect the data produced by the tail -F command
a1.sources.r1.command = tail -F /home/hadoop/tail-test.txt
~~~
Configuring Channel
- Two common types: MemoryChannel and FileChannel
~~~properties
# Set c1's type to memory
a1.channels.c1.type = memory
# Set c1's buffer capacity
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
~~~
Configuring Sink
~~~properties
# Set k1's type to hdfs
a1.sinks.k1.type = hdfs
# Set k1's output path; directories are created on HDFS according to the time
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
# Save the data output by k1 as plain text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Put a timestamp in the header of the data
a1.sinks.k1.hdfs.useLocalTimeStamp = true
~~~
Assembling Source, Channel and Sink
~~~properties
# Connect r1 to c1
a1.sources.r1.channels = c1
# Connect k1 to c1
a1.sinks.k1.channel = c1
~~~
2.5 Flume usage
Understand the tail -F command
~~~shell
# (1) Enter the /home/hadoop directory:
cd /home/hadoop/
# (2) Create the tail-test.txt file:
touch tail-test.txt
# (3) Append some content to the tail-test.txt file:
echo 'hello 11111' >> tail-test.txt
echo 'hello 22222' >> tail-test.txt
echo 'hello 33333' >> tail-test.txt
# (4) View the contents of the tail-test.txt file:
cat tail-test.txt
# (5) Open a new Xshell window to monitor changes to the tail-test.txt file:
tail -F tail-test.txt
# (6) Return to the previous Xshell window and continue appending content to tail-test.txt:
echo 'hello 44444' >> tail-test.txt
echo 'hello 55555' >> tail-test.txt
echo 'hello 66666' >> tail-test.txt
# Check whether the tail -F command detects the content changes
~~~
Use Flume
Goal: Collect the new content in the tail-test.txt file to HDFS
Use Flume to collect the new content of the tail-test.txt file into HDFS.

~~~shell
# (1) Open a new Xshell window and create the exec-memory-hdfs.properties file:
touch exec-memory-hdfs.properties
# (2) Edit the exec-memory-hdfs.properties file and fill in the content below:
sudo vim exec-memory-hdfs.properties
~~~

~~~properties
# Single-node Flume configuration file
# Give the agent the name a1
# Set a1's source to be called r1
a1.sources = r1
# Set a1's sink to be called k1
a1.sinks = k1
# Set a1's channel to be called c1
a1.channels = c1
# Set r1's type to exec, which collects the data produced by a command
a1.sources.r1.type = exec
# Set r1 to collect the data produced by the tail -F command
a1.sources.r1.command = tail -F /home/hadoop/tail-test.txt
# Set c1's type to memory
a1.channels.c1.type = memory
# Set c1's buffer capacity
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Set k1's type to hdfs
a1.sinks.k1.type = hdfs
# Set k1's output path; directories are created on HDFS according to the time
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
# Save the data output by k1 as plain text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Put a timestamp in the header of the data
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Connect r1 to c1
a1.sources.r1.channels = c1
# Connect k1 to c1
a1.sinks.k1.channel = c1
~~~
Start Flume
- Start ZooKeeper on all three nodes

~~~shell
zkServer.sh start
~~~
- Start HDFS and YARN first

~~~shell
start-dfs.sh
start-yarn.sh
~~~
- Start Flume

~~~shell
# Start Flume
flume-ng agent -n a1 -c conf -f /home/hadoop/exec-memory-hdfs.properties
~~~
Verify Flume

~~~shell
# (4) Append a large amount of data to the tail-test.txt file in the first Xshell window
echo 'hello 44444' >> tail-test.txt
echo 'hello 55555' >> tail-test.txt
echo 'hello 6666' >> tail-test.txt
# 2. In Xshell, use the command hdfs dfs -cat /flume/events/<directory name>/<file name> to see the data
~~~
- Go to the HDFS web monitoring page to see if data has been collected.
- You can see the newly generated directory
- In Xshell, use the command `hdfs dfs -cat /flume/events/<directory name>/<file name>` to see the data
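Because the directory names are generated from the current date and time, it can be easier (my suggestion) to list the collected files recursively first and then cat the exact path shown:

~~~shell
# Recursively list everything Flume wrote under /flume/events
hdfs dfs -ls -R /flume/events
~~~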