Big Data Technology: Sqoop

Chapter 1 Introduction to Sqoop

Sqoop is an open-source tool mainly used to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, ...). It can import data from a relational database (such as MySQL, Oracle, or Postgres) into Hadoop's HDFS, and it can also export HDFS data back into a relational database.
The Sqoop project started in 2009. It first existed as a third-party module of Hadoop. Later, to let users deploy it quickly and developers iterate faster, Sqoop became an independent Apache project.

The latest version of Sqoop2 is 1.99.7. Note that Sqoop2 is not compatible with Sqoop1, is not feature complete, and is not intended for production deployment.

Chapter 2 Principles of Sqoop

Sqoop works by translating import and export commands into MapReduce programs.
In the generated MapReduce job, it is mainly the InputFormat and OutputFormat that are customized.

Chapter 3 Sqoop Installation

The prerequisite for installing Sqoop is that you already have Java and Hadoop environments.
3.1 Download and unzip

  1. Download address: http://mirrors.hust.edu.cn/apache/sqoop/1.4.6/
  2. Upload the installation package sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz to the virtual machine
  3. Unzip the sqoop installation package to a specified directory, such as:
$ tar -zxf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/module/

3.2 Modify the configuration file
As in most big data frameworks, Sqoop's configuration files live in the conf directory under the sqoop root directory.

  1. Rename the configuration file
$ mv sqoop-env-template.sh sqoop-env.sh
  2. Modify the configuration file sqoop-env.sh:
export HADOOP_COMMON_HOME=/opt/module/hadoop-2.7.2
export HADOOP_MAPRED_HOME=/opt/module/hadoop-2.7.2
export HIVE_HOME=/opt/module/hive
export ZOOKEEPER_HOME=/opt/module/zookeeper-3.4.10
export ZOOCFGDIR=/opt/module/zookeeper-3.4.10
export HBASE_HOME=/opt/module/hbase

3.3 Copy JDBC driver
Copy the jdbc driver to the lib directory of sqoop, such as:

$ cp mysql-connector-java-5.1.27-bin.jar /opt/module/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/lib/

3.4 Verify Sqoop
We can verify whether the sqoop configuration is correct with the following command:

$ bin/sqoop help

Some warnings appear (omitted here), followed by the output of the help command:

Available commands:
  codegen Generate code to interact with database records
  create-hive-table Import a table definition into Hive
  eval Evaluate a SQL statement and display the results
  export Export an HDFS directory to a database table
  help List available commands
  import Import a table from a database to HDFS
  import-all-tables Import tables from a database to HDFS
  import-mainframe Import datasets from a mainframe server to HDFS
  job Work with saved jobs
  list-databases List available databases on a server
  list-tables List available tables in a database
  merge Merge results of incremental imports
  metastore Run a standalone Sqoop metastore
  version Display version information

3.5 Test whether Sqoop can successfully connect to the database

$ bin/sqoop list-databases --connect jdbc:mysql://hadoop102:3306/ --username root --password 000000

The following output appears:

information_schema
metastore
mysql
oozie
performance_schema

Chapter 4 Simple Use Cases of Sqoop

4.1 Import data
In Sqoop, "import" refers to transferring data from a non-big-data cluster (an RDBMS) to a big data cluster (HDFS, Hive, HBase); it uses the import keyword.
4.1.1 RDBMS to HDFS

  1. Make sure the MySQL service is running normally
  2. Create a new table in MySQL and insert some data
$ mysql -uroot -p000000
mysql> create database company;
mysql> create table company.staff(id int(4) primary key not null auto_increment, name varchar(255), sex varchar(255));
mysql> insert into company.staff(name, sex) values('Thomas', 'Male');
mysql> insert into company.staff(name, sex) values('Catalina', 'FeMale');
  3. Import data
    (1) Import all
$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--table staff \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"
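To quickly check what landed in HDFS, something like the following can be used (a minimal check; the part-m-* file name is the default MapReduce output naming and may differ in your setup):

$ hadoop fs -cat /user/company/part-m-*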
 (2) Query import
$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--query 'select name,sex from staff where id <=1 and $CONDITIONS;'

Tip: the WHERE clause must contain '$CONDITIONS'.
If the query is wrapped in double quotes, an escape character must be added before $CONDITIONS to prevent the shell from treating it as a shell variable.
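For example, the same query written with double quotes would be escaped like this (a sketch of the escaping described in the tip above):

$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--query "select name,sex from staff where id <=1 and \$CONDITIONS"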
(3) Import specified columns

$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--columns id,sex \
--table staff

Tip: If multiple columns are specified in --columns, separate them with commas and do not add spaces after the commas.
(4) Use sqoop keywords to filter the imported data

$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--table staff \
--where "id=1"

4.1.2 RDBMS to Hive

$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--table staff \
--num-mappers 1 \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table staff_hive

Tip: This process has two steps. First, the data is imported into HDFS; second, the data in HDFS is moved into the Hive warehouse. The default temporary directory for the first step is /user/atguigu/<table name>.
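If you want to control that temporary directory yourself, a variant like the following should work (a sketch, assuming --target-dir is honored as the staging location for the first step in this Sqoop version; /user/company/staff_staging is only an example path, and the data still ends up in the Hive warehouse):

$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--table staff \
--num-mappers 1 \
--target-dir /user/company/staff_staging \
--delete-target-dir \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table staff_hive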
4.1.3 RDBMS to Hbase

$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--table staff \
--columns "id,name,sex" \
--column-family "info" \
--hbase-create-table \
--hbase-row-key "id" \
--hbase-table "hbase_company" \
--num-mappers 1 \
--split-by id

Tip: Sqoop 1.4.6 can only automatically create HBase tables for HBase versions earlier than 1.0.1.
Workaround: create the HBase table manually

hbase> create 'hbase_company','info'

Scan the table in HBase to see the following content:

hbase> scan 'hbase_company'

4.2. Export data
In Sqoop, "export" refers to transferring data from a big data cluster (HDFS, Hive, HBase) to a non-big-data cluster (an RDBMS); it uses the export keyword.
4.2.1 HIVE/HDFS to RDBMS

$ bin/sqoop export \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--table staff \
--num-mappers 1 \
--export-dir /user/hive/warehouse/staff_hive \
--input-fields-terminated-by "\t"

Tip: If the target table does not exist in MySQL, it will not be created automatically.
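So the target table must be created in MySQL beforehand, for example (a sketch; staff2 is a hypothetical target table whose columns match the exported data):

mysql> create table company.staff2(id int(4) primary key not null auto_increment, name varchar(255), sex varchar(255));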
4.3 Script packaging
Package the sqoop command into an .opt file, then execute that file.

  1. Create a .opt file
$ mkdir opt
$ touch opt/job_HDFS2RDBMS.opt
  2. Write the sqoop script
$ vi opt/job_HDFS2RDBMS.opt
export
--connect
jdbc:mysql://hadoop102:3306/company
--username
root
--password
000000
--table
staff
--num-mappers
1
--export-dir
/user/hive/warehouse/staff_hive
--input-fields-terminated-by
"\t"
  3. Execute the script
$ bin/sqoop --options-file opt/job_HDFS2RDBMS.opt

Chapter 5 Some common commands and parameters of Sqoop

5.1 List of common commands
Here is a list of commonly used Sqoop commands for your reference. To learn more, refer to the source code of the corresponding class.


5.2 Detailed Explanation of Commands & Parameters
The previous section only listed some common Sqoop commands. Each command takes its own set of parameters; let us go through them one by one.
First, the common parameters, that is, the parameters supported by most commands.
5.2.1 Common parameters: database connection

5.2.2 Common parameters: import

5.2.3 Common parameters: export

5.2.4 Common parameters: hive

After the public parameters are introduced, let’s introduce the specific parameters corresponding to the commands according to the commands.
5.2.5 Command & parameter: import
Imports data from a relational database into HDFS (or into Hive or HBase). When importing into Hive, if the corresponding table does not exist, it will be created automatically.

  1. Command:
    For example: import data into Hive
$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--table staff \
--hive-import

For example: incrementally import data into hive, mode=append
append import:

$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--table staff \
--num-mappers 1 \
--fields-terminated-by "\t" \
--target-dir /user/hive/warehouse/staff_hive \
--check-column id \
--incremental append \
--last-value 3

Tip: append mode cannot be used together with --hive-* parameters (error: Append mode for hive imports is not yet supported. Please remove the parameter --append-mode).
For example: incrementally import data to hdfs, mode=lastmodified

First build a table in mysql and insert a few pieces of data:

mysql> create table company.staff_timestamp(id int(4), name varchar(255), sex varchar(255), last_modified timestamp DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP);
mysql> insert into company.staff_timestamp (id, name, sex) values(1, 'AAA', 'female');
mysql> insert into company.staff_timestamp (id, name, sex) values(2, 'BBB', 'female');

Import some data first:

$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--table staff_timestamp \
--delete-target-dir \
--m 1

Import some data incrementally:

mysql> insert into company.staff_timestamp (id, name, sex) values(3, 'CCC', 'female');
$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--table staff_timestamp \
--check-column last_modified \
--incremental lastmodified \
--last-value "2017-09-28 22:20:38" \
--m 1 \
--append

Tip: When importing data with the lastmodified mode, you must specify whether incremental data should be appended (--append) or merged (--merge-key).
Tip: The value specified by --last-value is included in the incrementally imported data (the lower bound is inclusive).
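For example, the same lastmodified import written with --merge-key instead of --append might look like this (a sketch; with --merge-key, rows whose id already exists in the target directory are updated rather than appended, and the job finishes with a merge step):

$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--table staff_timestamp \
--check-column last_modified \
--incremental lastmodified \
--last-value "2017-09-28 22:20:38" \
--m 1 \
--merge-key id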
  2. Parameters:


5.2.6 Command & parameter: export
Exports data from HDFS (including Hive and HBase) to a relational database.

  1. Command:
    For example:
$ bin/sqoop export \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--table staff \
--export-dir /user/company \
--input-fields-terminated-by "\t" \
--num-mappers 1
  2. Parameters:

5.2.7 Command & parameter: codegen
Maps a table in the relational database to a Java class, with a field corresponding to each column.
For example:
$ bin/sqoop codegen \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--table staff \
--bindir /home/admin/Desktop/staff \
--class-name Staff \
--fields-terminated-by "\t"



5.2.8 Command & parameter: create-hive-table
Generates a Hive table structure corresponding to a relational database table.
Command:
For example:

$ bin/sqoop create-hive-table \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--table staff \
--hive-table hive_staff

Parameters:

5.2.9 Command & parameter: eval
Quickly runs SQL statements against a relational database and prints the results to the console. It is often used to check that a SQL statement is correct and the data looks right before importing.
Command:
For example:

$ bin/sqoop eval \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--query "SELECT * FROM staff"


5.2.10 Command & parameter: import-all-tables
Imports all tables in the RDBMS into HDFS; each table gets its own HDFS directory.
Command:
For example:

$ bin/sqoop import-all-tables \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--warehouse-dir /all_tables
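After the job finishes, each table should appear as its own subdirectory under the warehouse directory; a quick way to check (a minimal sketch):

$ hadoop fs -ls /all_tables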

5.2.11 Command & parameter: job
Creates a saved sqoop job. The job is not executed when it is created; it must be run manually.
Command:
For example:

$ bin/sqoop job \
 --create myjob -- import-all-tables \
 --connect jdbc:mysql://hadoop102:3306/company \
 --username root \
 --password 000000
$ bin/sqoop job \
--list
$ bin/sqoop job \
--exec myjob

Tip: Note that there must be a space between import-all-tables and the -- to its left.
Tip: If you need to connect to a metastore, add --meta-connect jdbc:hsqldb:hsql://linux01:16000/sqoop
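For example, listing the jobs stored in a shared metastore might look like this (a sketch; the metastore URL is the one from the tip above):

$ bin/sqoop job \
--meta-connect jdbc:hsqldb:hsql://linux01:16000/sqoop \
--list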
Parameters:

Tip: By default, the database password must be entered manually when a job executes; this can be avoided with the following setting in sqoop-site.xml:

<property>
  <name>sqoop.metastore.client.record.password</name>
  <value>true</value>
  <description>If true, allow saved passwords in the metastore.</description>
</property>

5.2.12 Command & parameter: list-databases
Command:
For example:

$ bin/sqoop list-databases \
--connect jdbc:mysql://hadoop102:3306/ \
--username root \
--password 000000

Parameters: same as the common parameters
5.2.13 Command & parameter: list-tables
Command:
For example:

$ bin/sqoop list-tables \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000

Parameters: same as the common parameters
5.2.14 Command & parameter: merge
Merges data from different directories in HDFS into a specified directory.
Data environment:

new_staff
1 AAA male
2 BBB male
3 CCC male
4 DDD male
old_staff
1 AAA female
2 CCC female
3 BBB female
6 DDD female

Tip: The separator between columns in the data above should be \t, and the separator between rows should be \n. If you copy the data directly, please check this.
Command:
For example:
Create the JavaBean:

$ bin/sqoop codegen \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 000000 \
--table staff \
--bindir /home/admin/Desktop/staff \
--class-name Staff \
--fields-terminated-by "\t"

Start the merge:

$ bin/sqoop merge \
--new-data /test/new/ \
--onto /test/old/ \
--target-dir /test/merged \
--jar-file /home/admin/Desktop/staff/Staff.jar \
--class-name Staff \
--merge-key id

Result:
1 AAA MALE
2 BBB MALE
3 CCC MALE
4 DDD MALE
6 DDD FEMALE

Parameters:

5.2.15 Command & parameter: metastore
Records metadata about Sqoop jobs. If the metastore service is not started, job metadata is stored by default in ~/.sqoop; this location can be changed in sqoop-site.xml.
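For reference, the relevant sqoop-site.xml properties look roughly like this (a sketch; the property names follow the stock sqoop-site.xml template, and the location value is only an example path):

<property>
  <name>sqoop.metastore.server.location</name>
  <value>/tmp/sqoop-metastore/shared.db</value>
</property>
<property>
  <name>sqoop.metastore.server.port</name>
  <value>16000</value>
</property>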
Command:
For example: start the sqoop metastore service

$ bin/sqoop metastore

Parameters:

No.  Parameter    Description
1    --shutdown   Shut down the metastore service
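For example, stopping a running metastore from the same machine would look like this (based on the parameter above):

$ bin/sqoop metastore --shutdown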