ETL Tool DataX: SQL Server to HDFS


Table of Contents

1. DataX Overview

1.1 Introduction to DataX

1.2 DataX Framework

1.3 Functional Limitations

1.4 Supported Data Channels

2. Configuration example

2.1 Environmental information

2.2 SQLServer data synchronization to HDFS

2.3 Parameter Description


1. DataX Overview


1.1 Introduction to DataX


DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It provides efficient data synchronization between heterogeneous data sources such as MySQL, SQL Server, Oracle, PostgreSQL, HDFS, Hive, HBase, OTS, and ODPS.

GitHub – alibaba/DataX: DataX is the open source version of Alibaba Cloud DataWorks data integration.

1.2 DataX Framework


As a data synchronization framework, DataX abstracts the synchronization of different data sources into Reader plug-ins, which read data from a source, and Writer plug-ins, which write data to a target. In theory, the framework can support synchronization for any type of data source. The plug-in system also works as an ecosystem: every time a new data source is added, it can immediately interoperate with all existing data sources.

DataX itself is built on a Framework + plug-in architecture: reading and writing of data sources are abstracted into Reader/Writer plug-ins that are incorporated into the overall synchronization framework.

Role and function of each module:

  • Reader (acquisition module): collects data from the data source and sends it to the Framework.
  • Writer (writing module): continuously fetches data from the Framework and writes it to the destination.
  • Framework (transmission channel): connects Reader and Writer, serves as the data transmission channel between them, and handles core technical concerns such as buffering, flow control, concurrency, and data conversion.

HdfsWriter provides the ability to write TEXTFILE files and ORCFile files to the specified path in the HDFS file system, and the file contents can be associated with Hive tables.

1.3 Functional Limitations

  • Currently HdfsWriter only supports two file formats, textfile and orcfile, and the file content must be a logical two-dimensional table;
  • Since HDFS is a file system and has no concept of a schema, writing only a subset of columns is not supported;
  • Currently only the following Hive data types are supported:
    Numeric types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE
    String types: STRING, VARCHAR, CHAR
    Boolean type: BOOLEAN
    Date/time types: DATE, TIMESTAMP
    Not yet supported: decimal, binary, arrays, maps, structs, union types;
  • Hive partitioned tables currently only support writing to a single partition at a time;
  • For textfile, the user must ensure that the delimiter written to the HDFS file is identical to the delimiter used when the Hive table was created, so that the written HDFS data can be mapped to the Hive table's fields;

The HdfsWriter implementation process is as follows:

First, a temporary directory that does not yet exist is created on HDFS, named after the user-specified path plus a random suffix (path_random); the data being read is then written into files under this temporary directory; once everything has been written, those files are moved into the user-specified directory (file-name uniqueness is ensured when the files are created); finally, the temporary directory is deleted. If the connection to HDFS is lost partway through, for example because of a network interruption, the user must manually delete the already written files and the temporary directory.
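If a run is interrupted in this way, the leftover temporary directory can be removed by hand. Below is a minimal sketch, assuming the target path used later in this article (/winner/hadoop/winipva/wshop/tmp/) and the naming rule described above (target path plus a random suffix); the exact directory name is whatever the listing shows:

# List the parent of the target path; a leftover temporary directory shows up
# as the target path with a random suffix appended
hdfs dfs -ls /winner/hadoop/winipva/wshop/

# Remove the leftover temporary directory and any partially written files
# (substitute the suffix actually shown by the listing above)
hdfs dfs -rm -r -skipTrash /winner/hadoop/winipva/wshop/tmp_<random_suffix>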

1.4 Supported Data Channels


DataX currently has a fairly complete plug-in system: mainstream RDBMS databases, NoSQL stores, and big data computing systems are all supported. The currently supported data sources are listed below, grouped by type; for per-plug-in documentation, please see the DataX Data Source Reference Guide.

  • RDBMS relational databases: MySQL (read, write), Oracle (read, write), OceanBase (read, write), SQLServer (read, write), PostgreSQL (read, write), DRDS (read, write), Kingbase (read, write), Universal RDBMS (supports all relational databases) (read, write)
  • Alibaba Cloud data warehouse storage: ODPS (read, write), ADB (write), ADS (write), OSS (read, write), OCS (write), Hologres (write), AnalyticDB For PostgreSQL (write)
  • Alibaba Cloud middleware: datahub (read, write), SLS (read, write)
  • Graph databases: Alibaba Cloud GDB (read, write), Neo4j (write)
  • NoSQL data storage: OTS (read, write), Hbase0.94 (read, write), Hbase1.1 (read, write), Phoenix4.x (read, write), Phoenix5.x (read, write), MongoDB (read, write), Cassandra (read, write)
  • Data warehouse storage: StarRocks (read, write), ApacheDoris (write), ClickHouse (read, write), Databend (write), Hive (read, write), kudu (write), selectdb (write)
  • Unstructured data storage: TxtFile (read, write), FTP (read, write), HDFS (read, write), Elasticsearch (write)
  • Time series databases: OpenTSDB (read), TSDB (read, write), TDengine (read, write)
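Each supported source or target corresponds to a plug-in directory inside the DataX installation. A quick sanity check that the plug-ins used later in this article (sqlserverreader and hdfswriter) are installed is to list those directories; the install path below is the one used by the run script in section 2.2:

# List the reader and writer plug-ins shipped with the local DataX installation;
# sqlserverreader should appear under reader/ and hdfswriter under writer/
ls /hadoop/software/datax/plugin/reader
ls /hadoop/software/datax/plugin/writer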


2. Configuration example


2.1 Environmental information


The cluster is an HDP (Hortonworks Data Platform) cluster; the installed stack version can be confirmed on the cluster nodes as shown below.
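A minimal way to check the installed stack version on a node (assuming the standard HDP tooling is available):

# hdp-select reports the HDP stack versions installed on this node
hdp-select versions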

2.2 SQLServer data synchronization to HDFS


site_traffic_inout.json configuration file

[winner_hdp@hdp104 yd]$ cat site_traffic_inout.json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "sqlserverreader",
          "parameter": {
            "username": "$IPVA_WSHOP_USER",
            "password": "$IPVA_WSHOP_PASSWD",
            "connection": [
              {
                "jdbcUrl": ["$IPVA_URL"],
                "querySql": [
                  "SELECT StoreID,StoreName,SiteKey,SiteName from IPVA_WConfig.dbo.View_Site_Traffic_InOut;"
                ]
              }
            ]
          }
        },
        "writer": {
          "name": "hdfswriter",
          "parameter": {
            "column": [
              {"name":"StoreID" , "type":"string"},
              {"name":"StoreName" , "type":"string"},
              {"name":"SiteKey" , "type":"string"},
              {"name":"SiteName", "type":"string"}
            ],
            "path": "/winner/hadoop/winipva/wshop/tmp/",
            "defaultFS":"hdfs://winner",
            "encoding": "UTF-8",
            "fieldDelimiter": ",",
            "hadoopConfig":{
              "dfs.nameservices": "winner",
              "dfs.ha.namenodes.winner": "nn1,nn2",
              "dfs.namenode.rpc-address.winner.nn1": "hdp103:8020",
              "dfs.namenode.rpc-address.winner.nn2": "hdp104:8020",
              "dfs.client.failover.proxy.provider.winner": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
            },
            "haveKerberos": true,
            "kerberosKeytabFilePath": "/etc/security/keytabs/winner_hdp.keytab",
            "kerberosPrincipal": "[email protected]",
            "fileType": "text",
            "fileName": "000000",
            "writeMode": "nonConflict",
          }
        }
        
      }
    ],
    "setting": {
      "speed": {
        "channel": "5"
      },
      "errorLimit": {
        "record": 0
      }
    }
  }
}
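Before wiring the job into a script, it can be run once by hand; the $IPVA_* placeholders in the JSON are filled in through DataX's -p option (the datax.py path and connection values below are the same ones the run script further down uses):

# One-off manual run of the job, passing the placeholders via -p
python /hadoop/software/datax/bin/datax.py site_traffic_inout.json \
  -p "-DIPVA_WSHOP_USER=sa -DIPVA_WSHOP_PASSWD=123456 -DIPVA_URL=jdbc:sqlserver://192.168.2.103:1433;DatabaseName=IPVA_WConfig"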

run script

#!/bin/bash
#
#
# Script function: sqlServer data synchronization to HDFS
# Author: kangll
# Creation time: 2022-10-27
# Modification content: None
# Scheduling cycle:
# Script parameters: None
#
set -x
set -e
## date
date=$(date +%Y%m%d)


## json config file path
json_src=/hadoop/datadir/windeploy/script/ETL/datax_json/
## datax path
data_py=/hadoop/software/datax/bin/datax.py
## IPVA_WConfig database connection
IPVA_WSHOP_USER=sa
IPVA_WSHOP_PASSWD='123456'
IPVA_URL="jdbc:sqlserver://192.168.2.103:1433;DatabaseName=IPVA_WConfig"

###
Site_Traffic_InOut_Func() {
    ## Perform data synchronization
    python $data_py ${json_src}site_traffic_inout.json -p "-DIPVA_WSHOP_USER=${IPVA_WSHOP_USER} -DIPVA_WSHOP_PASSWD=${IPVA_WSHOP_PASSWD} -DIPVA_URL=${IPVA_URL} -Ddate=${date}"
}
######################## main ##############################
main(){
    Site_Traffic_InOut_Func
}
############################################################
###
main

Synchronization successful
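The script header leaves the scheduling cycle open; if the job is meant to run daily, a crontab entry along these lines would do (the time, script location, and log path are assumptions for illustration):

# Hypothetical crontab entry: run the sync every day at 01:30 and keep a log
30 1 * * * /hadoop/datadir/windeploy/script/ETL/datax_json/site_traffic_inout.sh >> /tmp/site_traffic_inout.log 2>&1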

2.3 Parameter Description


  • defaultFS

Description: the NameNode address of the Hadoop HDFS file system, in the format hdfs://ip:port, e.g. hdfs://127.0.0.1:9000 (with NameNode HA, the nameservice configured in hadoopConfig can be used instead, as in the example above: hdfs://winner). Required: yes. Default: none.

  • fileType

Description: the file type; currently only "text" or "orc" can be configured: text means the textfile format, orc means the orcfile format. Required: yes. Default: none.

  • fileName

Description: the file name HdfsWriter writes to. At run time a random suffix is appended to it to form the actual file name written by each thread. Required: yes. Default: none.

  • column

Description: the fields of the data to be written. Writing only a subset of columns is not supported; to map onto the Hive table, all field names and field types of the table must be specified, where name is the field name and type is the field type.

The column field information is configured as follows:

 "column":
          [
                     {
                         "name": "userName","type": "string"
                     },
                     {
                         "name": "age","type": "long"
                     }
          ]

Required: Yes

Default: None 
  • writeMode

Description: how hdfswriter handles existing data before writing. append: no pre-processing; DataX hdfswriter writes directly using the given fileName and ensures that file names do not conflict. nonConflict: if files with the fileName prefix already exist in the target directory, an error is reported. Required: yes. Default: none.

  • fieldDelimiter

Description: the field delimiter used when hdfswriter writes. The user must make sure it is identical to the field delimiter of the Hive table that was created, otherwise the data cannot be queried from the Hive table. Required: yes. Default: none.
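For the job in section 2.2, a Hive table that lines up with the writer's column list, comma delimiter, and target path might be declared as follows (the database and table names are made up for illustration):

# Hypothetical external Hive table matching the hdfswriter settings above:
# text format, comma-delimited, located at the writer's target path
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS wshop.site_traffic_inout (
  StoreID   STRING,
  StoreName STRING,
  SiteKey   STRING,
  SiteName  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/winner/hadoop/winipva/wshop/tmp/';
"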

  • compress

Description: the compression type for the HDFS files; leaving it empty means no compression. For text files, gzip and bzip2 are supported; for orc files, NONE and SNAPPY are supported (the user must have SnappyCodec installed). Required: no. Default: no compression.

  • hadoopConfig
Description: Some advanced parameters related to Hadoop can be configured in hadoopConfig, such as HA configuration.

 "hadoopConfig":{
         "dfs.nameservices": "testDfs",
         "dfs.ha.namenodes.testDfs": "namenode1,namenode2",
         "dfs.namenode.rpc-address.aliDfs.namenode1": "",
         "dfs.namenode.rpc-address.aliDfs.namenode2": "",
         "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
 }
  • encoding

Description: the character encoding of the written files. Required: no. Default: utf-8 (change with caution).

  • haveKerberos

Description: whether Kerberos authentication is enabled; the default is false. If set to true, the kerberosKeytabFilePath and kerberosPrincipal options become required. Required: no. Default: false.

  • kerberosKeytabFilePath

Description: absolute path to the Kerberos keytab file. Required: yes when haveKerberos is true, otherwise no. Default: none.

  • kerberosPrincipal

Description: the Kerberos principal name, such as xxxx/[email protected]. Required: yes when haveKerberos is true, otherwise no. Default: none.

"haveKerberos": true,
"kerberosKeytabFilePath": "/etc/security/keytabs/winner_hdp.keytab",
"kerberosPrincipal": "[email protected]",