Table of Contents
1. DataX Overview
1.1 Introduction to DataX
1.2 DataX Framework
1.3 Functional Limitations
1.4 Supported Data Channels
2. Configuration Example
2.1 Environment Information
2.2 SQLServer Data Synchronization to HDFS
2.3 Parameter Description
1. DataX Overview
1.1 Introduction to DataX
DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It provides efficient data synchronization between a wide range of heterogeneous data sources, including MySQL, SQL Server, Oracle, PostgreSQL, HDFS, Hive, HBase, OTS, ODPS, and more.
GitHub – alibaba/DataX: DataX is the open source version of Alibaba Cloud DataWorks data integration.
1.2 DataX Framework
As a data synchronization framework, DataX abstracts the synchronization of different data sources into a Reader plug-in that reads data from the source and a Writer plug-in that writes data to the target; in theory, the framework can therefore support data synchronization for any type of data source. At the same time, the DataX plug-in system works as an ecosystem: every time a new data source is added, it can immediately interoperate with all existing data sources.
DataX itself is built on a Framework + plugin architecture: reading from and writing to data sources are abstracted into Reader/Writer plug-ins that are incorporated into the overall synchronization framework.
| Role | Function |
| --- | --- |
| Reader (acquisition module) | Collects data from the data source and sends it to the Framework. |
| Writer (writing module) | Continuously fetches data from the Framework and writes it to the destination. |
| Framework (channel) | Connects Reader and Writer, serves as the data transmission channel between them, and handles core technical issues such as buffering, flow control, concurrency, and data conversion. |
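The three roles above can be sketched as a tiny in-process pipeline. This is an illustrative Python sketch, not DataX code: a bounded queue stands in for the Framework's channel, which is what provides the buffering and flow control described in the table.

```python
from queue import Queue
from threading import Thread

def reader(source_rows, channel):
    """Reader plug-in: pulls records from the source and pushes them into the Framework."""
    for row in source_rows:
        channel.put(row)          # blocks when the buffer is full -> flow control
    channel.put(None)             # end-of-data marker

def writer(channel, sink):
    """Writer plug-in: continuously fetches records from the Framework and writes them out."""
    while True:
        row = channel.get()
        if row is None:
            break
        sink.append(row)

def run_job(source_rows):
    channel = Queue(maxsize=4)    # the Framework's bounded buffer (the "channel")
    sink = []
    t_r = Thread(target=reader, args=(source_rows, channel))
    t_w = Thread(target=writer, args=(channel, sink))
    t_r.start(); t_w.start()
    t_r.join(); t_w.join()
    return sink

print(run_job([1, 2, 3]))  # -> [1, 2, 3]
```

With a single reader and a single writer sharing a FIFO queue, record order is preserved end to end; real DataX jobs run multiple channels concurrently.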
HdfsWriter provides the ability to write TEXTFILE and ORCFile files to a specified path in the HDFS file system, and the file contents can be associated with Hive tables.
1.3 Functional Limitations
- Currently HdfsWriter only supports two file formats, textfile and orcfile, and the file content must be organized as a logical two-dimensional table;
- Since HDFS is a file system and has no concept of a schema, writing only a subset of columns is not supported;
- Currently only the following Hive data types are supported:
  Numeric types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE
  String types: STRING, VARCHAR, CHAR
  Boolean type: BOOLEAN
  Time types: DATE, TIMESTAMP
  Not yet supported: decimal, binary, arrays, maps, structs, union types;
- For Hive partitioned tables, only writing to a single partition at a time is supported;
- For textfile, the user must ensure that the delimiter written to the HDFS file matches the delimiter used when the Hive table was created, so that the written HDFS data can be mapped to the Hive table's fields;
The HdfsWriter implementation process is as follows: first, based on the path specified by the user, create a temporary directory that does not yet exist on HDFS, named according to the rule path_random; then write the data files into this temporary directory; once everything has been written, move the files from the temporary directory into the user-specified directory (making sure file names do not collide); finally, delete the temporary directory. If a network interruption or similar failure prevents connecting to HDFS partway through, the user needs to manually delete the files and temporary directory that were already written.
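The write-to-temp-then-move protocol above can be simulated on a local filesystem. A hedged Python sketch, with `os`/`shutil` standing in for HDFS calls and the temp-directory name only approximating DataX's path_random rule:

```python
import os
import shutil
import uuid

def hdfswriter_commit(path, files):
    """Sketch of HdfsWriter's flow: write into a temporary 'path_random'
    directory, then move the finished files into the target path and clean up."""
    tmp_dir = f"{path.rstrip('/')}__{uuid.uuid4().hex}"   # the path_random temp dir
    os.makedirs(tmp_dir)
    try:
        # 1. write every file into the temporary directory first
        for name, content in files.items():
            with open(os.path.join(tmp_dir, name), "w") as f:
                f.write(content)
        # 2. after all writes succeed, move the files to the user-specified path
        os.makedirs(path, exist_ok=True)
        for name in files:
            shutil.move(os.path.join(tmp_dir, name), os.path.join(path, name))
    finally:
        # 3. delete the temporary directory; after a mid-run failure,
        #    leftover files may need manual cleanup (as the doc notes)
        shutil.rmtree(tmp_dir, ignore_errors=True)

hdfswriter_commit("/tmp/hdfswriter_demo", {"000000_abc": "1,foo\n2,bar\n"})
```

The point of the temp directory is atomicity at the directory level: readers of the target path never observe half-written files.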
1.4 Supported Data Channels
DataX currently has a fairly comprehensive plug-in system: mainstream RDBMS databases, NoSQL stores, and big data computing systems are all supported. The currently supported data channels are listed below; for details, see the DataX Data Source Reference Guide.
| Type | Data source | Reader | Writer | Documentation |
| --- | --- | --- | --- | --- |
| RDBMS (relational databases) | MySQL | √ | √ | read, write |
| | Oracle | √ | √ | read, write |
| | OceanBase | √ | √ | read, write |
| | SQLServer | √ | √ | read, write |
| | PostgreSQL | √ | √ | read, write |
| | DRDS | √ | √ | read, write |
| | Kingbase | √ | √ | read, write |
| | Universal RDBMS (supports all relational databases) | √ | √ | read, write |
| Alibaba Cloud data warehouse storage | ODPS | √ | √ | read, write |
| | ADB | | √ | write |
| | ADS | | √ | write |
| | OSS | √ | √ | read, write |
| | OCS | | √ | write |
| | Hologres | | √ | write |
| | AnalyticDB For PostgreSQL | | √ | write |
| Alibaba Cloud middleware | datahub | √ | √ | read, write |
| | SLS | √ | √ | read, write |
| Graph databases | Alibaba Cloud GDB | √ | √ | read, write |
| | Neo4j | | √ | write |
| NoSQL data storage | OTS | √ | √ | read, write |
| | Hbase0.94 | √ | √ | read, write |
| | Hbase1.1 | √ | √ | read, write |
| | Phoenix4.x | √ | √ | read, write |
| | Phoenix5.x | √ | √ | read, write |
| | MongoDB | √ | √ | read, write |
| | Cassandra | √ | √ | read, write |
| Data warehouse storage | StarRocks | √ | √ | read, write |
| | ApacheDoris | | √ | write |
| | ClickHouse | √ | √ | read, write |
| | Databend | | √ | write |
| | Hive | √ | √ | read, write |
| | kudu | | √ | write |
| | selectdb | | √ | write |
| Unstructured data storage | TxtFile | √ | √ | read, write |
| | FTP | √ | √ | read, write |
| | HDFS | √ | √ | read, write |
| | Elasticsearch | | √ | write |
| Time series databases | OpenTSDB | √ | | read |
| | TSDB | √ | √ | read, write |
| | TDengine | √ | √ | read, write |
2. Configuration Example
2.1 Environment Information
The cluster HDP version information is as follows:
2.2 SQLServer Data Synchronization to HDFS
The site_traffic_inout.json configuration file:
```json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "sqlserverreader",
                    "parameter": {
                        "username": "$IPVA_WSHOP_USER",
                        "password": "$IPVA_WSHOP_PASSWD",
                        "connection": [
                            {
                                "jdbcUrl": ["$IPVA_URL"],
                                "querySql": [
                                    "SELECT StoreID,StoreName,SiteKey,SiteName from IPVA_WConfig.dbo.View_Site_Traffic_InOut;"
                                ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {"name": "StoreID",   "type": "string"},
                            {"name": "StoreName", "type": "string"},
                            {"name": "SiteKey",   "type": "string"},
                            {"name": "SiteName",  "type": "string"}
                        ],
                        "path": "/winner/hadoop/winipva/wshop/tmp/",
                        "defaultFS": "hdfs://winner",
                        "encoding": "UTF-8",
                        "fieldDelimiter": ",",
                        "hadoopConfig": {
                            "dfs.nameservices": "winner",
                            "dfs.ha.namenodes.winner": "nn1,nn2",
                            "dfs.namenode.rpc-address.winner.nn1": "hdp103:8020",
                            "dfs.namenode.rpc-address.winner.nn2": "hdp104:8020",
                            "dfs.client.failover.proxy.provider.winner": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
                        },
                        "haveKerberos": true,
                        "kerberosKeytabFilePath": "/etc/security/keytabs/winner_hdp.keytab",
                        "kerberosPrincipal": "[email protected]",
                        "fileType": "text",
                        "fileName": "000000",
                        "writeMode": "nonConflict"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "5"
            },
            "errorLimit": {
                "record": 0
            }
        }
    }
}
```
Run script:

```bash
#!/bin/bash
#
# Script function:   SQL Server data synchronization to HDFS
# Author:            kangll
# Creation time:     2022-10-27
# Modification:      None
# Scheduling cycle:
# Script parameters: None
#
set -x
set -e

## date
date=$(date +%Y%m%d)
## json config file path
json_src=/hadoop/datadir/windeploy/script/ETL/datax_json/
## datax path
data_py=/hadoop/software/datax/bin/datax.py
## ipva Wshop_config database connection
IPVA_USER=sa
IPVA_PASSWD='123456'
IPVA_URL="jdbc:sqlserver://192.168.2.103:1433;DatabaseName=IPVA_WConfig"

###
Site_ReID_InOut_Func() {
    ## Perform data synchronization
    python $data_py ${json_src}site_reid_inout.json -p "-DIPVA_USER=${IPVA_USER} -DIPVA_PASSWD=${IPVA_PASSWD} -DIPVA_URL=${IPVA_URL} -Ddate=${date}"
}

######################## main #########################
main() {
    Site_ReID_InOut_Func
}

main
```
Synchronization successful
2.3 Parameter Description
- defaultFS
  - Description: the NameNode address of the Hadoop HDFS file system, in the format hdfs://ip:port, for example hdfs://127.0.0.1:9000.
  - Required: yes
  - Default: none
- fileType
  - Description: the file type; currently only "text" and "orc" are supported. text means textfile format, orc means orcfile format.
  - Required: yes
  - Default: none
- fileName
  - Description: the file name HdfsWriter uses when writing. During actual execution, a random suffix is appended to this name to form the actual file name written by each thread.
  - Required: yes
  - Default: none
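The random-suffix rule keeps concurrent writer threads from colliding on file names. A hypothetical sketch of such a naming scheme (the exact suffix format DataX uses may differ):

```python
import uuid

def actual_file_name(file_name: str) -> str:
    """Append a random suffix so concurrent writer threads never collide
    (illustrative; DataX's real suffix format is an implementation detail)."""
    return f"{file_name}__{uuid.uuid4().hex}"

# Five "threads" writing with fileName "000000" each get a distinct file:
names = {actual_file_name("000000") for _ in range(5)}
print(names)
```

This is also why writeMode's nonConflict check works on the fileName *prefix*: every thread's actual file name starts with the configured fileName.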
- column
  - Description: the fields to write. Writing only a subset of columns is not supported; to associate with a Hive table, you must specify every field name and field type in the table, where name is the field name and type is the field type. For example:

        "column": [
            { "name": "userName", "type": "string" },
            { "name": "age",      "type": "long" }
        ]

  - Required: yes
  - Default: none
- writeMode
  - Description: how hdfswriter cleans up existing data before writing.
    append: no pre-write processing; DataX hdfswriter writes directly using fileName and guarantees the file names do not conflict.
    nonConflict: if any file with the fileName prefix already exists in the directory, an error is reported.
  - Required: yes
  - Default: none
- fieldDelimiter
  - Description: the field delimiter used when hdfswriter writes. The user must ensure it matches the field delimiter of the Hive table that was created, otherwise the data cannot be queried from the Hive table.
  - Required: yes
  - Default: none
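A quick way to see why the delimiters must match: the text file only maps onto Hive columns if the table's ROW FORMAT uses the same character as fieldDelimiter. A minimal Python sketch (the column names follow the job file above; the Hive DDL in the comment is illustrative):

```python
# Hive side (illustrative DDL -- the delimiter here must equal fieldDelimiter):
#   CREATE TABLE view_site_traffic_inout (
#     StoreID STRING, StoreName STRING, SiteKey STRING, SiteName STRING
#   ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

FIELD_DELIMITER = ","   # the job's "fieldDelimiter" value

def to_text_line(record):
    """Serialize one record the way a fileType 'text' writer would."""
    return FIELD_DELIMITER.join(str(v) for v in record)

row = ("S001", "StoreA", "K01", "Gate-1")
print(to_text_line(row))   # S001,StoreA,K01,Gate-1
```

If the table were instead created with FIELDS TERMINATED BY '\t', Hive would read each comma-joined line as a single column and the remaining fields would come back NULL.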
- compress
  - Description: HDFS file compression type; empty by default, meaning no compression. text files support gzip and bzip2; orc files support NONE and SNAPPY (SnappyCodec must be installed).
  - Required: no
  - Default: no compression
- hadoopConfig
  - Description: advanced Hadoop-related parameters, such as HA configuration, can be set in hadoopConfig:

        "hadoopConfig": {
            "dfs.nameservices": "testDfs",
            "dfs.ha.namenodes.testDfs": "namenode1,namenode2",
            "dfs.namenode.rpc-address.aliDfs.namenode1": "",
            "dfs.namenode.rpc-address.aliDfs.namenode2": "",
            "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
        }

  - Required: no
  - Default: none
- encoding
  - Description: encoding used for the written files.
  - Required: no
  - Default: utf-8 (modify with caution)
- haveKerberos
  - Description: whether Kerberos authentication is enabled. If set to true, the kerberosKeytabFilePath and kerberosPrincipal items are required.
  - Required: no
  - Default: false
- kerberosKeytabFilePath
  - Description: absolute path to the keytab file for Kerberos authentication.
  - Required: no (required when haveKerberos is true)
  - Default: none
- kerberosPrincipal
  - Description: Kerberos principal name, such as xxxx/[email protected]
  - Required: no (required when haveKerberos is true)
  - Default: none

  For example:

        "haveKerberos": true,
        "kerberosKeytabFilePath": "/etc/security/keytabs/winner_hdp.keytab",
        "kerberosPrincipal": "[email protected]",