ETL Tool DataX: SQL Server to HDFS


Table of Contents

1. DataX Overview

1.1 Introduction to DataX

1.2 DataX Framework

1.3 Functional Limitations

1.4 Supported Data Channels

2. Configuration example

2.1 Environmental information

2.2 SQLServer data synchronization to HDFS

2.3 Parameter Description


1. DataX Overview


1.1 Introduction to DataX


DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It provides efficient data synchronization between heterogeneous data sources such as MySQL, SQL Server, Oracle, PostgreSQL, HDFS, Hive, HBase, OTS, and ODPS.

GitHub – alibaba/DataX: DataX is the open source version of Alibaba Cloud DataWorks data integration.

1.2 DataX Framework


As a data synchronization framework, DataX abstracts the synchronization of different data sources into Reader plug-ins, which read data from a source, and Writer plug-ins, which write data to a target. In theory, the framework can support synchronization for any type of data source. The plug-in system also works as an ecosystem: every time a new data source is added, it can immediately interoperate with all existing data sources.

DataX itself is built on a Framework + plug-in architecture: reading and writing of data sources are abstracted into Reader/Writer plug-ins that are incorporated into the overall synchronization framework.

Role and function of each module:

  • Reader (acquisition module): collects data from the data source and sends it to the Framework.
  • Writer (writing module): continuously fetches data from the Framework and writes it to the destination.
  • Framework (transmission channel): connects Reader and Writer, serves as the data transmission channel between them, and handles core technical concerns such as buffering, flow control, concurrency, and data conversion.

HdfsWriter provides the ability to write TEXTFILE files and ORCFile files to the specified path in the HDFS file system, and the file contents can be associated with Hive tables.

1.3 Functional Limitations

  • Currently HdfsWriter only supports two file formats, textfile and orcfile, and the file content must be a logical two-dimensional table;
  • Since HDFS is a file system and has no concept of a schema, writing only a subset of columns is not supported;
  • Currently only the following Hive data types are supported:
    Numeric types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE
    String types: STRING, VARCHAR, CHAR
    Boolean type: BOOLEAN
    Date/time types: DATE, TIMESTAMP
    Not yet supported: decimal, binary, arrays, maps, structs, union types;
  • Hive partitioned tables currently only support writing to a single partition at a time;
  • For textfile, the user must ensure that the delimiter written to the HDFS file is identical to the delimiter used when the Hive table was created, so that the written HDFS data can be mapped to the Hive table's fields;

The HdfsWriter implementation process is as follows:

First, a temporary directory that does not yet exist is created on HDFS, named after the user-specified path plus a random suffix (path_random); the data being read is then written into files under this temporary directory; once everything has been written, those files are moved into the user-specified directory (file-name uniqueness is ensured when the files are created); finally, the temporary directory is deleted. If the connection to HDFS is lost partway through, for example because of a network interruption, the user must manually delete the already written files and the temporary directory.
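If a run is interrupted in this way, the leftover temporary directory can be removed by hand. Below is a minimal sketch, assuming the target path used later in this article (/winner/hadoop/winipva/wshop/tmp/) and the naming rule described above (target path plus a random suffix); the exact directory name is whatever the listing shows:

# List the parent of the target path; a leftover temporary directory shows up
# as the target path with a random suffix appended
hdfs dfs -ls /winner/hadoop/winipva/wshop/

# Remove the leftover temporary directory and any partially written files
# (substitute the suffix actually shown by the listing above)
hdfs dfs -rm -r -skipTrash /winner/hadoop/winipva/wshop/tmp_<random_suffix>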

1.4 Supported Data Channels


DataX currently has a fairly complete plug-in system: mainstream RDBMS databases, NoSQL stores, and big data computing systems are all supported. The currently supported data sources are listed below, grouped by type; for per-plug-in documentation, please see the DataX Data Source Reference Guide.

  • RDBMS relational databases: MySQL (read, write), Oracle (read, write), OceanBase (read, write), SQLServer (read, write), PostgreSQL (read, write), DRDS (read, write), Kingbase (read, write), Universal RDBMS (supports all relational databases) (read, write)
  • Alibaba Cloud data warehouse storage: ODPS (read, write), ADB (write), ADS (write), OSS (read, write), OCS (write), Hologres (write), AnalyticDB For PostgreSQL (write)
  • Alibaba Cloud middleware: datahub (read, write), SLS (read, write)
  • Graph databases: Alibaba Cloud GDB (read, write), Neo4j (write)
  • NoSQL data storage: OTS (read, write), Hbase0.94 (read, write), Hbase1.1 (read, write), Phoenix4.x (read, write), Phoenix5.x (read, write), MongoDB (read, write), Cassandra (read, write)
  • Data warehouse storage: StarRocks (read, write), ApacheDoris (write), ClickHouse (read, write), Databend (write), Hive (read, write), kudu (write), selectdb (write)
  • Unstructured data storage: TxtFile (read, write), FTP (read, write), HDFS (read, write), Elasticsearch (write)
  • Time series databases: OpenTSDB (read), TSDB (read, write), TDengine (read, write)
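Each supported source or target corresponds to a plug-in directory inside the DataX installation. A quick sanity check that the plug-ins used later in this article (sqlserverreader and hdfswriter) are installed is to list those directories; the install path below is the one used by the run script in section 2.2:

# List the reader and writer plug-ins shipped with the local DataX installation;
# sqlserverreader should appear under reader/ and hdfswriter under writer/
ls /hadoop/software/datax/plugin/reader
ls /hadoop/software/datax/plugin/writer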


2. Configuration example


2.1 Environmental information


The cluster is an HDP (Hortonworks Data Platform) cluster; the installed stack version can be confirmed on the cluster nodes as shown below.
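A minimal way to check the installed stack version on a node (assuming the standard HDP tooling is available):

# hdp-select reports the HDP stack versions installed on this node
hdp-select versions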

2.2 SQLServer data synchronization to HDFS


site_traffic_inout.json configuration file

[winner_hdp@hdp104 yd]$ cat site_traffic_inout.json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "sqlserverreader",
          "parameter": {
            "username": "$IPVA_WSHOP_USER",
            "password": "$IPVA_WSHOP_PASSWD",
            "connection": [
              {
                "jdbcUrl": ["$IPVA_URL"],
                "querySql": [
                  "SELECT StoreID,StoreName,SiteKey,SiteName from IPVA_WConfig.dbo.View_Site_Traffic_InOut;"
                ]
              }
            ]
          }
        },
        "writer": {
          "name": "hdfswriter",
          "parameter": {
            "column": [
              {"name":"StoreID" , "type":"string"},
              {"name":"StoreName" , "type":"string"},
              {"name":"SiteKey" , "type":"string"},
              {"name":"SiteName", "type":"string"}
            ],
            "path": "/winner/hadoop/winipva/wshop/tmp/",
            "defaultFS":"hdfs://winner",
            "encoding": "UTF-8",
            "fieldDelimiter": ",",
            "hadoopConfig":{
              "dfs.nameservices": "winner",
              "dfs.ha.namenodes.winner": "nn1,nn2",
              "dfs.namenode.rpc-address.winner.nn1": "hdp103:8020",
              "dfs.namenode.rpc-address.winner.nn2": "hdp104:8020",
              "dfs.client.failover.proxy.provider.winner": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
            },
            "haveKerberos": true,
            "kerberosKeytabFilePath": "/etc/security/keytabs/winner_hdp.keytab",
            "kerberosPrincipal": "[email protected]",
            "fileType": "text",
            "fileName": "000000",
            "writeMode": "nonConflict",
          }
        }
        
      }
    ],
    "setting": {
      "speed": {
        "channel": "5"
      },
      "errorLimit": {
        "record": 0
      }
    }
  }
}
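Before wiring the job into a script, it can be run once by hand; the $IPVA_* placeholders in the JSON are filled in through DataX's -p option (the datax.py path and connection values below are the same ones the run script further down uses):

# One-off manual run of the job, passing the placeholders via -p
python /hadoop/software/datax/bin/datax.py site_traffic_inout.json \
  -p "-DIPVA_WSHOP_USER=sa -DIPVA_WSHOP_PASSWD=123456 -DIPVA_URL=jdbc:sqlserver://192.168.2.103:1433;DatabaseName=IPVA_WConfig"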

run script

#!/bin/bash
#
#
# Script function: sqlServer data synchronization to HDFS
# Author: kangll
# Creation time: 2022-10-27
# Modification content: None
# Scheduling cycle:
# Script parameters: None
#
set -x
set -e
## date
date=$(date +%Y%m%d)


## json config file path
json_src=/hadoop/datadir/windeploy/script/ETL/datax_json/
## datax path
data_py=/hadoop/software/datax/bin/datax.py
## IPVA_WConfig database connection
IPVA_WSHOP_USER=sa
IPVA_WSHOP_PASSWD='123456'
IPVA_URL="jdbc:sqlserver://192.168.2.103:1433;DatabaseName=IPVA_WConfig"

###
Site_Traffic_InOut_Func() {
    ## Perform data synchronization
    python $data_py ${json_src}site_traffic_inout.json -p "-DIPVA_WSHOP_USER=${IPVA_WSHOP_USER} -DIPVA_WSHOP_PASSWD=${IPVA_WSHOP_PASSWD} -DIPVA_URL=${IPVA_URL} -Ddate=${date}"
}
######################## main ##############################
main(){
    Site_Traffic_InOut_Func
}
############################################################
###
main

Synchronization successful
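The script header leaves the scheduling cycle open; if the job is meant to run daily, a crontab entry along these lines would do (the time, script location, and log path are assumptions for illustration):

# Hypothetical crontab entry: run the sync every day at 01:30 and keep a log
30 1 * * * /hadoop/datadir/windeploy/script/ETL/datax_json/site_traffic_inout.sh >> /tmp/site_traffic_inout.log 2>&1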

2.3 Parameter Description


  • defaultFS

Description: the NameNode address of the Hadoop HDFS file system, in the format hdfs://ip:port, e.g. hdfs://127.0.0.1:9000 (with NameNode HA, the nameservice configured in hadoopConfig can be used instead, as in the example above: hdfs://winner). Required: yes. Default: none.

  • fileType

Description: the file type; currently only "text" or "orc" can be configured: text means the textfile format, orc means the orcfile format. Required: yes. Default: none.

  • fileName

Description: the file name HdfsWriter writes to. At run time a random suffix is appended to it to form the actual file name written by each thread. Required: yes. Default: none.

  • column

Description: the fields of the data to be written. Writing only a subset of columns is not supported; to map onto the Hive table, all field names and field types of the table must be specified, where name is the field name and type is the field type.

The column field information is configured as follows:

 "column":
          [
                     {
                         "name": "userName","type": "string"
                     },
                     {
                         "name": "age","type": "long"
                     }
          ]

Required: Yes

Default: None 
  • writeMode

Description: how hdfswriter handles existing data before writing. append: no pre-processing; DataX hdfswriter writes directly using the given fileName and ensures that file names do not conflict. nonConflict: if files with the fileName prefix already exist in the target directory, an error is reported. Required: yes. Default: none.

  • fieldDelimiter

Description: the field delimiter used when hdfswriter writes. The user must make sure it is identical to the field delimiter of the Hive table that was created, otherwise the data cannot be queried from the Hive table. Required: yes. Default: none.
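For the job in section 2.2, a Hive table that lines up with the writer's column list, comma delimiter, and target path might be declared as follows (the database and table names are made up for illustration):

# Hypothetical external Hive table matching the hdfswriter settings above:
# text format, comma-delimited, located at the writer's target path
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS wshop.site_traffic_inout (
  StoreID   STRING,
  StoreName STRING,
  SiteKey   STRING,
  SiteName  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/winner/hadoop/winipva/wshop/tmp/';
"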

  • compress

Description: the compression type for the HDFS files; leaving it empty means no compression. For text files, gzip and bzip2 are supported; for orc files, NONE and SNAPPY are supported (the user must have SnappyCodec installed). Required: no. Default: no compression.

  • hadoopConfig
Description: Some advanced parameters related to Hadoop can be configured in hadoopConfig, such as HA configuration.

 "hadoopConfig":{
         "dfs.nameservices": "testDfs",
         "dfs.ha.namenodes.testDfs": "namenode1,namenode2",
         "dfs.namenode.rpc-address.aliDfs.namenode1": "",
         "dfs.namenode.rpc-address.aliDfs.namenode2": "",
         "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
 }
  • encoding

Description: the character encoding of the written files. Required: no. Default: utf-8 (change with caution).

  • haveKerberos

Description: whether Kerberos authentication is enabled; the default is false. If set to true, the kerberosKeytabFilePath and kerberosPrincipal options become required. Required: no. Default: false.

  • kerberosKeytabFilePath

Description: absolute path to the Kerberos keytab file. Required: yes when haveKerberos is true, otherwise no. Default: none.

  • kerberosPrincipal

Description: the Kerberos principal name, such as xxxx/[email protected]. Required: yes when haveKerberos is true, otherwise no. Default: none.

"haveKerberos": true,
"kerberosKeytabFilePath": "/etc/security/keytabs/winner_hdp.keytab",
"kerberosPrincipal": "[email protected]",