0 Introduction
Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a next-generation streaming data lake platform that brings core warehouse and database functionality directly to the data lake. Hudi provides tables, transactions, efficient upserts/deletes, advanced indexing, streaming ingestion services, data clustering/compaction optimizations, and concurrency control, all while keeping data in open-source file formats.
Not only is Apache Hudi well suited for streaming workloads, but it also allows for the creation of efficient incremental batch processing pipelines.
Apache Hudi can be easily used on any cloud storage platform. Hudi’s advanced performance optimizations make analytics workloads faster for any popular query engine, including Apache Spark, Flink, Presto, Trino, Hive, and more.
1 Environment preparation
1.1 Install Maven
Maven 3.3.1 or later is required.
Modify settings.xml to point Maven at the Aliyun mirror repository:
<!-- Add Alibaba Cloud mirror -->
<mirror>
  <id>nexus-aliyun</id>
  <mirrorOf>central</mirrorOf>
  <name>Nexus aliyun</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
1.2 Install JDK
JDK 8 or later is required.
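Before going further, it can help to sanity-check the toolchain from the shell (a sketch; it assumes both tools are already on the PATH):

```shell
# Print the Maven version (should be 3.3.1 or later); the same output
# also shows which Java installation Maven is running on.
mvn -v

# Print the JDK version (should be 1.8 or later).
java -version
```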
1.3 Git
Git is mainly used to obtain the Hudi source code.
1.4 Download the Hudi source code
https://github.com/apache/hudi/releases/
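The source can be fetched either as a release tarball from the page above or with Git. A sketch, assuming the 0.12.0 release used throughout this guide and its `release-0.12.0` tag name:

```shell
# Option 1: shallow-clone only the release tag (saves bandwidth).
git clone --depth 1 --branch release-0.12.0 https://github.com/apache/hudi.git hudi-0.12.0

# Option 2: download and unpack the source tarball for the same tag.
wget https://github.com/apache/hudi/archive/refs/tags/release-0.12.0.tar.gz
tar -zxvf release-0.12.0.tar.gz
```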
2 Hudi source code compilation
2.1 Modify pom file
Modify the hudi-0.12.0/pom.xml file
1) Add a repository to accelerate dependency downloads
<repository>
  <id>nexus-aliyun</id>
  <name>nexus-aliyun</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
  <releases>
    <enabled>true</enabled>
  </releases>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>
2) Modify the dependency component versions
Change the version numbers to match the components you need to adapt to. By default the Hadoop version is 2.10.1 and the Hive version is 2.3.1; change them to the versions you actually use.
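Instead of editing pom.xml, the component versions can usually be overridden on the Maven command line via the corresponding properties (a sketch; the `hadoop.version` and `hive.version` property names are assumed to match Hudi 0.12.0's pom):

```shell
# Override the default Hadoop/Hive versions at build time
# rather than editing pom.xml directly.
mvn clean package -DskipTests \
  -Dhadoop.version=3.1.1 \
  -Dhive.version=3.1.0
```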
2.2 Modify code
Hudi depends on Hadoop 2 by default. To be compatible with Hadoop 3, besides changing the version, the following file also needs a code change:
hudi-0.12.0/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieParquetDataBlock.java
The commonly applied fix is to pass a second argument (null) to the FSDataOutputStream constructor call in this file, because Hadoop 3 removed the single-argument constructor that Hadoop 2 provided. Without this change, the build fails due to compatibility differences between Hadoop 2.x and 3.x.
2.3 Install Kafka dependencies
Several Kafka dependencies need to be installed manually; otherwise the compilation fails with missing-dependency errors.
1) Download the jar packages
Download via URL: http://packages.confluent.io/archive/5.3/confluent-5.3.4-2.12.zip
After decompressing, find the following jar packages and upload them to the build server:
common-config-5.3.4.jar
common-utils-5.3.4.jar
kafka-avro-serializer-5.3.4.jar
kafka-schema-registry-client-5.3.4.jar
2) Install them into the local Maven repository
mvn install:install-file -DgroupId=io.confluent -DartifactId=common-config -Dversion=5.3.4 -Dpackaging=jar -Dfile=./common-config-5.3.4.jar
mvn install:install-file -DgroupId=io.confluent -DartifactId=common-utils -Dversion=5.3.4 -Dpackaging=jar -Dfile=./common-utils-5.3.4.jar
mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-avro-serializer -Dversion=5.3.4 -Dpackaging=jar -Dfile=./kafka-avro-serializer-5.3.4.jar
mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-schema-registry-client -Dversion=5.3.4 -Dpackaging=jar -Dfile=./kafka-schema-registry-client-5.3.4.jar
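Since the four install commands differ only in the artifact name, they can also be run as a loop (a sketch, assuming the jars sit in the current directory):

```shell
# Install each Confluent jar into the local Maven repository (~/.m2).
# The artifact names match the jar files extracted above.
for artifact in common-config common-utils kafka-avro-serializer kafka-schema-registry-client; do
  mvn install:install-file \
    -DgroupId=io.confluent \
    -DartifactId="${artifact}" \
    -Dversion=5.3.4 \
    -Dpackaging=jar \
    -Dfile="./${artifact}-5.3.4.jar"
done
```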
2.4 Resolving spark module dependency conflicts
When the Hive version is changed to 3.1.0, the Jetty it brings in is 9.3, while Hudi itself uses 9.4, which creates a dependency conflict.
1) Modify the pom file of hudi-spark-bundle, exclude the lower version of jetty, and add the version of jetty specified by hudi:
hudi-0.12.0/packaging/hudi-spark-bundle/pom.xml
At line 280, add the following content:
<!-- Hive -->
<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-service</artifactId>
  <version>${hive.version}</version>
  <scope>${spark.bundle.hive.scope}</scope>
  <exclusions>
    <exclusion>
      <artifactId>guava</artifactId>
      <groupId>com.google.guava</groupId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.pentaho</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-service-rpc</artifactId>
  <version>${hive.version}</version>
  <scope>${spark.bundle.hive.scope}</scope>
</dependency>
<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>${hive.version}</version>
  <scope>${spark.bundle.hive.scope}</scope>
  <exclusions>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet.jsp</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-metastore</artifactId>
  <version>${hive.version}</version>
  <scope>${spark.bundle.hive.scope}</scope>
  <exclusions>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.datanucleus</groupId>
      <artifactId>datanucleus-core</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet.jsp</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <artifactId>guava</artifactId>
      <groupId>com.google.guava</groupId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-common</artifactId>
  <version>${hive.version}</version>
  <scope>${spark.bundle.hive.scope}</scope>
  <exclusions>
    <exclusion>
      <groupId>org.eclipse.jetty.orbit</groupId>
      <artifactId>javax.servlet</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<!-- Add hudi configuration version of jetty -->
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-server</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-util</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-webapp</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-http</artifactId>
  <version>${jetty.version}</version>
</dependency>
The exclusions above are added to the hive-service, hive-jdbc, hive-metastore, and hive-common dependencies; the Jetty dependencies are appended last.
2) Modify the pom file of hudi-utilities-bundle, exclude the lower version of jetty, and add the version of jetty specified by hudi:
hudi-0.12.0/packaging/hudi-utilities-bundle/pom.xml
<!-- Hoodie -->
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-common</artifactId>
  <version>${project.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-client-common</artifactId>
  <version>${project.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<!-- Hive -->
<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-service</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>
  <exclusions>
    <exclusion>
      <artifactId>servlet-api</artifactId>
      <groupId>javax.servlet</groupId>
    </exclusion>
    <exclusion>
      <artifactId>guava</artifactId>
      <groupId>com.google.guava</groupId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.pentaho</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-service-rpc</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>
</dependency>
<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>
  <exclusions>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet.jsp</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-metastore</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>
  <exclusions>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.datanucleus</groupId>
      <artifactId>datanucleus-core</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet.jsp</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <artifactId>guava</artifactId>
      <groupId>com.google.guava</groupId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-common</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>
  <exclusions>
    <exclusion>
      <groupId>org.eclipse.jetty.orbit</groupId>
      <artifactId>javax.servlet</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<!-- Add hudi configuration version of jetty -->
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-server</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-util</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-webapp</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-http</artifactId>
  <version>${jetty.version}</version>
</dependency>
Again, the exclusions are added to hive-service, hive-jdbc, hive-metastore, and hive-common; the Jetty dependencies are the newly added content at the end.
2.5 Execute the compilation
mvn clean package -DskipTests -Dspark3.2 -Dflink1.13 -Dscala-2.12 -Dhadoop.version=3.1.1 -Pflink-bundle-shade-hive3
Note: the Spark and Flink component versions and the Scala version are specified here with -D parameters, and the Hive 3 Flink bundle profile is enabled with -P.
3 Verification
After compilation succeeds, being able to launch hudi-cli indicates the build works.
After compilation completes, the bundle jars are located in each module's target directory under the packaging directory.
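For example (a sketch; the script and jar locations assume the Hudi 0.12.0 source layout):

```shell
# From the source tree root: list the bundle jars produced by the build.
cd hudi-0.12.0
ls packaging/*/target/*.jar

# Launch the Hudi CLI; reaching its interactive prompt indicates
# the compilation succeeded.
hudi-cli/hudi-cli.sh
```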