1. Software environment
1.1 Big data component environment
| Big Data Component | Version |
| --- | --- |
| Hive | 3.1.2 |
| Spark | spark-3.0.0-bin-hadoop3.2 |
1.2 Operating system environment
| OS | Version |
| --- | --- |
| macOS | Monterey 12.1 |
| Linux (CentOS) | 7.6 |
2. Construction of big data components
2.1 Hive environment construction
1) Hive on Spark description
Hive execution engines include: mr (the default), spark, and tez.
Hive on Spark: Hive stores the metadata and is also responsible for SQL parsing and optimization. The query language is HQL, and the execution engine is Spark, which runs the plan as RDD operations.
Spark on Hive: Hive only stores the metadata; Spark is responsible for SQL parsing and optimization. The query language is Spark SQL, and Spark executes the plan as RDD operations.
2) Hive on Spark configuration
(1) Compatibility description
Note: Hive 3.1.2 and Spark 3.0.0 as downloaded from the official websites are incompatible by default, because Hive 3.1.2 was built against Spark 2.4.5. We therefore need to recompile Hive 3.1.2.
Compilation steps: download the Hive 3.1.2 source code from the official website and change the Spark version referenced in the pom file to 3.0.0. If compilation passes, package directly and take the resulting jar. If errors are reported, follow the prompts and modify the affected methods until the build succeeds, then package and take the jar.
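The recompilation described above can be sketched as a few shell steps. This is a minimal outline, not a verified build recipe: the archive URL, the exact `spark.version` property name in the pom, and the Maven profile are assumptions that should be checked against the pom.xml of your Hive release.

```shell
# Sketch only: fetch the Hive 3.1.2 source and rebuild it against Spark 3.0.0.
# URL, property name, and profile below are assumptions; verify before running.
wget https://archive.apache.org/dist/hive/hive-3.1.2/apache-hive-3.1.2-src.tar.gz
tar -zxvf apache-hive-3.1.2-src.tar.gz
cd apache-hive-3.1.2-src

# Bump the Spark version referenced in the root pom.xml to 3.0.0
sed -i 's|<spark.version>.*</spark.version>|<spark.version>3.0.0</spark.version>|' pom.xml

# Package without running tests; if the build reports compile errors,
# fix the affected methods as the prompts indicate and rerun
mvn clean package -Pdist -DskipTests
```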
(2) Deploy Spark on the node where Hive is located
If Spark has been deployed before, this step can be skipped.
Download the Spark binary package from the official website:
http://spark.apache.org/downloads.html
Upload and unzip spark-3.0.0-bin-hadoop3.2.tgz:
[postman@cdh01 software]$ tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module/
[postman@cdh01 software]$ mv /opt/module/spark-3.0.0-bin-hadoop3.2 /opt/module/spark
(3) Configure the SPARK_HOME environment variable
[postman@cdh01 software]$ sudo vim /etc/profile.d/my_env.sh
Add the following content.
#SPARK_HOME
export SPARK_HOME=/opt/module/spark
export PATH=$PATH:$SPARK_HOME/bin
To make it take effect, source the environment variable file:
# For macOS
[postman@cdh01 software]$ source ~/.zshrc
# For CentOS
[postman@cdh01 software]$ source /etc/profile.d/my_env.sh
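After sourcing, a quick sanity check confirms the variable and the binary are visible (the expected paths below assume the /opt/module/spark location used above):

```shell
# Verify that SPARK_HOME is set and spark-submit is reachable on PATH
echo $SPARK_HOME         # expect /opt/module/spark
which spark-submit       # expect /opt/module/spark/bin/spark-submit
spark-submit --version   # prints the Spark 3.0.0 version banner if the install is intact
```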
(4) Create the Spark configuration file in Hive's conf directory
[postman@cdh01 software]$ vim /opt/module/hive/conf/spark-defaults.conf
Add the following content (when executing the task, it will be executed according to the following parameters).
spark.master                  yarn
spark.eventLog.enabled        true
spark.eventLog.dir            hdfs://cdh01:8020/spark-history
spark.executor.memory         1g
spark.driver.memory           1g
Create the following path in HDFS to store historical logs.
[postman@cdh01 software]$ hadoop fs -mkdir /spark-history
(5) Upload Spark's "pure" jar package (without Hadoop and Hive dependencies) to HDFS
- Note 1: The non-pure Spark 3.0.0 distribution bundles Hive 2.3.7 by default, so using it directly can cause compatibility issues with the installed Hive 3.1.2. The pure Spark jar package is used instead; it contains no Hadoop- or Hive-related dependencies, which avoids the conflict.
- Note 2: Hive tasks are ultimately executed by Spark, and Spark task resources are scheduled by Yarn, so a task may be assigned to any node in the cluster. The Spark dependencies therefore need to be uploaded to an HDFS path that every node in the cluster can read.
Upload and unzip spark-3.0.0-bin-without-hadoop.tgz
[postman@cdh01 software]$ tar -zxf /opt/software/spark-3.0.0-bin-without-hadoop.tgz
Upload Spark pure version jar package to HDFS
[postman@cdh01 software]$ hadoop fs -mkdir -p /spark-jars
[postman@cdh01 software]$ hadoop fs -put spark-3.0.0-bin-without-hadoop/jars/* /spark-jars
(6) Modify hive-site.xml file
[postman@cdh01 ~]$ vim /opt/module/hive/conf/hive-site.xml
Add the following content.
<!-- Spark dependency location (note: port 8020 must match the NameNode port) -->
<property>
    <name>spark.yarn.jars</name>
    <value>hdfs://cdh01:8020/spark-jars/*</value>
</property>
<!-- Hive execution engine -->
<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
</property>
(7) Modify the $SPARK_HOME/conf/spark-env.sh file
[postman@cdh01 ~]$ vim $SPARK_HOME/conf/spark-env.sh
Add the following content.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
Without this setting, Spark cannot locate the Hadoop jars, and exceptions about missing Hadoop dependencies are reported, such as log4j or Hadoop Configuration classes not being found.
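A quick way to see what that line resolves to (assuming the hadoop CLI is on PATH) is to run the same command substitution by hand; the output path below is illustrative, not verbatim:

```shell
# `hadoop classpath` prints the colon-separated list of Hadoop config dirs and
# jar globs that SPARK_DIST_CLASSPATH will prepend to Spark's own classpath.
hadoop classpath
# Illustrative output: /opt/module/hadoop-3.1.3/etc/hadoop:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/*:...
```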
2.2 Hive on Spark test
(1) Start the Hive client
[postman@cdh01 hive]$ bin/hive
(2) Create a test table
hive (default)> create table user(id int, name string);
(3) Test the effect through insert
hive (default)> insert into table user values(1001,'zhangsan');
If the result is as follows, the configuration is successful.
hive (default)> insert into table user values(1001,'zhangsan');
Query ID = user_20231108165919_9908b655-96a7-4ccb-bb62-4dde28df9394
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1699425455296_0013
Kill Command = /opt/module/hadoop-3.1.3/bin/yarn application -kill application_1699425455296_0013
Hive on Spark Session Web UI URL: http://192.168.1.1:60145
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
2023-11-08 16:59:35,314 Stage-0_0: 0/1 Stage-1_0: 0/1
2023-11-08 16:59:37,331 Stage-0_0: 1/1 Finished Stage-1_0: 0/1
2023-11-08 16:59:39,363 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
Spark job[0] finished successfully in 6.09 second(s)
Loading data to table default.user
OK
col1 col2
Time taken: 20.569 seconds
hive (default)> select * from user;
OK
user.id user.name
1001 zhangsan
3. Errors during installation
3.1 Zstd library file error under the M1 chip
When executing an MR-style SQL statement such as "insert into table user values(1001,'zhangsan');", the program hangs on the console for a long time without printing any error log. The log looks like this:
hive (default)> insert into table student values(1,'abc');
Query ID = davidliu_20231108163620_eb8fabe4-b615-4d12-9dba-56ead5946a98
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1699425455296_0010
Kill Command = /opt/module/hadoop-3.1.3/bin/yarn application -kill application_1699425455296_0010
Hive on Spark Session Web UI URL: http://192.168.154.240:56101
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
2023-11-08 16:36:36,031 Stage-0_0: 0/1 Stage-1_0: 0/1
2023-11-08 16:36:39,089 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
2023-11-08 16:36:42,148 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
2023-11-08 16:36:45,201 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
... (the same status line repeats every ~3 seconds) ...
2023-11-08 16:37:33,974 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
Interrupting... Be patient, this might take some time.
Press Ctrl + C again to kill JVM
Exiting the JVM
Before cancelling the SQL with Ctrl + C, open the Yarn web UI to check the results of the program execution:
On the WebUI page, in the execution log of an MR task of one of the failed Applications, the following error was found:
Caused by: java.lang.UnsatisfiedLinkError: no zstd-jni in java.library.path
Unsupported OS/arch, cannot find /darwin/aarch64/libzstd-jni.dylib or load zstd-jni from system libraries.
Please try building from source the jar or providing libzstd-jni in your system.
    at java.lang.Runtime.loadLibrary0(Runtime.java:1011)
    at java.lang.System.loadLibrary(System.java:1657)
    at com.github.luben.zstd.util.Native.load(Native.java:85)
    at com.github.luben.zstd.util.Native.load(Native.java:55)
    at com.github.luben.zstd.Zstd.<clinit>(Zstd.java:13)
    at com.github.luben.zstd.Zstd.decompressedSize(Zstd.java:579)
At the same time, an error log about this was also found in the running log of Hadoop ResourceManager.
The log above shows that the zstd library (used for file compression) is not well supported under the M1 chip (darwin/aarch64). Searching the library paths used by Hive on Spark, the zstd jar was located among the Hive on Spark dependency jars uploaded to the HDFS path /spark-jars:
- zstd-jni-1.4.4-3.jar
After investigation, developers had previously reported this problem on the zstd GitHub project, and users there reported that it was fixed in version 1.4.9-1.
So download the jar package from the mvnrepository website:
- zstd-jni-1.4.9-1.jar
After that, delete the original zstd-jni-1.4.4-3.jar from the HDFS path hdfs://cdh01:8020/spark-jars/ and replace it with zstd-jni-1.4.9-1.jar (as shown in the picture above). After testing again, the problem was solved.
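The swap described above can be done with the HDFS CLI, along these lines (the local path of the downloaded jar is an assumption):

```shell
# Remove the broken zstd-jni build from the shared Spark jars directory on HDFS
hadoop fs -rm /spark-jars/zstd-jni-1.4.4-3.jar

# Upload the fixed build downloaded from mvnrepository (local path assumed)
hadoop fs -put /opt/software/zstd-jni-1.4.9-1.jar /spark-jars/

# Confirm the replacement took effect
hadoop fs -ls /spark-jars | grep zstd
```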