1. Software environment
1.1 Big data component environment
| Big Data Component | Version |
| --- | --- |
| Hive | 3.1.2 |
| Spark | spark-3.0.0-bin-hadoop3.2 |
1.2 Operating system environment
| OS | Version |
| --- | --- |
| macOS | Monterey 12.1 |
| Linux (CentOS) | 7.6 |
2. Construction of big data components
2.1 Hive environment construction
1) Hive on Spark description
Hive execution engines include: mr (the default), spark, and tez.
Hive on Spark: Hive stores the metadata and is also responsible for SQL parsing and optimization. The query language is HQL, and the execution engine is Spark, which runs the plan as RDD operations.
Spark on Hive: Hive only stores the metadata; Spark is responsible for SQL parsing and optimization. The query language is Spark SQL, and Spark executes the plan as RDD operations.
2) Hive on Spark configuration
(1) Compatibility description
Note: Hive 3.1.2 and Spark 3.0.0 as downloaded from the official websites are incompatible by default, because Hive 3.1.2 was built against Spark 2.4.5. We therefore need to recompile Hive 3.1.2.
Compilation steps: download the Hive 3.1.2 source code from the official website and change the Spark version referenced in the pom file to 3.0.0. If compilation passes, package directly and take the resulting jar. If errors are reported, follow the prompts and modify the affected methods until the build succeeds, then package and take the jar.
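The recompilation described above can be sketched as a few shell steps. This is a minimal outline, not a verified build recipe: the archive URL, the exact `spark.version` property name in the pom, and the Maven profile are assumptions that should be checked against the pom.xml of your Hive release.

```shell
# Sketch only: fetch the Hive 3.1.2 source and rebuild it against Spark 3.0.0.
# URL, property name, and profile below are assumptions; verify before running.
wget https://archive.apache.org/dist/hive/hive-3.1.2/apache-hive-3.1.2-src.tar.gz
tar -zxvf apache-hive-3.1.2-src.tar.gz
cd apache-hive-3.1.2-src

# Bump the Spark version referenced in the root pom.xml to 3.0.0
sed -i 's|<spark.version>.*</spark.version>|<spark.version>3.0.0</spark.version>|' pom.xml

# Package without running tests; if the build reports compile errors,
# fix the affected methods as the prompts indicate and rerun
mvn clean package -Pdist -DskipTests
```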
(2) Deploy Spark on the node where Hive is located
If Spark has been deployed before, this step can be skipped.
Download the Spark binary package from the official website:
http://spark.apache.org/downloads.html
Upload and unzip spark-3.0.0-bin-hadoop3.2.tgz:
[postman@cdh01 software]$ tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module/
[postman@cdh01 software]$ mv /opt/module/spark-3.0.0-bin-hadoop3.2 /opt/module/spark
(3) Configure the SPARK_HOME environment variable
[postman@cdh01 software]$ sudo vim /etc/profile.d/my_env.sh
Add the following content.
#SPARK_HOME
export SPARK_HOME=/opt/module/spark
export PATH=$PATH:$SPARK_HOME/bin
To make it take effect, source the environment variable file:
# For macOS
[postman@cdh01 software]$ source ~/.zshrc
# For CentOS
[postman@cdh01 software]$ source /etc/profile.d/my_env.sh
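After sourcing, a quick sanity check confirms the variable and the binary are visible (the expected paths below assume the /opt/module/spark location used above):

```shell
# Verify that SPARK_HOME is set and spark-submit is reachable on PATH
echo $SPARK_HOME         # expect /opt/module/spark
which spark-submit       # expect /opt/module/spark/bin/spark-submit
spark-submit --version   # prints the Spark 3.0.0 version banner if the install is intact
```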
(4) Create the Spark configuration file in Hive's conf directory
[postman@cdh01 software]$ vim /opt/module/hive/conf/spark-defaults.conf
Add the following content (when executing the task, it will be executed according to the following parameters).
spark.master                  yarn
spark.eventLog.enabled        true
spark.eventLog.dir            hdfs://cdh01:8020/spark-history
spark.executor.memory         1g
spark.driver.memory           1g
Create the following path in HDFS to store historical logs.
[postman@cdh01 software]$ hadoop fs -mkdir /spark-history
(5) Upload Spark's "pure" jar package (without Hadoop and Hive dependencies) to HDFS
- Note 1: The non-pure Spark 3.0.0 distribution bundles Hive 2.3.7 by default, so using it directly can cause compatibility issues with the installed Hive 3.1.2. The pure Spark jar package is used instead; it contains no Hadoop- or Hive-related dependencies, which avoids the conflict.
- Note 2: Hive tasks are ultimately executed by Spark, and Spark task resources are scheduled by Yarn, so a task may be assigned to any node in the cluster. The Spark dependencies therefore need to be uploaded to an HDFS path that every node in the cluster can read.
Upload and unzip spark-3.0.0-bin-without-hadoop.tgz
[postman@cdh01 software]$ tar -zxf /opt/software/spark-3.0.0-bin-without-hadoop.tgz
Upload Spark pure version jar package to HDFS
[postman@cdh01 software]$ hadoop fs -mkdir -p /spark-jars
[postman@cdh01 software]$ hadoop fs -put spark-3.0.0-bin-without-hadoop/jars/* /spark-jars
(6) Modify hive-site.xml file
[postman@cdh01 ~]$ vim /opt/module/hive/conf/hive-site.xml
Add the following content.
<!-- Spark dependency location (note: port 8020 must match the NameNode port) -->
<property>
    <name>spark.yarn.jars</name>
    <value>hdfs://cdh01:8020/spark-jars/*</value>
</property>
<!-- Hive execution engine -->
<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
</property>
(7) Modify the $SPARK_HOME/conf/spark-env.sh file
[postman@cdh01 ~]$ vim $SPARK_HOME/conf/spark-env.sh
Add the following content.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
Without this setting, Spark cannot locate the Hadoop jars, and exceptions about missing Hadoop dependencies are reported, such as log4j or Hadoop Configuration classes not being found.
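A quick way to see what that line resolves to (assuming the hadoop CLI is on PATH) is to run the same command substitution by hand; the output path below is illustrative, not verbatim:

```shell
# `hadoop classpath` prints the colon-separated list of Hadoop config dirs and
# jar globs that SPARK_DIST_CLASSPATH will prepend to Spark's own classpath.
hadoop classpath
# Illustrative output: /opt/module/hadoop-3.1.3/etc/hadoop:/opt/module/hadoop-3.1.3/share/hadoop/common/lib/*:...
```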
2.2 Hive on Spark test
(1) Start the Hive client
[postman@cdh01 hive]$ bin/hive
(2) Create a test table
hive (default)> create table user(id int, name string);
(3) Test the effect through insert
hive (default)> insert into table user values(1001,'zhangsan');
If the result is as follows, the configuration is successful.
hive (default)> insert into table user values(1001,'zhangsan');
Query ID = user_20231108165919_9908b655-96a7-4ccb-bb62-4dde28df9394
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1699425455296_0013
Kill Command = /opt/module/hadoop-3.1.3/bin/yarn application -kill application_1699425455296_0013
Hive on Spark Session Web UI URL: http://192.168.1.1:60145
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
2023-11-08 16:59:35,314 Stage-0_0: 0/1 Stage-1_0: 0/1
2023-11-08 16:59:37,331 Stage-0_0: 1/1 Finished Stage-1_0: 0/1
2023-11-08 16:59:39,363 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished
Spark job[0] finished successfully in 6.09 second(s)
Loading data to table default.user
OK
col1 col2
Time taken: 20.569 seconds
hive (default)> select * from user;
OK
user.id user.name
1001 zhangsan
3. Errors during installation
3.1 Zstd library file error under the M1 chip
When executing an MR-style SQL statement such as "insert into table user values(1001,'zhangsan');", the program hangs on the console for a long time without printing any error log. The log looks like this:
hive (default)> insert into table student values(1,'abc');
Query ID = davidliu_20231108163620_eb8fabe4-b615-4d12-9dba-56ead5946a98
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1699425455296_0010
Kill Command = /opt/module/hadoop-3.1.3/bin/yarn application -kill application_1699425455296_0010
Hive on Spark Session Web UI URL: http://192.168.154.240:56101
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
2023-11-08 16:36:36,031 Stage-0_0: 0/1 Stage-1_0: 0/1
2023-11-08 16:36:39,089 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
2023-11-08 16:36:42,148 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
2023-11-08 16:36:45,201 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
... (the same status line repeats every ~3 seconds) ...
2023-11-08 16:37:33,974 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1
Interrupting... Be patient, this might take some time.
Press Ctrl + C again to kill JVM
Exiting the JVM
Before cancelling the SQL with Ctrl + C, open the Yarn web UI to check the results of the program execution:
On the WebUI page, in the execution log of an MR task of one of the failed Applications, the following error was found:
Caused by: java.lang.UnsatisfiedLinkError: no zstd-jni in java.library.path
Unsupported OS/arch, cannot find /darwin/aarch64/libzstd-jni.dylib or load zstd-jni from system libraries.
Please try building from source the jar or providing libzstd-jni in your system.
    at java.lang.Runtime.loadLibrary0(Runtime.java:1011)
    at java.lang.System.loadLibrary(System.java:1657)
    at com.github.luben.zstd.util.Native.load(Native.java:85)
    at com.github.luben.zstd.util.Native.load(Native.java:55)
    at com.github.luben.zstd.Zstd.<clinit>(Zstd.java:13)
    at com.github.luben.zstd.Zstd.decompressedSize(Zstd.java:579)
At the same time, an error log about this was also found in the running log of Hadoop ResourceManager.
The log above shows that the zstd library (used for file compression) is not well supported under the M1 chip (darwin/aarch64). Searching the library paths used by Hive on Spark, the zstd jar was located among the Hive on Spark dependency jars uploaded to the HDFS path /spark-jars:
- zstd-jni-1.4.4-3.jar
After investigation, developers had previously reported this problem on the zstd GitHub project, and users there reported that it was fixed in version 1.4.9-1.
So download the jar package from the mvnrepository website:
- zstd-jni-1.4.9-1.jar
After that, delete the original zstd-jni-1.4.4-3.jar from the HDFS path hdfs://cdh01:8020/spark-jars/ and replace it with zstd-jni-1.4.9-1.jar (as shown in the picture above). After testing again, the problem was solved.
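The swap described above can be done with the HDFS CLI, along these lines (the local path of the downloaded jar is an assumption):

```shell
# Remove the broken zstd-jni build from the shared Spark jars directory on HDFS
hadoop fs -rm /spark-jars/zstd-jni-1.4.4-3.jar

# Upload the fixed build downloaded from mvnrepository (local path assumed)
hadoop fs -put /opt/software/zstd-jni-1.4.9-1.jar /spark-jars/

# Confirm the replacement took effect
hadoop fs -ls /spark-jars | grep zstd
```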