Compile and install Hue for Spark SQL (Hue + Livy + Spark SQL + PySpark)

1. Foreword

  • This article walks through compiling Hue and adapting it to Livy + Spark. By combining Hue, Livy, and Spark SQL, you can write and execute SQL queries in a friendly web interface and run them on a remote Spark cluster.

1. Introduction to Hue

  • Hue (Hadoop User Experience) is an open-source Apache Hadoop UI system that evolved from Cloudera Desktop; Cloudera eventually contributed it to the Hadoop community of the Apache Foundation. It is implemented on top of the Python web framework Django. Hue provides a graphical user interface for the KMR cluster, making it convenient for users to configure, use, and inspect the cluster. See the Hue official website and GitHub. Its main capabilities include:
  1. HDFS access and file browsing
  2. Debugging and developing Hive queries, with results displayed in the web UI
  3. Solr queries, result display, and report generation
  4. Debugging and developing Impala interactive SQL queries via the web UI
  5. Spark debugging and development
  6. Pig development and debugging
  7. Development, monitoring, and workflow coordination/scheduling of Oozie tasks
  8. HBase data query, modification, and display
  9. Hive metadata (metastore) queries
  10. MapReduce task progress view and log tracking
  11. Creating and submitting MapReduce, Streaming, and Java jobs
  12. Sqoop2 development and debugging
  13. Browsing and editing ZooKeeper
  14. Querying and displaying relational databases (MySQL, PostgreSQL, SQLite, Oracle)

2. Introduction to livy

  • Apache Livy is a service that makes it easy to interact with a Spark cluster through a REST interface. It can submit Spark jobs or snippets of Spark code, retrieve results synchronously or asynchronously, and manage Spark Context lifecycles. Livy also simplifies the interaction between Spark and application servers, which makes Spark usable from interactive web and mobile applications. See the Livy official website and GitHub.
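
For a feel of what this looks like, the sketch below drives Livy's REST API directly with curl (assuming a Livy server at http://192.168.37.160:8998, the address used later in this article; session and statement ids start at 0):

# Create an interactive PySpark session
curl -s -X POST -H 'Content-Type: application/json' \
     -d '{"kind": "pyspark"}' http://192.168.37.160:8998/sessions

# Poll the session until its state becomes "idle"
curl -s http://192.168.37.160:8998/sessions/0

# Submit a code snippet to the running session
curl -s -X POST -H 'Content-Type: application/json' \
     -d '{"code": "1 + 1"}' http://192.168.37.160:8998/sessions/0/statements

# Fetch the statement result
curl -s http://192.168.37.160:8998/sessions/0/statements/0

# Delete the session when finished
curl -s -X DELETE http://192.168.37.160:8998/sessions/0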

2. Compile and install Hue

Name    Version
Hue     4.10
Livy    0.8
Spark   3.2.1

Since the compilation is cumbersome and error-prone, the compiled packages have been uploaded to a network disk; pick them up if you need them.
Link: https://pan.baidu.com/s/1BOsnKKwKmTohSGbzi-3QpQ
Extraction code: C3hp

1. Install dependency packages

Hue depends on a number of system packages that must be installed before building it.

yum install \
ant \
asciidoc \
cyrus-sasl-devel \
cyrus-sasl-gssapi \
cyrus-sasl-plain \
gcc \
gcc-c++ \
krb5-devel \
libffi-devel \
libxml2-devel \
libxslt-devel \
make \
mysql \
mysql-devel \
openldap-devel \
python-devel \
sqlite-devel \
gmp-devel \
npm

2. Compile Hue

Go to the Hue source directory and run make.

  • make apps
    This builds everything in place inside the current source tree. If the installation is later migrated, the entire source tree has to be shipped along with it, which leaves a large amount of redundant source code in the installation package.

  • PREFIX=/data/hue-4.10 make install
    If you need to migrate the installation afterwards, pay close attention to the installation path: some of Hue's dependencies are referenced by absolute path, so try to keep the path unchanged when moving the installation.

  • After a successful build, Hue can be started on the build machine. To deploy on other machines, pack and compress the /data/hue-4.10 directory, copy it to the target machine, and again try to keep the path unchanged.

  • After the build completes, run echo $? to check the exit status; 0 means the build succeeded. For example:
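
A minimal end-to-end sketch of the build-and-deploy flow (paths follow this article; target-host is a hypothetical machine name):

cd hue-4.10                           # the unpacked Hue 4.10 source directory
PREFIX=/data/hue-4.10 make install
echo $?                               # 0 means the build succeeded

# To deploy elsewhere, pack the install directory, copy it over,
# and unpack it at the same absolute path on the target machine
tar -czf hue-4.10.tar.gz -C /data hue-4.10
scp hue-4.10.tar.gz target-host:/data/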


3. Problems encountered

If a compilation attempt fails, run make clean before recompiling, so that residue from the failed build does not affect the next attempt.

1. npm download fails


Solution: npm's default registry is a foreign address, so downloads are slow or fail outright. Switching npm to the domestic Taobao mirror solves this:

npm config set proxy null                                 # clear the proxy
npm cache clean --force                                   # clear the cache
npm config set registry https://registry.npm.taobao.org   # switch to the Taobao mirror
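
To confirm the change took effect, print the configured registry back:

npm config get registry   # should now print the Taobao mirror URL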

2. nodejs version problem

  1. This happens because Node.js 18 and above depend on packages tied to newer underlying operating systems. When installing Node.js, it is therefore recommended to use a version below 18, as sketched below.
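
A minimal sketch of pinning an older Node.js with nvm (assuming nvm is installed; any version below 18 will do):

node --version    # check what is currently installed
nvm install 16    # install and switch to Node.js 16
nvm use 16
node --version    # should now report v16.x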

3. Configure Hue

  • Unpack the Hue package built in the previous step to /data/hue-4.10/ on the target machine, then configure /data/hue-4.10/hue/desktop/conf/pseudo-distributed.ini
  1. Copy the template to create a configuration file
cp /data/hue-4.10/hue/desktop/conf/pseudo-distributed.ini.tmpl /data/hue-4.10/hue/desktop/conf/pseudo-distributed.ini
  2. Edit the configuration file. The complete file is shown below; it covers MySQL metadata synchronization, HDFS file browsing integration, Livy integration, Hive integration, Impala integration, YARN integration, LDAP, and so on.
vim /data/hue-4.10/hue/desktop/conf/pseudo-distributed.ini

[desktop]
secret_key=ZT0kMfPMbzRaHBx
http_host=0.0.0.0
http_port=8887
time_zone=Asia/Shanghai
app_blacklist=pig, zookeeper, hbase, oozie, indexer, jobbrowser, rdbms, jobsub, sqoop, metastore
django_debug_mode=false
http_500_debug_mode=false
cherrypy_server_threads=50
default_site_encoding=utf
collect_usage=false
enable_prometheus=true
[[django_admins]]
[[custom]]
[[auth]]
backend=desktop.auth.backend.LdapBackend
idle_session_timeout=-1
[[[jwt]]]
[[ldap]]
ldap_url=ldap://ldap.cdh.com
ldap_username_pattern="uid=<username>,ou=People,dc=server,dc=com"
use_start_tls=false
search_bind_authentication=false
create_users_on_login=true
base_dn="ou=People,dc=server,dc=com"
bind_dn="cn=Manager,dc=server,dc=com"
bind_password="eFRrECKfQfoOB25"
[[[users]]]
[[[groups]]]
[[[ldap_servers]]]
[[vcs]]
[[database]]
engine=mysql
host=192.168.37.100
port=3306
user=hue
password=6a3ZsJtNs8SSCLe
name=hue_10
[[session]]
[[smtp]]
host=localhost
port=25
user=
password=
tls=no
[[knox]]
[[kerberos]]
[[oauth]]
[[oidc]]
[[metrics]]
[[slack]]
[[tracing]]
[[task_server]]
[[gc_accounts]]
[[[default]]]
[[raz]]
[notebook]
show_notebooks=true
[[interpreters]]

[[[hive]]]
name=Hive
interface=hiveserver2

[[[impala]]]
name=Impala
interface=hiveserver2

[[[sparksql]]]
name=SparkSql
interface=livy

[[[pyspark]]]
name=PySpark
interface=livy

[dashboard]
is_enabled=true
[[engines]]
[hadoop]
[[hdfs_clusters]]
[[[default]]]
fs_defaultfs=hdfs://nameservice1
webhdfs_url=http://192.168.37.20:14000/webhdfs/v1
hadoop_bin=/opt/cloudera/parcels/CDH-6.0.0-1.cdh6.0.0.p0.537114/lib/hadoop/bin/hadoop
security_enabled=false
temp_dir=/tmp
[[yarn_clusters]]
[[[default]]]
resourcemanager_host=192.168.37.1
resourcemanager_api_url=http://192.168.37.1:8088/
proxy_api_url=http://192.168.37.1:8088/
resourcemanager_port=8032
logical_name=yarnRM
history_server_api_url=http://192.168.37.1:19888/
security_enabled=false
submit_to=true
hadoop_mapred_home=/opt/cloudera/parcels/CDH-6.0.0-1.cdh6.0.0.p0.537114/lib/hadoop-mapreduce
hadoop_bin=/opt/cloudera/parcels/CDH-6.0.0-1.cdh6.0.0.p0.537114/lib/hadoop/bin/hadoop
[[[ha]]]
resourcemanager_host=192.168.37.20
resourcemanager_api_url=http://192.168.37.20:8088/
proxy_api_url=http://192.168.37.20:8088/
resourcemanager_port=8032
logical_name=yarnRM
history_server_api_url=http://192.168.37.1:19888/
security_enabled=false
submit_to=true
hadoop_mapred_home=/opt/cloudera/parcels/CDH-6.0.0-1.cdh6.0.0.p0.537114/lib/hadoop-mapreduce
hadoop_bin=/opt/cloudera/parcels/CDH-6.0.0-1.cdh6.0.0.p0.537114/lib/hadoop/bin/hadoop
[beeswax]
hive_server_host=192.168.37.242
hive_server_port=10009
server_conn_timeout=120
download_row_limit=10000
auth_username=hive
auth_password=ZT0kMfPMbzRaHBx
use_sasl=true
thrift_version=7
hive_metastore_host=192.168.37.162
hive_metastore_port=9083
[[ssl]]
[metastore]
[impala]
server_host=192.168.37.242
server_port=21052
impersonation_enabled=True
server_conn_timeout=120
auth_username=hive
auth_password=ZT0kMfPMbzRaHBx
[[ssl]]
[spark]
livy_server_url=http://192.168.37.160:8998
[oozie]
[filebrowser]
[pig]
[sqoop]
[proxy]
[hbase]
[search]
[libsolr]
[indexer]
[jobsub]
[jobbrowser]
[[query_store]]
[security]
[zookeeper]
[[clusters]]
[[[default]]]
[useradmin]
[[password_policy]]
[liboozie]
[aws]
[[aws_accounts]]
[azure]
[[azure_accounts]]
[[[default]]]
[[adls_clusters]]
[[[default]]]
[[abfs_clusters]]
[[[default]]]
[libsentry]
hostname=192.168.37.1
port=8038
[libzookeeper]
[librdbms]
[[databases]]
[libsaml]
[liboauth]
[kafka]
[[kafka]]
[metadata]
[[manager]]
[[optimizer]]
[[catalog]]
[[navigator]]
[[prometheus]]
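
Before starting Hue, it can save time to confirm that the endpoints referenced above are reachable from this machine. A quick sketch using the addresses from the sample configuration:

# MySQL metadata database
nc -zv 192.168.37.100 3306

# HttpFS / WebHDFS
curl -s 'http://192.168.37.20:14000/webhdfs/v1/?op=LISTSTATUS&user.name=hue'

# Livy server
curl -s http://192.168.37.160:8998/sessions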

4. Initialize Hue

sudo -u hue /data/hue-4.10/hue/build/env/bin/hue syncdb

sudo -u hue /data/hue-4.10/hue/build/env/bin/hue migrate

5. Start Hue

sudo -u hue nohup /data/hue-4.10/hue/build/env/bin/hue runserver 0.0.0.0:8887 &

1. Log in to Hue

The first login creates a super administrator from the entered username and password. If you forget the password, you can rerun the initialization commands above, or reset it directly as shown below.
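
Since Hue is built on Django, a password can also be reset with the standard changepassword management command (replace <username> with the actual account):

sudo -u hue /data/hue-4.10/hue/build/env/bin/hue changepassword <username>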

2. Test Spark in Hue

a. Spark SQL

  • Spark SQL is an Apache Spark module that provides a high-level API and query language for structured data processing, making it easy to execute SQL queries on Spark. A quick smoke test through Livy is sketched below.
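
Besides typing a query into the Hue editor, the same backend path can be exercised against Livy directly. Since Livy 0.5, a statement may declare its own kind, so a SQL statement can be posted to an already-idle session (session id 0 assumed, as in the earlier Livy example):

curl -s -X POST -H 'Content-Type: application/json' \
     -d '{"kind": "sql", "code": "SELECT 1 AS test"}' \
     http://192.168.37.160:8998/sessions/0/statements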

b. PySpark

  • PySpark is an efficient, easy-to-use Python API for processing large data sets. It lets Python developers work with large-scale data in a distributed computing environment and supports data analysis and mining through a variety of data-processing operations. The following PySpark example counts the number of occurrences of each word in a list:
# Input data
data = ["hello", "world", "hello", "world"]

# Convert the local collection into a Spark RDD and transform it
rdd = sc.parallelize(data)
res_rdd = rdd.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Collect the RDD back into a local list and print it
res_rdd_coll = res_rdd.collect()
for line in res_rdd_coll:
    print(line)

# Finish
sc.stop()

6. Write a Hue system startup script

1. Hue shell script

  • The shell script is shown below; don't forget to give the service user execute permission:
chmod +x /data/hue-4.10/hue/hue_service.sh
#!/bin/bash

if [ $# -ne 2 ]; then
        echo "please input two params, first is (hue), second is (start|stop)"
        exit 1
fi

if [ "$1" == "hue" ]; then
        if [ "$2" == "start" ]; then
                cd /data/hue-4.10/hue/logs
                echo "now starting hue"
                nohup /data/hue-4.10/hue/build/env/bin/hue runserver 0.0.0.0:8887 > /data/hue-4.10/hue/logs/info.log 2>&1 &
                exit 0
        elif [ "$2" == "stop" ]; then
                hue_pid=$(netstat -nltp | grep 8887 | awk '{print $NF}' | awk -F "/" '{print $1}')
                kill ${hue_pid}
                echo "hue has stopped"
                exit 0
        else
                echo "second param please input 'start' or 'stop'"
                exit 1
        fi
else
        echo "first param please input 'hue'"
        exit 1
fi
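
Usage then looks like this:

/data/hue-4.10/hue/hue_service.sh hue start
/data/hue-4.10/hue/hue_service.sh hue stop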

2. Hue systemd service

  • Create the corresponding hue.service file in /usr/lib/systemd/system
# /usr/lib/systemd/system/hue.service
[Unit]
Description=hue
Wants=network-online.target
After=network-online.target

[Service]
Type=forking
User=hue
Group=hue
ExecStart=/data/hue-4.10/hue/hue_service.sh hue start
ExecStop=/data/hue-4.10/hue/hue_service.sh hue stop
Restart=no

[Install]
WantedBy=multi-user.target

3. Reload systemd

systemctl daemon-reload

4. Start the test

systemctl start hue.service

5. Verification

systemctl status hue.service
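
To have Hue start automatically after a reboot, the unit can also be enabled:

systemctl enable hue.service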

7. Usage issues

1. The YARN queue cannot be bound

  • Scala, Spark SQL, and PySpark sessions that Hue starts through Livy all run in the default queue root.default, because the YARN queue name is hard-coded in the Hue code. The queue can be changed by modifying the Hue source (see the related GitHub issues):
vim /data/hue-4.10/hue/desktop/libs/notebook/src/notebook/connectors/spark_shell.py
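
For reference, Livy's create-session API accepts a queue field, so the patch to spark_shell.py amounts to including the desired queue in the session-creation request that Hue sends. The effect can be checked against Livy directly (my_queue is an example name):

curl -s -X POST -H 'Content-Type: application/json' \
     -d '{"kind": "pyspark", "queue": "my_queue"}' \
     http://192.168.37.160:8998/sessions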

2. Errors with Chinese characters

  • Statements containing Chinese characters fail outright in newer versions of Hue (see the related GitHub issue).

  • Modify the source to support UTF-8: change oprot.writeString(self.statement) to oprot.writeString(self.statement.encode('utf-8'))
vim /data/hue-4.10/hue/apps/beeswax/gen-py/TCLIService/ttypes.py


8. Summary

  • Hue is an open-source SQL assistant for data warehouses. It can be integrated with Livy to make developing SQL snippets easier: Apache Livy provides the bridge to a running Spark interpreter, so SQL, PySpark, and Scala snippets can be executed interactively.
  • Combined with Livy, Hue can be extended not only with Spark SQL + PySpark but also with Scala + R.
