1. Foreword
- This article walks through compiling Hue and integrating it with Livy + Spark. By combining Hue, Livy, and Spark SQL, you can write and execute SQL queries in a friendly web interface and run them on a remote Spark cluster.
1. Introduction to Hue
- Hue (Hadoop User Experience) is an open-source Apache Hadoop UI system that evolved from Cloudera Desktop; Cloudera eventually contributed it to the Apache Foundation's Hadoop community. It is implemented on top of the Python web framework Django. Hue provides a graphical user interface for a Hadoop cluster, making it convenient to configure, use, and inspect the cluster. (Hue official website, GitHub.) Its main features include:
- Access to HDFS and file browsing
- Debug and develop Hive queries and display results via the web
- Solr queries, result display, and report generation
- Debug and develop Impala interactive SQL queries via the web
- Spark debugging and development
- Pig development and debugging
- Development, monitoring, and workflow coordination/scheduling of Oozie tasks
- HBase data query, modification, and display
- Hive metadata (metastore) queries
- MapReduce task progress view and log tracking
- Create and submit MapReduce, Streaming, and Java jobs
- Sqoop2 development and debugging
- Browse and edit ZooKeeper
- Query and display relational databases (MySQL, PostgreSQL, SQLite, Oracle)
2. Introduction to livy
- Apache Livy is a service that makes it easy to interact with a Spark cluster through a REST interface. It can submit Spark jobs or snippets of Spark code, retrieve results synchronously or asynchronously, and manage the SparkContext lifecycle. Apache Livy also simplifies the interaction between Spark and application servers, enabling Spark to be used from interactive web and mobile applications. (Livy official website, GitHub.)
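To make the REST workflow concrete, the sketch below builds (but does not send) the two JSON requests Livy expects: one to create an interactive session (`POST /sessions`, with `kind` selecting the interpreter) and one to run a code snippet in it (`POST /sessions/{id}/statements`, with `code`). The server address reuses the `livy_server_url` configured later in this article; everything here is a minimal illustration, not a full client.

```python
import json
import urllib.request

# Assumed Livy server - the same address used in the hue.ini [spark] section below
LIVY_URL = "http://192.168.37.160:8998"

# Payload to create an interactive session (POST /sessions).
# "kind" selects the interpreter: spark, pyspark, sparkr or sql.
session_payload = {"kind": "pyspark"}

# Payload to run a snippet in that session (POST /sessions/{id}/statements).
statement_payload = {"code": "spark.range(10).count()"}

def build_request(path, payload):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return urllib.request.Request(
        LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("/sessions", session_payload)
print(req.full_url, req.get_header("Content-type"))
```

Sending the request with `urllib.request.urlopen(req)` (once a Livy server is actually reachable) returns a JSON session object whose `id` is then used in the statements URL.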
2. Compile and install Hue
name | version |
---|---|
hue | 4.10 |
livy | 0.8 |
Spark | 3.2.1 |
Since the compilation is cumbersome and complicated, the compiled packages have been uploaded to a network disk; pick them up if you need them.
Link: https://pan.baidu.com/s/1BOsnKKwKmTohSGbzi-3QpQ
Extraction code: C3hp
1. Install dependency packages
Hue depends on a number of packages that need to be installed before installing Hue.
yum install \
  ant \
  asciidoc \
  cyrus-sasl-devel \
  cyrus-sasl-gssapi \
  cyrus-sasl-plain \
  gcc \
  gcc-c++ \
  krb5-devel \
  libffi-devel \
  libxml2-devel \
  libxslt-devel \
  make \
  mysql \
  mysql-devel \
  openldap-devel \
  python-devel \
  sqlite-devel \
  gmp-devel \
  npm
2. Compile Hue
Go to the Hue source directory and execute make.
-
make apps
This packs all the installation artifacts into the current source tree. If the installation is later migrated, the entire tree must be packaged together, which leaves a large amount of redundant source code in the installation package.
-
PREFIX=/data/hue-4.10 make install
If you need to migrate the installation package, pay close attention to its path: some dependencies in Hue use absolute paths, so when migrating, try to keep the path unchanged.
-
After compilation succeeds, Hue can be started on the build machine. To deploy on other machines, pack and compress the /data/hue-4.10 directory and copy it to the target machine, again keeping the path unchanged if possible.
-
After the build finishes, run echo $? to check the exit code; 0 means the compilation succeeded.
3. Problems encountered
If a problem occurs during compilation, run make clean before recompiling, so that residue from the previous failed build does not affect the subsequent reinstallation.
1. npm download fails
Solution: npm's default registry is a foreign address, so downloads are very slow or fail entirely. Switch npm's registry to the domestic Taobao mirror:
npm config set proxy null                                 # clear the proxy
npm cache clean --force                                   # clear the cache
npm config set registry https://registry.npm.taobao.org   # set the mirror registry
2. nodejs version problem
- The cause is that Node.js 18 and above depend on newer underlying operating-system libraries. When installing Node.js, it is therefore recommended to use a version below 18.
3. Configure Hue
- Decompress the Hue package built in the previous step to /data/hue-4.10/ on the target machine, then configure /data/hue-4.10/hue/desktop/conf/pseudo-distributed.ini.
- Copy out a configuration file
cp /data/hue-4.10/hue/desktop/conf/pseudo-distributed.ini.tmpl /data/hue-4.10/hue/desktop/conf/pseudo-distributed.ini
- Edit the configuration file. The complete configuration is shown below for reference (covering MySQL metadata synchronization, HDFS file browsing, Livy integration, Hive integration, Impala integration, YARN integration, LDAP, etc.).
vim /data/hue-4.10/hue/desktop/conf/pseudo-distributed.ini

[desktop]
secret_key=ZT0kMfPMbzRaHBx
http_host=0.0.0.0
http_port=8887
time_zone=Asia/Shanghai
app_blacklist=pig, zookeeper, hbase, oozie, indexer, jobbrowser, rdbms, jobsub, sqoop, metastore
django_debug_mode=false
http_500_debug_mode=false
cherrypy_server_threads=50
default_site_encoding=utf
collect_usage=false
enable_prometheus=true
[[django_admins]]
[[custom]]
[[auth]]
backend=desktop.auth.backend.LdapBackend
idle_session_timeout=-1
[[[jwt]]]
[[ldap]]
ldap_url=ldap://ldap.cdh.com
ldap_username_pattern="uid=<username>,ou=People,dc=server,dc=com"
use_start_tls=false
search_bind_authentication=false
create_users_on_login=true
base_dn="ou=People,dc=server,dc=com"
bind_dn="cn=Manager,dc=server,dc=com"
bind_password="eFRrECKfQfoOB25"
[[[users]]]
[[[groups]]]
[[[ldap_servers]]]
[[vcs]]
[[database]]
engine=mysql
host=192.168.37.100
port=3306
user=hue
password=6a3ZsJtNs8SSCLe
name=hue_10
[[session]]
[[smtp]]
host=localhost
port=25
user=
password=
tls=no
[[knox]]
[[kerberos]]
[[oauth]]
[[oidc]]
[[metrics]]
[[slack]]
[[tracing]]
[[task_server]]
[[gc_accounts]]
[[[default]]]
[[raz]]
[notebook]
show_notebooks=true
[[interpreters]]
[[[hive]]]
name=Hive
interface=hiveserver2
[[[impala]]]
name=Impala
interface=hiveserver2
[[[sparksql]]]
name=SparkSql
interface=livy
[[[pyspark]]]
name=PySpark
interface=livy
[dashboard]
is_enabled=true
[[engines]]
[hadoop]
[[hdfs_clusters]]
[[[default]]]
fs_defaultfs=hdfs://nameservice1
webhdfs_url=http://192.168.37.20:14000/webhdfs/v1
hadoop_bin=/opt/cloudera/parcels/CDH-6.0.0-1.cdh6.0.0.p0.537114/lib/hadoop/bin/hadoop
security_enabled=false
temp_dir=/tmp
[[yarn_clusters]]
[[[default]]]
resourcemanager_host=192.168.37.1
resourcemanager_api_url=http://192.168.37.1:8088/
proxy_api_url=http://192.168.37.1:8088/
resourcemanager_port=8032
logical_name=yarnRM
history_server_api_url=http://192.168.37.1:19888/
security_enabled=false
submit_to=true
hadoop_mapred_home=/opt/cloudera/parcels/CDH-6.0.0-1.cdh6.0.0.p0.537114/lib/hadoop-mapreduce
hadoop_bin=/opt/cloudera/parcels/CDH-6.0.0-1.cdh6.0.0.p0.537114/lib/hadoop/bin/hadoop
[[[ha]]]
resourcemanager_host=192.168.37.20
resourcemanager_api_url=http://192.168.37.20:8088/
proxy_api_url=http://192.168.37.20:8088/
resourcemanager_port=8032
logical_name=yarnRM
history_server_api_url=http://192.168.37.1:19888/
security_enabled=false
submit_to=true
hadoop_mapred_home=/opt/cloudera/parcels/CDH-6.0.0-1.cdh6.0.0.p0.537114/lib/hadoop-mapreduce
hadoop_bin=/opt/cloudera/parcels/CDH-6.0.0-1.cdh6.0.0.p0.537114/lib/hadoop/bin/hadoop
[beeswax]
hive_server_host=192.168.37.242
hive_server_port=10009
server_conn_timeout=120
download_row_limit=10000
auth_username=hive
auth_password=ZT0kMfPMbzRaHBx
use_sasl=true
thrift_version=7
hive_metastore_host=192.168.37.162
hive_metastore_port=9083
[[ssl]]
[metastore]
[impala]
server_host=192.168.37.242
server_port=21052
impersonation_enabled=True
server_conn_timeout=120
auth_username=hive
auth_password=ZT0kMfPMbzRaHBx
[[ssl]]
[spark]
livy_server_url=http://192.168.37.160:8998
[oozie]
[filebrowser]
[pig]
[sqoop]
[proxy]
[hbase]
[search]
[libsolr]
[indexer]
[jobsub]
[jobbrowser]
[[query_store]]
[security]
[zookeeper]
[[clusters]]
[[[default]]]
[useradmin]
[[password_policy]]
[liboozie]
[aws]
[[aws_accounts]]
[azure]
[[azure_accounts]]
[[[default]]]
[[adls_clusters]]
[[[default]]]
[[abfs_clusters]]
[[[default]]]
[libsentry]
hostname=192.168.37.1
port=8038
[libzookeeper]
[librdbms]
[[databases]]
[libsaml]
[liboauth]
[kafka]
[[kafka]]
[metadata]
[[manager]]
[[optimizer]]
[[catalog]]
[[navigator]]
[[prometheus]]
4. Initialize Hue
sudo -u hue /data/hue-4.10/hue/build/env/bin/hue syncdb
sudo -u hue /data/hue-4.10/hue/build/env/bin/hue migrate
4. Start Hue
sudo -u hue nohup /data/hue-4.10/hue/build/env/bin/hue runserver 0.0.0.0:8888 &
1. Log in to Hue
On first login, the entered username and password become the super administrator account. If you forget the password, you can re-run the initialization commands above.
2. Hue test Spark
a, Spark-Sql
- Spark SQL is a module of Apache Spark that provides a high-level API and query language for structured data processing. It enables easy execution of SQL queries in Spark.
b, Pyspark
- PySpark is an efficient and easy-to-use Python API for processing large data sets. It lets Python developers quickly process large-scale data in a distributed computing environment and carry out data analysis and mining with a variety of data-processing methods. The following PySpark example counts the number of occurrences of each word in a list.
# input data
data = ["hello", "world", "hello", "world"]

# Convert the collection to a Spark RDD and run the operations
rdd = sc.parallelize(data)
res_rdd = rdd.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Convert the RDD back to a collection and print it
res_rdd_coll = res_rdd.collect()
for line in res_rdd_coll:
    print(line)

# Finish
sc.stop()
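The PySpark snippet above needs a running cluster and a `SparkContext` (`sc`). For a quick local sanity check without Spark, the same word count can be reproduced in plain Python; this sketch only mimics what the map + reduceByKey pair computes, it is not Spark:

```python
from collections import defaultdict

def word_count(words):
    """Local equivalent of rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)."""
    counts = defaultdict(int)
    for word in words:      # "map" phase: conceptually one (word, 1) pair per element
        counts[word] += 1   # "reduceByKey" phase: sum the 1s for each key
    return sorted(counts.items())

print(word_count(["hello", "world", "hello", "world"]))
# → [('hello', 2), ('world', 2)]
```

The output matches what `res_rdd.collect()` returns above (Spark does not guarantee ordering, hence the `sorted` here).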
5. Write Hue system startup script
1. Hue shell script
- The shell script is as follows; don't forget that it needs execute permission:
chmod +x /data/hue-4.10/hue/hue_service.sh
#!/bin/bash
if [ $# -ne 2 ]; then
    echo "please input two params, first is (hue), second is (start|stop)"
    exit 1
fi
if [ "$1" == "hue" ]; then
    if [ "$2" == "start" ]; then
        cd /data/hue-4.10/hue/logs
        echo "now is start hue"
        nohup /data/hue-4.10/hue/build/env/bin/hue runserver 0.0.0.0:8887 > /data/hue-4.10/hue/logs/info.log 2>&1 &
        exit 0
    elif [ "$2" == "stop" ]; then
        hue_pid=$(netstat -nltp | grep 8887 | awk '{print $NF}' | awk -F "/" '{print $1}')
        kill ${hue_pid}
        echo "hue has stopped"
        exit 0
    else
        echo "second param please input 'start' or 'stop'"
        exit 1
    fi
else
    echo "first param please input 'hue'"
    exit 1
fi
2. Hue system script
- Create the corresponding hue.service file in /usr/lib/systemd/system
# /usr/lib/systemd/system/hue.service
[Unit]
Description=hue
Wants=network-online.target
After=network-online.target

[Service]
Type=forking
User=hue
Group=hue
ExecStart=/data/hue-4.10/hue/hue_service.sh hue start
ExecStop=/data/hue-4.10/hue/hue_service.sh hue stop
Restart=no

[Install]
WantedBy=multi-user.target
3. reload systemctl
systemctl daemon-reload
4. Start the test
systemctl start hue.service
5. Verification
systemctl status hue.service
6. Usage issues
1. The yarn queue cannot be bound
- In Hue, the Scala, Spark, and PySpark sessions started through Livy all use the default queue root.default, because the YARN queue name is hard-coded in the Hue source. We can change the queue by modifying the Hue source. (GitHub related issues.)
vim /data/hue-4.10/hue/desktop/libs/notebook/src/notebook/connectors/spark_shell.py
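The essence of the patch is to add a `spark.yarn.queue` entry to the `conf` map of the session properties that Hue POSTs to Livy. The exact structure of `create_session()` in `spark_shell.py` varies by Hue version, so the following is only an illustrative sketch; `with_yarn_queue` and the queue name `root.hue` are hypothetical, not code that exists in Hue:

```python
# Illustrative sketch only - the real create_session() in spark_shell.py differs by version.
def with_yarn_queue(props, queue="root.hue"):
    """Return a copy of the Livy session properties pinned to a specific YARN queue.

    `props` is the dict of session properties Hue would POST to Livy;
    `queue` ("root.hue" here) is a hypothetical queue name - use your own.
    """
    patched = dict(props)
    conf = dict(patched.get("conf") or {})
    conf["spark.yarn.queue"] = queue   # YARN scheduler queue for the Livy session
    patched["conf"] = conf
    return patched

print(with_yarn_queue({"kind": "pyspark"}))
```

`spark.yarn.queue` is the standard Spark-on-YARN property for selecting a queue, so any session created with this `conf` lands in the chosen queue instead of root.default.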
2. Chinese error report
- Newer versions of Hue throw an error as soon as a statement contains Chinese characters. (GitHub related issues.)
- Modify the source code to support UTF-8: change oprot.writeString(self.statement) to oprot.writeString(self.statement.encode('utf-8'))
vim /data/hue-4.10/hue/apps/beeswax/gen-py/TCLIService/ttypes.py
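The underlying issue is that this Thrift code path expects UTF-8 bytes, while a statement containing Chinese arrives as a Python str; encoding it explicitly is a lossless conversion. A quick local illustration, independent of Hue:

```python
statement = "SELECT '中文' AS col"

# str -> bytes, which is what the patched writeString call passes down to Thrift
encoded = statement.encode("utf-8")
assert isinstance(encoded, bytes)

# The conversion is lossless: decoding restores the original statement
assert encoded.decode("utf-8") == statement

# Each Chinese character occupies 3 bytes in UTF-8, so the byte length grows
print(len(statement), len(encoded))
```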
7. Summary
- Hue is an open-source SQL assistant for data warehouses. Integrated with Livy, it makes developing SQL snippets much easier. Apache Livy provides a bridge to a running Spark interpreter so that SQL, PySpark, and Scala snippets can be executed interactively.
- Combining Hue with Livy, you can add not only SparkSql + PySpark interpreters but also Scala + R.