Use Hue to play with Amazon EMR (SparkSQL, Phoenix) and Amazon Redshift

Current situation

Apache Hue is a web-based interactive SQL assistant that helps big data practitioners (data warehouse engineers, data analysts, etc.) interact with data warehouses in SQL. To install Hue, simply check the box when launching your Amazon EMR cluster. Once Hue is enabled, the work of logging in to the master node to write and submit SQL moves to a web front end, which both simplifies unified management of day-to-day development and improves the access security of the cluster. Hue also has a distinctive advantage when using SparkSQL to submit Spark tasks remotely: compared with configuring Hive on Spark on the Amazon EMR cluster, or submitting code remotely through Livy, it greatly improves development and operations efficiency. This article also shows how to integrate an Amazon Redshift data warehouse through Hue, and how to submit Phoenix tasks remotely to interact with HBase, making Hue a unified SQL access platform for data warehouses.


Solution Architecture Overview

image.png

Project Introduction

Submit SparkSQL Job via Livy

Execution engine status

First, let’s briefly compare the current status of several popular execution engines:

  • Apache MapReduce is the slowest query execution engine due to the high disk IO required to process customer queries.
  • Apache Tez is significantly faster than Apache MapReduce while keeping disk IO constant.
  • Apache Spark is slightly faster than Apache Tez and avoids IO blocking. Like Apache Tez, it processes data as a DAG. Spark is more versatile, offering in-memory computing, real-time stream processing, machine learning, and other computing methods, and is well suited to iterative computation.

Apache Livy Introduction

Apache Livy is a service that makes it easy to interact with Spark clusters through a REST API. The configuration method in this solution can submit SparkSQL written on the Hue page to the EMR cluster through the Livy interface.
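As a rough sketch of what this interaction looks like under the hood, the snippet below builds the JSON bodies for Livy's two core REST calls: creating an interactive session and submitting a statement to it. The endpoint is an assumption (Livy listens on port 8998 by default), and the helper names are illustrative, not part of Hue or Livy.

```python
import json

LIVY_URL = "http://localhost:8998"  # assumed Livy endpoint; 8998 is Livy's default port


def create_session_payload(kind="sql"):
    """Body for POST /sessions: starts an interactive session of the given kind."""
    return {"kind": kind}


def submit_statement_payload(sql):
    """Body for POST /sessions/{id}/statements: runs one SQL statement."""
    return {"code": sql}


# The equivalent raw calls would look roughly like:
#   curl -X POST -H 'Content-Type: application/json' \
#        -d '{"kind": "sql"}' http://localhost:8998/sessions
#   curl -X POST -H 'Content-Type: application/json' \
#        -d '{"code": "SELECT 1"}' http://localhost:8998/sessions/0/statements
print(json.dumps(submit_statement_payload("SELECT 1")))
```

Hue performs this same session-and-statement dance on our behalf once it is configured to talk to Livy, which is what the next section sets up.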

How EMR Hue handles SparkSQL by default

When we select SparkSQL in the Editor on the Hue panel and submit a SQL task, then look up the application in the Resource Manager console by its application_id (e.g. "Executing on YARN cluster with App id application_1656071365605_0006"), we find that the corresponding Application Type is Tez:

image.png

image.png

When we open the Hue configuration file (/etc/hue/conf/hue.ini) and look at the [[[sql]]] configuration shown below, the interface is set to hiveserver2. SparkSQL is therefore still going through HiveServer2 at this point, so the Tez engine is used (the default Hive execution engine on EMR is Tez). In other words, the Spark execution engine is not actually running the query above.

image.png

Submit SparkSQL tasks through Livy in EMR Hue

(1) Modify the execution engine in the Hue configuration file (/etc/hue/conf/hue.ini) and restart the Hue service
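As a sketch of the edit (exact keys vary by Hue version; the screenshot in the original article shows the authoritative form), the relevant hue.ini sections look roughly like this. The Livy endpoint value is an assumption; 8998 is Livy's default port on EMR:

```ini
[notebook]
  [[interpreters]]
    [[[sql]]]
      name=SparkSql
      # Route SparkSQL through Livy instead of hiveserver2
      interface=livy

[spark]
  # Depending on the Hue version, the Livy endpoint is set via
  # livy_server_url or livy_server_host/livy_server_port (default 8998)
  livy_server_url=http://localhost:8998
```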

image.png

sudo systemctl restart hue.service
sudo systemctl status hue.service

After resubmitting the SparkSQL task, you can see that the ApplicationType of the application is now SPARK.

image.png

Performance tuning in production scenarios:

Viewing the Environment details of the application above through the Spark management interface:

image.png

We can see that spark.driver.memory and spark.executor.memory are both set to 1G:

image.png

This is because the Hue source code hard-codes both of these parameters to 1G:

https://github.com/cloudera/hue/blob/bd6324a79c2e6b6d002ddd6767b0e63883373320/desktop/libs/notebook/src/notebook/connectors/spark_shell.py

{
"name": "driverMemory",
"nice_name": _("Driver Memory"),
"help_text": _("Amount of memory to use for the driver process in GB. (Default: 1). "),
"type": "jvm",
"is_yarn": False,
"multiple": False,
"defaultValue": '1G',
"value": '1G',
},
…
{
"name": "executorMemory",
"nice_name": _("Executor Memory"),
"help_text": _("Amount of memory to use per executor process in GB. (Default: 1)"),
"type": "jvm",
"is_yarn": True,
"multiple": False,
"defaultValue": '1G',
"value": '1G',
}

Because these default values can easily trigger an OOM exception during task execution and cause the task to fail, we can tune them as follows:

sudo cp /usr/lib/hue/desktop/libs/notebook/src/notebook/connectors/spark_shell.py /usr/lib/hue/desktop/libs/notebook/src/notebook/connectors/spark_shell.py.bak
sudo vi /usr/lib/hue/desktop/libs/notebook/src/notebook/connectors/spark_shell.py

Delete the driverMemory and executorMemory configuration blocks and restart the Hue service:

sudo systemctl restart hue.service
sudo systemctl status hue.service

image.png

Run SparkSQL again; the Environment tab shows that the two memory parameters have been updated and are now consistent with the definitions in /etc/spark/conf/spark-defaults.conf:
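With the hard-coded 1G values removed, Livy sessions inherit the cluster defaults. As an illustration only (EMR computes the actual numbers from the instance type, so the values below are placeholders, not real cluster output), the relevant lines in /etc/spark/conf/spark-defaults.conf look like:

```
spark.driver.memory      2048M
spark.executor.memory    4743M
```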

image.png

image.png

Configuring Phoenix in Hue to submit HBase tasks

Introduction to Apache Phoenix

Apache Phoenix is an open source, massively parallel relational database engine that supports OLTP for Hadoop using Apache HBase as its backing store. Phoenix provides a JDBC driver that hides the complexity of noSQL storage, enabling users to create, drop, and alter SQL tables, views, indexes, and sequences.

Configuring Phoenix

(1) Prepare Hue Python Virtual Environment

sudo /usr/lib/hue/build/env/bin/pip install phoenixdb

(2) Modify the Hue configuration file:

Add the following to the [notebook] [[interpreters]] section of /etc/hue/conf/hue.ini:

[[[phoenix]]]
name=HBase Phoenix
interface=sqlalchemy
options='{"url": "phoenix://ip-172-31-37-125.ap-southeast-1.compute.internal:8765/"}'

Restart Hue service

sudo systemctl restart hue.service
sudo systemctl status hue.service

(3) Submit a Phoenix task from the Hue page:

Because of the configuration file update, the HBase Phoenix option now appears in the Hue Editor, where we can create and query a table:

image.png

CREATE TABLE user (id varchar PRIMARY KEY, name varchar, passwd varchar)
upsert into user(id, name, passwd) values('001', 'admin', 'admin')
select * from user

image.png

Fixing garbled column names in HBase

(1) After completing the operations above, return to the HBase shell to view the table contents; the column names are garbled:

image.png

The problem also reproduces when the table is created from the Phoenix command line (/usr/lib/phoenix/bin/sqlline.py, i.e. not through Hue), whereas no garbled characters appear over the Phoenix JDBC connection:

image.png

(2) Phoenix (4.10 and later) maps column names to encoded numeric qualifiers by default, so the raw qualifiers seen from the HBase shell are not the original column names. Appending COLUMN_ENCODED_BYTES=0 when creating the table in Phoenix disables this encoding and avoids the problem:

CREATE TABLE user02 (id varchar PRIMARY KEY, name varchar, passwd varchar) COLUMN_ENCODED_BYTES=0
upsert into user02(id, name, passwd) values('002', 'admin', 'admin')
select * from user02

Viewing the results in the HBase shell, the column names are now displayed normally:

image.png

Hue connects to Redshift to submit tasks

When a data warehouse platform involves multiple services such as Amazon EMR and Amazon Redshift, Hue's rich set of connector types makes it easy to provide unified interaction across all of them.

(1) Prepare Hue Python Virtual Environment

cd /usr/lib/hue/
sudo ./build/env/bin/pip install sqlalchemy-redshift
sudo /usr/lib/hue/build/env/bin/pip2.7 install psycopg2-binary

(2) Modify the Hue configuration file:

Add the following to the [notebook] [[interpreters]] section of /etc/hue/conf/hue.ini:

[[[redshift]]]
name=Redshift
interface=sqlalchemy
options='{"url": "redshift+psycopg2://username:password@<redshift-cluster-endpoint>:5439/database"}'
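One practical detail when filling in this URL: credentials containing characters such as '@' or '/' must be percent-encoded before being embedded. The sketch below shows one way to build the URL safely; every value here (user, password, cluster endpoint, database) is a placeholder, not a real credential.

```python
from urllib.parse import quote_plus

# All values below are hypothetical placeholders; substitute your own.
user = "username"
password = "p@ss/word"  # example password containing URL-unsafe characters
host = "examplecluster.abc123.ap-southeast-1.redshift.amazonaws.com"
database = "database"

# Percent-encode only the password; the rest of the URL is plain text.
url = "redshift+psycopg2://{0}:{1}@{2}:5439/{3}".format(
    user, quote_plus(password), host, database)
print(url)
```

The resulting string is what goes inside the options='{"url": "..."}' value shown above.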

Restart Hue service

sudo systemctl restart hue.service
sudo systemctl status hue.service

(3) Submit a Redshift task from the Hue page:

Because of the configuration file update, the Redshift option now appears in the Hue Editor:

image.png

Submit a SQL query to easily obtain Amazon Redshift data warehouse data:

image.png

Summary

This article helps Amazon EMR users implement a unified development tool for the data warehouse platform through Hue. On the one hand, it centralizes management of the data warehouse's SQL development tasks; on the other, it provides an independent analysis platform for other departments, both of which help advance the construction of the data warehouse.

Author of this article

image.png

Sunny Fang Amazon technical account manager, mainly supports architecture optimization, cost management, technical consulting, etc. for customers in the financial and Internet industries, and focuses on technical research and practice in the direction of big data and containers. Before joining Amazon, he worked for technology companies such as Citrix and Microsoft, and has 8 years of experience in architecture optimization and support in the field of virtualization and public cloud.

image.png

Zhang Yin Amazon technical account manager, responsible for architecture and cost optimization, technical support, etc. for enterprise-level customers. He has many years of practical experience in big data architecture design and data warehouse modeling. Before joining Amazon, he was responsible for the architecture design, data warehouse modeling, operation and maintenance of the leading e-commerce big data platform for a long time.

Article source: https://dev.amazoncloud.cn/column/article/630b3f0176658473a3220015?sc_medium=regulartraffic&sc_campaign=crossplatform&sc_channel=CSDN