How to move quickly from ETL to ELT? Volcano Engine ByteHouse does these three things


Foreword

In enterprise analysis scenarios, the data used usually comes from a variety of business systems. Most of these systems store data as behavior-level records, such as payment transactions, user purchase events, and sensor alarms. Data warehousing and analytics, by contrast, rely mainly on columnar storage. Converting data from row-oriented records into columnar analytical models is therefore a basic capability for building an enterprise data warehouse.

The traditional approach is to use Extract-Transform-Load (ETL) to convert business data into a data model suitable for the warehouse. This, however, relies on an ETL system maintained independently of the data warehouse, which keeps maintenance costs high. With the advent of the cloud computing era, cloud data warehouses have gained far greater scalability and computing power, which calls for a change to the traditional ETL process.

Volcano Engine ByteHouse is a cloud-native data warehouse built on open-source ClickHouse. It offers an extremely fast analysis experience, supporting both real-time analysis and offline analysis of massive data, along with convenient elastic scaling, top-tier analytical performance, and rich enterprise-grade features. Backed by this computing power, it can fully support Extract-Load-Transform (ELT), freeing users from maintaining multiple heterogeneous systems.

Specifically, users can import data and then run custom SQL statements to transform it inside ByteHouse, without relying on a separate ETL system and its resources. Data transformation is thus completed through a single, unified SQL interface.

This article focuses on the challenges ByteHouse encountered and how it achieved complete ELT support by building three major capabilities.

Pain points and challenges

Let’s start with a simple SSB (Star Schema Benchmark) scenario, which includes:

  • 1 fact table: lineorder

  • 4 dimension tables: customer, part, supplier, dwdate

In SSB query analysis, we found that most queries join the fact table with the dimension tables, so the fact table can be “flattened” in the Transform step. The SQL used for flattening is as follows:

insert into ssb_flat
select * from lineorder l
join customer c on l.lo_custkey = c.c_custkey
join part p on l.lo_partkey = p.p_partkey
join supplier s on l.lo_suppkey = s.s_suppkey
where l.lo_orderdate = <bizdate>

Subsequent query analysis can then scan the single ssb_flat table and avoid many join operations, which significantly improves performance. This flattening process is one kind of “Transform”; in real production scenarios the Transform cases become more numerous and more complex. The flattening example above, however, is enough to reveal the common pain points this kind of operation creates for a database.
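As a rough sketch, a downstream analysis query can read the flattened table directly, with no joins at query time. The columns used below (c_region, lo_revenue) follow the standard SSB schema and are chosen here purely for illustration:

-- aggregate revenue by customer region straight from the flattened table
select c_region, sum(lo_revenue) as revenue
from ssb_flat
group by c_region
order by revenue desc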

Compared with ordinary queries, transformation operations differ in several major ways:

  1. The transformation operation takes a long time to execute and the overall retry cost is high.

  2. The transformation operation has no return value; we only care about its success or failure.

  3. The transformation operation reads and writes large volumes of data and occupies substantial resources.

Specifically:

  • First, ByteHouse excels at ad-hoc queries that finish within seconds; if such a query fails, an error is simply returned and the upstream caller retries. In an ETL scenario, however, if a task fails for some reason after running for 50 minutes, retrying from scratch wastes the resources spent in those 50 minutes, which is clearly unacceptable.

  • Second, since ETL tasks return no results, the client must hold an idle, long-lived connection, which can easily time out due to configuration. A large number of concurrent tasks also consumes connection resources needed by normal queries.

  • Finally, because ETL tasks read and write heavily, resource allocation must be considered when multiple tasks run concurrently in order to balance performance and isolation.

To address these three pain points, ByteHouse designed three corresponding features: long-running task management, asynchronous submission, and query queues.

Feature 1: Long-running task management

Normally, the max_execution_time setting can be used to control a query's timeout, and ByteHouse provides transaction support to guarantee the atomicity of read and write operations.
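For example, a session-level timeout can be set as follows (the one-hour value is only an illustrative choice):

-- cap a single query at one hour
set max_execution_time = 3600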

But this is not enough to cover the needs of ETL tasks. Systemic failures, such as a node running out of memory, are more likely to occur during long-running task execution, and having the client retry in such cases is not an elegant solution.

In ByteHouse, a SQL query is converted into a series of operators. We want to make these operators more fault tolerant so that long-running queries can better withstand system failures. In the current version, ByteHouse already provides disk spill for the aggregation, sort, and join operators: when an operator cannot obtain enough memory, it is allowed to cache part of its data on disk so that it can still complete its work under resource constraints.

For example, for the sort operator we introduced external merge sort, controlled by max_bytes_before_external_sort. In the figure below, the left side shows the sort query plan without spill enabled, and the right side shows the plan with spill enabled.

You can see that after external sort is turned on, ByteHouse introduces two additional operators, BufferingToFileTransform and MergingSortedTransform. The aggregation and join operators in ByteHouse have received similar optimizations, such as grace hash join.
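A minimal sketch of turning on disk spill for sorting follows; the 10 GB threshold and the lo_revenue column are illustrative assumptions rather than recommended values:

-- let the sort operator spill to disk once it holds roughly 10 GB in memory
set max_bytes_before_external_sort = 10000000000;

select * from ssb_flat
order by lo_revenue desc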

Next, ByteHouse also plans to further improve fault tolerance for exchange (shuffle) operations.

Feature 2: Asynchronous submission

When faced with a large number of long-running ETL tasks, the traditional synchronous execution model requires the client to wait for the server to return. This makes client timeouts likely, which in turn disrupts the execution of subsequent tasks.

At the same time, users in this scenario do not care about the response time of a single task or request; they only expect the task to finish within a given window, and they demand high reliability. ByteHouse therefore provides the ability to submit tasks asynchronously.

ByteHouse users can now submit an asynchronous task by setting enable_async_execution. After receiving such a task, ByteHouse returns an asynchronous task ID, such as ff46fccf-d872-4c68-bdb2-c8c18fc178f5. The client can then poll periodically to obtain the task's final status.

ByteHouse provides the show async status 'ff46fccf-d872-4c68-bdb2-c8c18fc178f5' command to obtain that status, and the kill query 'ff46fccf-d872-4c68-bdb2-c8c18fc178f5' command to cancel an asynchronous query.
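Putting the pieces together, a submit-and-poll flow might look like the sketch below, reusing the flattening statement from earlier; the setting and commands are those named above, and exact syntax may vary across ByteHouse versions:

-- submit the long-running transform asynchronously
set enable_async_execution = 1;

insert into ssb_flat
select * from lineorder l
join customer c on l.lo_custkey = c.c_custkey
join part p on l.lo_partkey = p.p_partkey
join supplier s on l.lo_suppkey = s.s_suppkey
where l.lo_orderdate = <bizdate>;
-- the server immediately returns a task ID, e.g. ff46fccf-d872-4c68-bdb2-c8c18fc178f5

-- poll for the final status
show async status 'ff46fccf-d872-4c68-bdb2-c8c18fc178f5';

-- cancel it if necessary
kill query 'ff46fccf-d872-4c68-bdb2-c8c18fc178f5';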

Feature 3: Query queue

When offline processing faces a large number of requests and the system becomes overloaded, a queuing mechanism is needed to hold query requests until the cluster frees up resources and can schedule them. ByteHouse provides query queues for this purpose.

ByteHouse allows users to define a queue along three dimensions: queue size, total CPU usage, and total memory usage.
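Purely as a hypothetical illustration of those three dimensions (this is not documented ByteHouse syntax; the statement and parameter names are invented for the sketch):

-- hypothetical queue definition, for illustration only
-- create queue etl_queue
--     max_queued_queries = 100   -- queue size
--     max_cpu            = 64    -- total CPU usage
--     max_memory         = 256G  -- total memory usage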

In ByteHouse, the Resource Manager component monitors the query metrics in each queue to obtain the queue's resource usage. When a user submits a query to a queue, ByteHouse enqueues it if the queue has not reached its upper limit; otherwise the query is rejected.

After that, ByteHouse keeps checking the queue's resource utilization: when idle resources rise above a certain threshold, it dequeues a waiting query; when a waiting query is canceled, it is likewise removed from the queue. With query queues, users have more freedom when orchestrating ETL tasks, without worrying about overloading the underlying resources.

ByteHouse also plans to add priority queues. Users will then be able to assign different priorities to ETL tasks and real-time queries, so that ELT tasks and real-time queries can run in the same compute group without significantly affecting each other.

Summary

The above introduces some of the technical details behind ByteHouse's ELT support. The long-running task, asynchronous submission, and query queue features are already available in the preview version. Next, ByteHouse will continue to expand its ELT capabilities, including more ELT-related transformation functions, long-task fault tolerance, and priority queues.

Beyond ELT, Volcano Engine ByteHouse is built on a fully self-developed high-availability engine and query optimizer, providing enterprises with fast, stable, and secure query services and data-writing performance. Under its cloud-native architecture, Volcano Engine ByteHouse delivers a highly scalable, unified data analysis platform with excellent elasticity, ensuring that resources can be flexibly scaled out.

ByteHouse also supports multi-level resource isolation, offering stronger security guarantees for user resources, and provides complete, business-oriented operations monitoring and troubleshooting capabilities, helping enterprises move their workloads to managed cloud services and reduce operations and maintenance costs. You are welcome to visit the Volcano Engine ByteHouse official website to try it out.