How to build a new generation of real-time lakehouse? Kangaroo Cloud's exploration and upgrade path based on the data lake

In the previous articles of this real-time lakehouse series, we covered why the real-time lakehouse matters for enterprises' digital transformation, its functional architecture design, and the application scenarios that combine real-time computing with the data lake.

In this article, we introduce how Kangaroo Cloud's Data Stack explores and implements a real-time lakehouse system, as well as its future plans.

Why Data Stack chose the real-time lakehouse

As a data development platform, Data Stack offered a Lambda-architecture development model before introducing the real-time lakehouse, split into two separate links: real-time and offline. This development model brings the following problems:

· High complexity: separate components must be maintained for the streaming and batch links

· High storage cost: the streaming and batch links each keep their own copy of the same data

· Intermediate data in the real-time link is not queryable: Kafka's intermediate data is hard to inspect, since it supports only sequential reads, not random (point) queries

· Poor consistency of data definitions: it is difficult to keep data definitions unified across different compute engines


The real-time lakehouse reduces storage costs, greatly improves development efficiency, and lets enterprises mine data value faster and better:

· Provides diversified analysis capabilities: not limited to batch and stream processing, it is also friendly to interactive queries and machine learning

· Provides ACID transaction capabilities: insert, delete, update, and query are all supported, which better guarantees data quality; traditional data warehouses lack this capability (see the sketch after this list)

· Provides complete data management capabilities, including data formats, data schemas, etc.

· Provides scalable storage options, supporting HDFS, object storage, cloud storage, etc.
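As a concrete illustration of the ACID point above, the following is a minimal Spark SQL sketch against an Iceberg table (catalog, schema, and table names are hypothetical): row-level insert, update, delete, and merge all run as transactions on the lake table, which a classic Hive-style warehouse cannot do in place.

-- Hypothetical Iceberg table, for illustration only
CREATE TABLE lake_catalog.dws.user_profile (
  id   BIGINT,
  name STRING,
  city STRING
) USING iceberg;

-- Row-level changes run as ACID transactions on the lake table
INSERT INTO lake_catalog.dws.user_profile VALUES (1, 'alice', 'hangzhou');
UPDATE lake_catalog.dws.user_profile SET city = 'beijing' WHERE id = 1;
DELETE FROM lake_catalog.dws.user_profile WHERE id = 1;

-- Upsert a batch of changes in a single transaction
-- (user_profile_updates is a hypothetical staging table of change records)
MERGE INTO lake_catalog.dws.user_profile t
USING user_profile_updates s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;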

Data Stack's practice based on the real-time lakehouse

The following figure shows the architecture of the Data Stack solution based on the real-time lakehouse:

[Figure: architecture of the Data Stack real-time lakehouse solution]

Data from business databases is collected and ingested into the lake in real time through ChunJun, our self-developed data integration framework; real-time ingestion into Iceberg and Hudi is currently supported. Business development is then carried out on Data Stack's real-time and offline development platforms, where Flink and Spark support connecting to Iceberg/Hudi and displaying metrics of Iceberg/Hudi sources. Finally, data is managed through EasyLake, the unified lakehouse platform, with features such as one-click table conversion and lake table management.

Built on this, the real-time lakehouse resolves the pain points of the Lambda-architecture development model described above. It unifies streaming and batch at both the storage and compute layers, makes intermediate data in the real-time link queryable, keeps data definitions consistent, and lowers storage cost, bringing enterprises a faster, more flexible, and more efficient data processing experience. This is why Data Stack chose the real-time lakehouse.
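To make "one copy of data, consumed by both stream and batch" concrete, here is a minimal Flink SQL sketch (catalog, database, and table names are illustrative, assuming an Iceberg table already populated by the ingestion link): the same physical table is queried once in batch mode and once as a continuously monitored stream.

-- Register an Iceberg catalog (paths and names are placeholders)
CREATE CATALOG lake_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hadoop',
  'warehouse' = 'hdfs:///warehouse/lake'
);

-- Batch: ad-hoc inspection of the intermediate-layer table's current snapshot
SET 'execution.runtime-mode' = 'batch';
SELECT count(*) FROM lake_catalog.ods.orders;

-- Streaming: the same table, consumed incrementally as new snapshots are committed
SET 'execution.runtime-mode' = 'streaming';
SELECT * FROM lake_catalog.ods.orders
  /*+ OPTIONS('streaming'='true', 'monitor-interval'='30s') */;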

The following sections focus on real-time ingestion into the lake and materialized views.

Real-time CDC ingestion into the lake

Flink CDC is a log-based change data capture technology that implements a data integration framework with unified full and incremental reads. Combined with Flink's excellent pipeline capabilities and rich upstream and downstream ecosystem, Flink CDC can efficiently synchronize massive amounts of data in real time. However, ingesting CDC data into the lake in real time still faces considerable challenges:

· High real-time requirements: the fresher the data, the higher its business value

· Large volume of historical data: databases carry a large amount of historical data

· Strong consistency: data must be processed in order and the results must be consistent

· Dynamic schema evolution: the database schema keeps changing as the business evolves

So, how does Data Stack handle this?


Kangaroo Cloud's self-developed data integration framework ChunJun supports collecting CDC data, including MySQL CDC, Oracle CDC, PostgreSQL CDC, and SQL Server CDC. Once collected, the CDC data is written to an Iceberg/Hudi sink, completing real-time ingestion into the lake.

The entire link and architecture are independently developed by Kangaroo Cloud and fully controllable, achieving unified full and incremental synchronization with minute-level latency and no impact on business stability.

ChunJun: https://github.com/DTStack/chunjun.git
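In production this link is configured through ChunJun jobs; purely as an illustration of the idea with open-source components, here is a hedged Flink SQL sketch that reads a MySQL table in full-plus-incremental mode with the mysql-cdc connector and upserts it into an Iceberg v2 table (all host names, credentials, and table names are placeholders).

-- Source: MySQL binlog via the open-source mysql-cdc connector (full + incremental read)
CREATE TABLE orders_src (
  id      BIGINT,
  user_id BIGINT,
  amount  DECIMAL(10, 2),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysql-host',
  'port' = '3306',
  'username' = 'user',
  'password' = 'pass',
  'database-name' = 'demo',
  'table-name' = 'orders'
);

-- Target: an Iceberg v2 table with upsert enabled so update/delete events are applied
CREATE CATALOG lake_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hadoop',
  'warehouse' = 'hdfs:///warehouse/lake'
);
CREATE DATABASE IF NOT EXISTS lake_catalog.ods;
CREATE TABLE IF NOT EXISTS lake_catalog.ods.orders (
  id      BIGINT,
  user_id BIGINT,
  amount  DECIMAL(10, 2),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'format-version' = '2',
  'write.upsert.enabled' = 'true'
);

-- Continuous synchronization: CDC changes land in the lake table at checkpoint boundaries
INSERT INTO lake_catalog.ods.orders SELECT id, user_id, amount FROM orders_src;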

Problems encountered during real-time ingestion

During real-time ingestion into the lake, we also ran into problems and challenges:

· Small files: small files hurt read and write efficiency and degrade HDFS cluster stability

· Adapting Hudi to Flink 1.12: most customers' Flink versions are still stuck at 1.12

· Cross-cluster ingestion: with multiple Hadoop clusters, there are cross-cluster requirements

How did Data Stack tackle these problems one by one?

● Optimization of small file issues

· Properly set Checkpoint Interval

The entire compaction process is I/O heavy. If the Checkpoint Interval is set blindly (for example, too short), a series of problems follow: more small files, higher HDFS pressure, checkpoint failures, and unstable tasks.

After verification from multiple angles, we recommend setting the Checkpoint Interval to 1-5 minutes.
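In a Flink SQL job this is a single configuration entry; a sketch, where '3min' is just one value inside the recommended window:

-- Commit new data files to the lake table at every checkpoint, here every 3 minutes
SET 'execution.checkpointing.interval' = '3min';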

· Platform-based small file management

Adjusting the checkpoint interval only alleviates small file generation; platform-based small file management is what solves the problem at its root.

EasyLake's lake table management supports data file management, snapshot file management, and Hudi MOR incremental file merging, keeping the number of small files within a controlled range and improving management efficiency.
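On the platform this is a managed, automated capability; for reference, the kind of maintenance it corresponds to can be expressed with Iceberg's built-in Spark procedures (catalog and table names are placeholders, and the Hudi MOR compaction path has its own equivalent):

-- Merge small data files of an Iceberg table into larger ones
CALL lake_catalog.system.rewrite_data_files(table => 'ods.orders');

-- Expire old snapshots so the files they reference can be removed
CALL lake_catalog.system.expire_snapshots(
  table => 'ods.orders',
  older_than => TIMESTAMP '2023-11-01 00:00:00'
);

-- Delete files that are no longer referenced by any snapshot
CALL lake_catalog.system.remove_orphan_files(table => 'ods.orders');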


● How to adapt Hudi to Flink version 1.12

Data Stack did not start from scratch here. We first created a hudi-flink1.12 module based on the hudi-flink1.13.x module, changed the Flink version to 1.12.7, then fixed the incompatibilities one by one, and finally ran a complete functional test to finish the adaptation.


● Cross-cluster ingestion into the lake

By default, the Hudi and Iceberg sinks read core-site.xml and hdfs-site.xml from the HADOOP_CONF_DIR environment variable to access the corresponding HDFS.

Data Stack builds on the self-developed ChunJun: in ChunJun's iceberg-connector and hudi-connector, the way the Hadoop conf dir is obtained has been extended so that it can be overridden with custom parameters via hadoopConfig.

This lets data flow between clusters, breaks down data silos, and enables cross-cluster ingestion into the lake.
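ChunJun's hadoopConfig parameter carries the remote cluster's configuration inside the job itself. As a rough open-source analogue of the same idea, the Iceberg Flink catalog can be pointed at another cluster's configuration directory instead of relying on HADOOP_CONF_DIR; the paths below are placeholders, and this is a sketch of the concept rather than the ChunJun configuration.

-- Catalog that targets a remote HDFS cluster rather than the local default
CREATE CATALOG remote_lake WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hadoop',
  'warehouse' = 'hdfs://remote-nn:8020/warehouse/lake',
  -- directory containing the remote cluster's core-site.xml / hdfs-site.xml
  'hadoop-conf-dir' = '/etc/remote-hadoop/conf'
);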


ETL acceleration exploration: materialized views

Before introducing Data Stack's exploration of materialized views, we should first understand why materialized views are needed.

The real-time lakehouse runs three types of tasks: real-time ETL, offline ETL, and OLAP. As data flows from ODS to ADS, these tasks involve more and more aggregation, I/O becomes increasingly intensive, and SQL fragments with identical logic start to appear across the SQL of multiple tasks.

A materialized view pre-computes time-consuming operations such as joins and aggregations and stores the results. When a complex SQL query runs, it can build directly on these pre-computed results, avoiding the expensive operations and returning results faster.

Materialized views built on the data lake in the real-time lakehouse can be shared among streaming, batch, and OLAP tasks, further reducing the end-to-end latency of data in the real-time data lake, accelerating real-time processing links, saving compute cost, and improving query performance and response time.

● What is needed to implement materialized views in the real-time lakehouse

· Platform-based data lake materialized view management

· Spark support for managing materialized views based on data lake table formats

· Trino support for managing materialized views based on data lake table formats

· Flink support for managing materialized views based on data lake table formats

At present, Data Stack's real-time lakehouse has completed the Spark and Trino parts. The remaining parts will be implemented later so that materialized views can play their full role.

● Implementation principle of materialized view

· Syntax for creating a materialized view

CREATE MATERIALIZED VIEW (IF NOT EXISTS)? multipartIdentifier
          ('(' colTypeList ')')? tableProvider?
          ((OPTIONS options=tablePropertyList) |
           (PARTITIONED BY partitioning=partitionFieldList) |
           skewSpec |
           bucketSpec |
           rowFormat |
           createFileFormat |
           locationSpec |
           commentSpec |
           (TBLPROPERTIES tableProps=tablePropertyList))*
          AS query 

· Example

CREATE MATERIALIZED VIEW mv
AS SELECT
  a.id,
  a.name
FROM jinyu_base a
JOIN jinyu_base_partition b
ON a.id = b.id;
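
Once the view is built, downstream queries can read the stored result directly instead of re-running the join; whether an equivalent query against the base tables is transparently rewritten to hit the view depends on the engine and platform configuration. For example:

SELECT id, name FROM mv WHERE id = 100;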

Future Planning

Kangaroo Cloud's practical journey with the real-time lakehouse goes far beyond this. In the future, we will explore further and deeper, providing enterprises with more efficient, flexible, and intelligent data processing solutions.

· Ease of use: improve the usability of the platform's lake table management

· Introducing Paimon: support Paimon on the platform and add Paimon-based lakehouse construction

· Ingestion performance: dive into and enhance the kernel to improve lake ingestion performance

· Security exploration: the real-time lakehouse provides data sharing and supports multiple engines; explore security management solutions for it

This article is a summary of the live session "Five Lectures on Real-Time Lakehouse Practice, Issue 3". Interested readers can click the links below to watch the replay and download the courseware for free.

Live courseware:

https://www.dtstack.com/resources/1054?src=szsm

Live video:

https://www.bilibili.com/video/BV1Ee411d7Py/?spm_id_from=333.999.0.0

"DTStack Product White Paper": https://www.dtstack.com/resources/1004?src=szsm

“Data Governance Industry Practice White Paper” download address: https://www.dtstack.com/resources/1001?src=szsm

To learn more about or consult on Kangaroo Cloud's big data products, industry solutions, and customer cases, visit the Kangaroo Cloud official website: https://www.dtstack.com/?src=szcsdn

Those interested in big data open source projects are also welcome to join our community group to exchange the latest open source technology news. Group number: 30537511, project address: https://github.com/DTStack