Apache Doris Data Modeling: The Aggregate Key Model

Understanding the Doris data model is essential to using Doris to solve business problems. In this series, I will explain Doris's three data models in detail, along with Rollup, materialized views, and prefix indexes built on these models, as well as the data partitioning and bucketing strategies based on them, to help users make better use of Doris.

We know that in Doris, data is logically described in the form of tables. A table consists of rows and columns: a row is a single record of user data, and a column describes one field of that record.

Columns can be divided into two categories: Key and Value. From a business perspective, Key and Value columns correspond to dimension columns and indicator columns respectively.

Doris provides three data models for different scenarios:

  • Aggregate (aggregation model): The columns in the table are divided into Key and Value. Data is grouped by the dimension (Key) columns, and the indicator (Value) columns are aggregated within each group.
  • Unique (unique primary key model): This model is typically used when Doris must stay consistent with the business data in an upstream RDBMS; it supports Upsert, Delete, and similar operations.
  • Duplicate (detail model): This model does not guarantee uniqueness: if you import the same data twice, it is stored twice. The data has neither a primary key nor aggregation requirements, and this is the model we generally use. (Minimal DDL sketches of the Unique and Duplicate key clauses follow this list.)
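For reference, here is a minimal sketch of how the other two models are declared (the table and column names are hypothetical; the Aggregate model's full DDL appears in the next section). Only the key clause changes:

-- Hypothetical Unique Key table: one row per order_id; re-imported rows upsert.
CREATE TABLE example_unique
(
    `order_id` BIGINT NOT NULL COMMENT "order id",
    `status` VARCHAR(20) COMMENT "order status"
)
UNIQUE KEY(`order_id`)
DISTRIBUTED BY HASH(`order_id`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");

-- Hypothetical Duplicate Key table: every imported row is kept as-is.
CREATE TABLE example_duplicate
(
    `log_time` DATETIME NOT NULL COMMENT "log time",
    `message` VARCHAR(255) COMMENT "raw log line"
)
DUPLICATE KEY(`log_time`)
DISTRIBUTED BY HASH(`log_time`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");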

Aggregation model features

A typical model would naturally store detailed data in the database, and that applies to Doris as well. But Doris was originally built for advertising reports at Baidu's Fengchao platform, and advertising reports have one big characteristic: they only care about the results of statistical analysis, not much about the detailed data. That is why Doris's earliest data model was the aggregation model.

The defining characteristic of the aggregation model is that the columns in the table are divided into two types: Key and Value. Key columns are the dimension columns of the data, such as time and region; Value columns are the indicator columns, such as clicks and cost. Each indicator column also has its own aggregation function, such as sum, min, max, or bitmap_union. Data is grouped by the dimension columns, and the indicator columns are aggregated within each group. Let's walk through a concrete example.

Our example is a typical fact table of user information and access behavior. In a standard star schema, user information and access behavior would be stored in a dimension table and a fact table respectively; here, to explain the Doris data model more conveniently, we store both kinds of information in a single table.

In this table, user_id, date, city, age, and sex are the dimension columns, and for each combination of them we track the user's last visit time, the user's total consumption, and the user's maximum and minimum dwell times.

In the output of DESC for such a table, the Key field tells you whether a column is a Key (dimension) column (true) or a Value (indicator) column (false), and in the output of DESC ... ALL, an IndexKeysType of AGG_KEYS indicates an aggregation model. For all Value columns, we specify their aggregation type (AggregationType) when creating the table.
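As a rough sketch (the real client prints a bordered MySQL-style table), the DESC output for the table we are about to create looks something like this, with the Key field marking dimension columns and the Extra field showing each Value column's aggregation type:

DESC example_tbl_02;

Field           | Type        | Null | Key   | Default             | Extra
user_id         | LARGEINT    | No   | true  | NULL                |
date            | DATE        | No   | true  | NULL                |
city            | VARCHAR(20) | Yes  | true  | NULL                |
age             | SMALLINT    | Yes  | true  | NULL                |
sex             | TINYINT     | Yes  | true  | NULL                |
last_visit_date | DATETIME    | Yes  | false | 1970-01-01 00:00:00 | REPLACE
cost            | BIGINT      | Yes  | false | 0                   | SUM
max_dwell_time  | INT         | Yes  | false | 0                   | MAX
min_dwell_time  | INT         | Yes  | false | 99999               | MIN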

The corresponding table creation statement is as follows:

CREATE TABLE example_tbl_02
(
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `date` DATE NOT NULL COMMENT "Data injection date and time",
    `city` VARCHAR(20) COMMENT "User's city",
    `age` SMALLINT COMMENT "User age",
    `sex` TINYINT COMMENT "User gender",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "User's last visit time",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "Total user consumption",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "User's maximum stay time",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "User's minimum stay time"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
);

Let’s insert some data:

user_id | date       | city      | age | sex | last_visit_date     | cost | max_dwell_time | min_dwell_time
10000   | 2017-10-01 | Beijing   | 20  | 0   | 2017-10-01 06:00:00 | 20   | 10             | 10
10000   | 2017-10-01 | Beijing   | 20  | 0   | 2017-10-01 07:00:00 | 15   | 2              | 2
10001   | 2017-10-01 | Beijing   | 30  | 1   | 2017-10-01 17:05:45 | 2    | 22             | 22
10002   | 2017-10-02 | Shanghai  | 20  | 1   | 2017-10-02 12:59:12 | 200  | 5              | 5
10003   | 2017-10-02 | Guangzhou | 32  | 0   | 2017-10-02 11:20:00 | 30   | 11             | 11
10004   | 2017-10-01 | Shenzhen  | 35  | 0   | 2017-10-01 10:00:15 | 100  | 3              | 3
10004   | 2017-10-03 | Shenzhen  | 35  | 0   | 2017-10-03 10:20:22 | 11   | 6              | 6

In the data above, the Keys of the first two rows are identical, so their Value fields should be aggregated automatically according to the aggregation methods specified when we created the table. Let's execute the following statement to insert the data and see whether the result matches our expectation:

insert into example_tbl_02 values
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 06:00:00",20,10,10),
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 07:00:00",15,2,2),
(10001,"2017-10-01","Beijing",30,1,"2017-10-01 17:05:45",2,22,22),
(10002,"2017-10-02","Shanghai",20,1,"2017-10-02 12:59:12",200,5,5),
(10003,"2017-10-02","Guangzhou",32,0,"2017-10-02 11:20:00",30,11,11),
(10004,"2017-10-01","Shenzhen",35,0,"2017-10-01 10:00:15",100,3,3),
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 10:20:22",11,6,6);

After the insert completes, we can query the table to check the final, aggregated data.
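A minimal verification query and its expected result: the two rows for user 10000 on 2017-10-01 collapse into one, with cost summed to 35, max_dwell_time keeping 10, min_dwell_time keeping 2, and last_visit_date holding one REPLACE value (here the 07:00:00 one, though Doris does not strictly guarantee replacement order within a single import batch):

select * from example_tbl_02;

user_id | date       | city      | age | sex | last_visit_date     | cost | max_dwell_time | min_dwell_time
10000   | 2017-10-01 | Beijing   | 20  | 0   | 2017-10-01 07:00:00 | 35   | 10             | 2
10001   | 2017-10-01 | Beijing   | 30  | 1   | 2017-10-01 17:05:45 | 2    | 22             | 22
10002   | 2017-10-02 | Shanghai  | 20  | 1   | 2017-10-02 12:59:12 | 200  | 5              | 5
10003   | 2017-10-02 | Guangzhou | 32  | 0   | 2017-10-02 11:20:00 | 30   | 11             | 11
10004   | 2017-10-01 | Shenzhen  | 35  | 0   | 2017-10-01 10:00:15 | 100  | 3              | 3
10004   | 2017-10-03 | Shenzhen  | 35  | 0   | 2017-10-03 10:20:22 | 11   | 6              | 6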

Aggregation methods supported by aggregation models

When we import data, rows with identical Key columns are aggregated into a single row, and each Value column is aggregated according to its configured AggregationType. Besides the generic agg_state, AggregationType currently supports the following aggregation methods:

  1. SUM: sum; the Values of multiple rows are accumulated.
  2. REPLACE: replacement; the Value in the next batch of data replaces the Value in previously imported rows.
  3. MAX: keep the maximum value.
  4. MIN: keep the minimum value.
  5. REPLACE_IF_NOT_NULL: non-null replacement. Unlike REPLACE, null values do not overwrite existing values, which makes this method suitable for partial column updates.
  6. HLL_UNION: the aggregation method for HLL-type columns, aggregated with the HyperLogLog algorithm.
  7. BITMAP_UNION: the aggregation method for BITMAP-type columns, which performs union aggregation of bitmaps (see the sketch after this list).
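As a quick illustration of BITMAP_UNION, here is a minimal sketch (the table and column names are hypothetical) that computes exact per-day unique visitors by folding user ids into a bitmap at import time:

-- Hypothetical UV table: rows with the same dt are merged by bitmap union.
CREATE TABLE example_uv
(
    `dt` DATE NOT NULL COMMENT "stat date",
    `user_ids` BITMAP BITMAP_UNION NOT NULL COMMENT "bitmap of user ids seen on dt"
)
AGGREGATE KEY(`dt`)
DISTRIBUTED BY HASH(`dt`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");

-- to_bitmap() wraps an integer id in a single-element bitmap.
insert into example_uv values
("2017-10-01", to_bitmap(10000)),
("2017-10-01", to_bitmap(10001)),
("2017-10-02", to_bitmap(10000));

-- Exact distinct-user count per day over the pre-aggregated bitmaps.
select dt, bitmap_union_count(user_ids) as uv from example_uv group by dt;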

Aggregation model usage scenarios

We know that the data model must be specified when the table is created and cannot be modified afterwards. If you do not specify a data model when creating the table, the default is the detail model (Duplicate Key).

  1. Through pre-aggregation, the Aggregate model can greatly reduce the amount of data scanned and the amount of computation required by aggregation queries, making it very suitable for report queries with fixed patterns. However, this model is not friendly to count(*) queries, and because the aggregation method on each Value column is fixed, semantic correctness must be considered when performing other types of aggregation queries.
  2. In data analysis, many scenarios require statistics and summaries over data: for example, analyzing website and app traffic to count total visit duration and total visit count, or providing advertisers with metrics such as total ad clicks, total impressions, and consumption statistics. In scenarios like these, where detailed data does not need to be kept and only data aggregated by fixed dimensions and indicators is required, the aggregation model is usually a good fit.
  3. In partial column update scenarios, if you need both a unique primary key constraint and partial column updates (for example, when importing multiple source tables into one Doris table), you can use the Aggregate model and set the aggregation type of the non-primary-key columns to REPLACE_IF_NOT_NULL. The Unique Key model has supported partial column updates since Doris 2.0, but without row storage enabled, partial column updates on large tables can consume a lot of CPU, so if you have no special requirements the aggregation model is a viable alternative; the Unique Key model is in fact a special case of the aggregation model. A sketch of this pattern follows this list.
  4. Detailed data is not preserved in the aggregation model. If you need to keep detailed data, the aggregation model is not suitable; choose the Unique Key or Duplicate Key model instead.
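As a minimal sketch of the REPLACE_IF_NOT_NULL pattern from point 3 (the table and column names are hypothetical), two source systems can each fill their own columns of the same logical row without wiping out each other's values:

-- Hypothetical table fed by two source systems; order_id acts as the unique key.
CREATE TABLE example_partial_update
(
    `order_id` BIGINT NOT NULL COMMENT "order id",
    `amount` DECIMAL(10,2) REPLACE_IF_NOT_NULL COMMENT "filled by the billing system",
    `status` VARCHAR(20) REPLACE_IF_NOT_NULL COMMENT "filled by the logistics system"
)
AGGREGATE KEY(`order_id`)
DISTRIBUTED BY HASH(`order_id`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");

-- The billing system only knows the amount.
insert into example_partial_update values (1, 99.90, null);
-- The logistics system only knows the status; its null amount does not
-- overwrite the 99.90 imported above. The table now holds (1, 99.90, "shipped").
insert into example_partial_update values (1, null, "shipped");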

Aggregation model limitations

In the aggregation model, what the model exposes externally is the finally aggregated data; any data that has not yet been aggregated (for example, data from two different import batches) must still be presented in a consistent way. In particular, a naive count on the aggregation model would be inaccurate, and getting the semantically correct count(*) requires Doris to aggregate data at query time, which is expensive. How can we solve this?

The first method: add a count column whose value is always 1 on every imported row, with SUM as its aggregation type. Then select count(*) from table; is equivalent to select sum(count) from table;, and the latter is far more efficient. This method has a usage restriction, however: users must ensure that they do not re-import rows with the same AGGREGATE KEY columns. Otherwise select sum(count) from table; reflects only the number of originally imported rows, not the semantics of select count(*) from table;.

The second method: change the aggregation type of the count column above to REPLACE, keeping the value 1 on every row. Then select sum(count) from table; and select count(*) from table; return consistent results, and there is no restriction on importing duplicate rows.
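Here is a minimal sketch of the first method, applied to a trimmed-down version of our example table (the table name and extra column are hypothetical):

-- Hypothetical table with a counter column; always import 1 into `count`.
CREATE TABLE example_tbl_03
(
    `user_id` LARGEINT NOT NULL COMMENT "user id",
    `date` DATE NOT NULL COMMENT "data import date",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "total user consumption",
    `count` BIGINT SUM DEFAULT "1" COMMENT "row counter, always 1 on import"
)
AGGREGATE KEY(`user_id`, `date`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");

-- After aggregation, sum(`count`) equals the number of imported detail rows,
-- which matches count(*) only as long as rows with the same Key are never re-imported.
select sum(`count`) from example_tbl_03;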