A quick guide to ClickHouse table engines

The table engine plays a central role in ClickHouse: it directly determines how data is stored and read, whether concurrent reads and writes are supported, whether indexes are supported, which query types are supported, whether replication is available, and so on.

ClickHouse provides four families of table engines, each suited to different scenarios. For example, the Log family is used for analyzing small tables, the MergeTree family for analyzing large tables, and the integration family for data integration. Because so many engine types are hard for newcomers to understand and choose between, this article sorts through the main engines to deepen understanding of them. I hope it helps. The replicated and distributed table engines are more involved, and we will cover them in a future article.

ClickHouse table engine overview

The following figure shows all the table engines listed so far in the official documentation:

Figure 1

Log series engines

The Log series table engines are relatively simple. They are mainly used to quickly write small volumes of data (under about 1 million rows) and then read it all back. The engines in the Log family share several characteristics:

  • Data is written sequentially to disk
  • Does not support updating and deleting data
  • Indexes are not supported
  • Atomic writes are not supported
  • Inserts block concurrent queries

The differences between them are:

  • TinyLog does not support concurrent reads of data files, so query performance is poor; its format is simple, which makes it suitable for temporarily storing intermediate data.
  • StripeLog supports concurrent reads of data files and thus queries faster than TinyLog; it stores all columns in a single file, so it creates fewer files than TinyLog.
  • Log supports concurrent reads of data files and thus queries faster than TinyLog; each column is stored in a separate file.
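A minimal sketch of using a Log-family table (the table and column names here are illustrative, not from the text above):

```sql
-- A throwaway Log-engine table for intermediate results
CREATE TABLE log_example
(
    id UInt32,
    msg String
) ENGINE = Log;

INSERT INTO log_example VALUES (1, 'first'), (2, 'second');

-- No indexes exist, so every query scans the whole table;
-- rows come back in insertion order.
SELECT * FROM log_example;
```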

Integration series engines

This series of engines is mainly used to import external data into ClickHouse, or to operate on external data sources directly from ClickHouse:

  • Kafka: consumes data from a Kafka topic directly into ClickHouse
  • MySQL: uses MySQL as the storage engine, so you can operate on MySQL tables from within ClickHouse
  • JDBC/ODBC: reads data from sources specified by JDBC and ODBC connection strings
  • HDFS: reads data files of a specified format directly from HDFS
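For instance, the MySQL engine maps an existing MySQL table into ClickHouse; a sketch, where the host, database, table, and credentials are placeholders:

```sql
-- Column types must be compatible with the MySQL table's schema;
-- the connection details below are illustrative placeholders.
CREATE TABLE mysql_orders
(
    id UInt32,
    amount Float64
) ENGINE = MySQL('mysql-host:3306', 'shop', 'orders', 'user', 'password');

-- SELECTs are forwarded to MySQL; INSERTs write through to it as well
SELECT count(*) FROM mysql_orders;
```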

Special series engines

  • Memory: stores data in RAM; data is lost on restart. Query performance is excellent, which makes it suitable for small tables (under about 1 million rows) that do not need to be persisted. ClickHouse also uses it internally for temporary tables.
  • Buffer: sets up an in-memory buffer in front of a target table; once the buffer reaches the configured thresholds, the data is flushed to disk.
  • File: stores data directly in local files.
  • Null: writes are discarded and reads return no data. Typically used as a pipeline in combination with materialized views.
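The Buffer engine, for example, is declared in front of an existing target table; a sketch with illustrative table names and threshold values:

```sql
-- Target table on disk
CREATE TABLE metrics
(
    ts DateTime,
    value UInt32
) ENGINE = MergeTree()
ORDER BY ts;

-- In-memory buffer: data is flushed to default.metrics once the
-- time/row/byte thresholds are crossed (parameters: database, table,
-- num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
CREATE TABLE metrics_buffer AS metrics
ENGINE = Buffer(default, metrics, 16, 10, 100, 10000, 1000000, 10000000, 100000000);

-- Write to the buffer; reads see buffered plus already-flushed data
INSERT INTO metrics_buffer VALUES (now(), 42);
SELECT * FROM metrics_buffer;
```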

MergeTree series engines

The engines described above serve specific purposes, and their usage scenarios are limited. The MergeTree series is the official primary storage engine family and supports all of ClickHouse's core functionality.

The following sections focus on the MergeTree, ReplacingMergeTree, CollapsingMergeTree, VersionedCollapsingMergeTree, SummingMergeTree, and AggregatingMergeTree engines.

MergeTree

The MergeTree table engine is mainly used for analyzing massive data. It supports data partitioning, ordered storage, primary key indexing, sparse indexes, data TTL, and more. MergeTree supports all of ClickHouse's SQL syntax, but some behavior differs from MySQL: for example, the primary key does not enforce uniqueness.

The following example creates a MergeTree table test_tbl with (id, create_time) as the primary key. Rows are stored sorted by the primary key, the data is partitioned by create_time, and only the last month of data is kept.

CREATE TABLE test_tbl (
  id UInt16,
  create_time Date,
  comment Nullable(String)
) ENGINE = MergeTree()
PARTITION BY create_time
ORDER BY (id, create_time)
PRIMARY KEY (id, create_time)
TTL create_time + INTERVAL 1 MONTH
SETTINGS index_granularity=8192;

Now write some sample data, deliberately including rows with duplicate primary keys:

insert into test_tbl values(0, '2023-03-12', null);
insert into test_tbl values(0, '2023-03-12', null);
insert into test_tbl values(1, '2023-03-13', null);
insert into test_tbl values(1, '2023-03-13', null);
insert into test_tbl values(2, '2023-03-14', null);

Query the data: although there are only 3 distinct primary key values, the query returns 5 rows.

SELECT count(*) FROM test_tbl

┌─count()─┐
│       5 │
└─────────┘
SELECT * FROM test_tbl

┌─id─┬─create_time─┬─comment─┐
│  2 │  2023-03-14 │ ᴺᵁᴸᴸ    │
└────┴─────────────┴─────────┘
┌─id─┬─create_time─┬─comment─┐
│  1 │  2023-03-13 │ ᴺᵁᴸᴸ    │
│  1 │  2023-03-13 │ ᴺᵁᴸᴸ    │
└────┴─────────────┴─────────┘
┌─id─┬─create_time─┬─comment─┐
│  0 │  2023-03-12 │ ᴺᵁᴸᴸ    │
└────┴─────────────┴─────────┘
┌─id─┬─create_time─┬─comment─┐
│  0 │  2023-03-12 │ ᴺᵁᴸᴸ    │
└────┴─────────────┴─────────┘

Because MergeTree uses an LSM-tree-like structure, much of the storage-layer processing happens only during background compaction (merges). Even after forcing a background compaction and querying again, there are still 5 rows; but because we defined partitions, the data has been reorganized.

optimize table test_tbl final;

SELECT * FROM test_tbl

┌─id─┬─create_time─┬─comment─┐
│  2 │  2023-03-14 │ ᴺᵁᴸᴸ    │
└────┴─────────────┴─────────┘
┌─id─┬─create_time─┬─comment─┐
│  1 │  2023-03-13 │ ᴺᵁᴸᴸ    │
│  1 │  2023-03-13 │ ᴺᵁᴸᴸ    │
└────┴─────────────┴─────────┘
┌─id─┬─create_time─┬─comment─┐
│  0 │  2023-03-12 │ ᴺᵁᴸᴸ    │
│  0 │  2023-03-12 │ ᴺᵁᴸᴸ    │
└────┴─────────────┴─────────┘

As the example shows, although MergeTree has a primary key, it is mainly used to speed up queries (via the sparse index); unlike MySQL, it does not guarantee record uniqueness. Even after compaction, rows with the same primary key are merely stored next to each other.

ReplacingMergeTree

To solve MergeTree's duplicate primary key problem, ClickHouse provides the ReplacingMergeTree engine, which deduplicates records. See the example:

-- Table building
CREATE TABLE test_tbl_replacing (
  id UInt16,
  create_time Date,
  comment Nullable(String)
) ENGINE = ReplacingMergeTree()
PARTITION BY create_time
ORDER BY (id, create_time)
PRIMARY KEY (id, create_time)
TTL create_time + INTERVAL 1 MONTH
SETTINGS index_granularity=8192;

-- Write rows with duplicate primary keys
insert into test_tbl_replacing values(0, '2023-03-12', null);
insert into test_tbl_replacing values(0, '2023-03-12', null);
insert into test_tbl_replacing values(1, '2023-03-13', null);
insert into test_tbl_replacing values(1, '2023-03-13', null);
insert into test_tbl_replacing values(2, '2023-03-14', null);
SELECT *
FROM test_tbl_replacing


┌─id─┬─create_time─┬─comment─┐
│  2 │  2023-03-14 │ ᴺᵁᴸᴸ    │
└────┴─────────────┴─────────┘
┌─id─┬─create_time─┬─comment─┐
│  0 │  2023-03-12 │ ᴺᵁᴸᴸ    │
└────┴─────────────┴─────────┘
┌─id─┬─create_time─┬─comment─┐
│  1 │  2023-03-13 │ ᴺᵁᴸᴸ    │
└────┴─────────────┴─────────┘
┌─id─┬─create_time─┬─comment─┐
│  0 │  2023-03-12 │ ᴺᵁᴸᴸ    │
└────┴─────────────┴─────────┘

-- Force background compaction:
optimize table test_tbl_replacing final;

SELECT *
FROM test_tbl_replacing


┌─id─┬─create_time─┬─comment─┐
│  2 │  2023-03-14 │ ᴺᵁᴸᴸ    │
└────┴─────────────┴─────────┘
┌─id─┬─create_time─┬─comment─┐
│  1 │  2023-03-13 │ ᴺᵁᴸᴸ    │
└────┴─────────────┴─────────┘
┌─id─┬─create_time─┬─comment─┐
│  0 │  2023-03-12 │ ᴺᵁᴸᴸ    │
└────┴─────────────┴─────────┘

Although ReplacingMergeTree provides deduplication, it still has limitations:

  • Before a full OPTIMIZE, deduplication by primary key is not guaranteed: some parts may already be deduplicated while others are not;
  • In a distributed setup, rows with the same primary key may land on shards on different nodes, and rows on different shards cannot be deduplicated against each other;
  • Background merges run at an unpredictable time, so you cannot know when deduplication will happen;
  • Running OPTIMIZE manually on large datasets takes a long time and cannot meet real-time requirements.

The ReplacingMergeTree engine is therefore best suited to scenarios where eventual deduplication is sufficient; during queries, rows with duplicate primary keys may still appear.
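When duplicates must be eliminated at query time, ClickHouse's FINAL modifier can be used to apply the engine's merge logic while the query runs, at extra query cost. A sketch against the table above:

```sql
-- FINAL applies ReplacingMergeTree's merge-time deduplication at
-- query time, so rows duplicated by sorting key are collapsed in
-- the result even before background merges have run.
-- Note: FINAL adds CPU and I/O overhead to the query.
SELECT * FROM test_tbl_replacing FINAL;
```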

CollapsingMergeTree

To work around the limitations of ReplacingMergeTree, ClickHouse implements the CollapsingMergeTree ("collapsing merge tree") engine. The engine requires a flag column, Sign, specified at table creation. During background compaction, rows with the same primary key and opposite Sign values are collapsed, i.e. deleted.

CollapsingMergeTree divides rows into two categories by Sign: rows with Sign = 1 are state rows, and rows with Sign = -1 are cancel rows. To write a record you insert a new state row; conversely, to delete a record you insert a matching cancel row.

State rows and cancel rows are collapsed (removed) automatically during background compaction. Until compaction runs, both the state row and the cancel row exist. To make rows with the same primary key actually collapse, the application layer therefore needs supporting logic:

  1. To delete a record, a cancel row must be written. The cancel row must contain the same data as the original state row (except for the Sign column), so the application layer must either keep track of the original state row's values or query the database for them before deleting.
  2. Because the timing of background compaction cannot be predicted, a query may run before the state and cancel rows have been collapsed; moreover, ClickHouse cannot guarantee that rows with the same primary key land on the same node, and rows on different nodes cannot be collapsed against each other. As a result, aggregates such as count() and sum(col) can return inflated results. To get correct results, the application must rewrite its SQL: count() becomes sum(Sign) and sum(col) becomes sum(col * Sign). Let's illustrate with an example:
-- Table building
CREATE TABLE UAct
(
    UserID UInt64,
    PageViews UInt8,
    Duration UInt8,
    Sign Int8
)
ENGINE = CollapsingMergeTree(Sign)
ORDER BY UserID;

-- Insert a state row; note that the Sign column is 1
INSERT INTO UAct VALUES (4324182021466249494, 5, 146, 1);

-- Insert a cancel row to offset the state row above. Note that the Sign column is -1 and the other values match the state row;
-- also insert a new state row with the same primary key, updating PageViews from 5 to 6 and Duration from 146 to 185.
INSERT INTO UAct VALUES (4324182021466249494, 5, 146, -1), (4324182021466249494, 6, 185, 1);


SELECT * FROM UAct


┌──────────────UserID─┬─PageViews─┬─Duration─┬─Sign─┐
│ 4324182021466249494 │         5 │      146 │   -1 │
│ 4324182021466249494 │         6 │      185 │    1 │
└─────────────────────┴───────────┴──────────┴──────┘
┌──────────────UserID─┬─PageViews─┬─Duration─┬─Sign─┐
│ 4324182021466249494 │         5 │      146 │    1 │
└─────────────────────┴───────────┴──────────┴──────┘


-- In order to get the correct sum result, we use the following SQL:


SELECT
    UserID,
    sum(PageViews * Sign) AS PageViews,
    sum(Duration * Sign) AS Duration
FROM UAct
GROUP BY UserID
HAVING sum(Sign) > 0


┌──────────────UserID─┬─PageViews─┬─Duration─┐
│ 4324182021466249494 │         6 │      185 │
└─────────────────────┴───────────┴──────────┘


-- Force background compaction
optimize table UAct final;

SELECT * FROM UAct

┌──────────────UserID─┬─PageViews─┬─Duration─┬─Sign─┐
│ 4324182021466249494 │         6 │      185 │    1 │
└─────────────────────┴───────────┴──────────┴──────┘

Although CollapsingMergeTree makes it possible to delete rows with the same primary key on the fly, when state changes continuously and multiple threads write in parallel, state rows and cancel rows can arrive out of order, which prevents them from collapsing properly.

The following example shows out-of-order records that cannot be collapsed:

-- Table building
CREATE TABLE UAct_order
(
    UserID UInt64,
    PageViews UInt8,
    Duration UInt8,
    Sign Int8
)
ENGINE = CollapsingMergeTree(Sign)
ORDER BY UserID;

-- Insert the cancel row first
INSERT INTO UAct_order VALUES (4324182021466249495, 5, 146, -1);
-- Then insert the state row
INSERT INTO UAct_order VALUES (4324182021466249495, 5, 146, 1);

-- Force compaction
optimize table UAct_order final;

-- Even after compaction, the rows cannot be collapsed: both rows still exist.
select * from UAct_order;
┌──────────────UserID─┬─PageViews─┬─Duration─┬─Sign─┐
│ 4324182021466249495 │         5 │      146 │   -1 │
│ 4324182021466249495 │         5 │      146 │    1 │
└─────────────────────┴───────────┴──────────┴──────┘

VersionedCollapsingMergeTree

To solve CollapsingMergeTree's inability to collapse out-of-order rows, the VersionedCollapsingMergeTree engine adds a Version column at table creation, which records the correspondence between state rows and cancel rows when writes arrive out of order. Rows with the same primary key, the same Version, and opposite Sign are deleted during compaction.

As with CollapsingMergeTree, to obtain correct results the application still needs to rewrite its SQL, changing count() to sum(Sign) and sum(col) to sum(col * Sign). See the example:

-- Table building
CREATE TABLE UAct_version
(
    UserID UInt64,
    PageViews UInt8,
    Duration UInt8,
    Sign Int8,
    Version UInt8
)
ENGINE = VersionedCollapsingMergeTree(Sign, Version)
ORDER BY UserID;


-- Insert the cancel row first; note Sign = -1, Version = 1
INSERT INTO UAct_version VALUES (4324182021466249494, 5, 146, -1, 1);

-- Then insert a state row with Sign = 1, Version = 1;
-- and a new state row with Sign = 1, Version = 2, updating PageViews from 5 to 6 and Duration from 146 to 185.
INSERT INTO UAct_version VALUES (4324182021466249494, 5, 146, 1, 1),(4324182021466249494, 6, 185, 1, 2);


-- Query before compaction: all rows are visible.
SELECT * FROM UAct_version;
┌──────────────UserID─┬─PageViews─┬─Duration─┬─Sign─┬─Version─┐
│ 4324182021466249494 │         5 │      146 │   -1 │       1 │
│ 4324182021466249494 │         6 │      185 │    1 │       2 │
└─────────────────────┴───────────┴──────────┴──────┴─────────┘
┌──────────────UserID─┬─PageViews─┬─Duration─┬─Sign─┬─Version─┐
│ 4324182021466249494 │         5 │      146 │    1 │       1 │
└─────────────────────┴───────────┴──────────┴──────┴─────────┘


-- In order to get the correct result, rewrite the following SQL:
-- sum(PageViews) => sum(PageViews * Sign),
-- sum(Duration) => sum(Duration * Sign)
SELECT
    UserID,
    sum(PageViews * Sign) AS PageViews,
    sum(Duration * Sign) AS Duration
FROM UAct_version
GROUP BY UserID
HAVING sum(Sign) > 0;
┌──────────────UserID─┬─PageViews─┬─Duration─┐
│ 4324182021466249494 │         6 │      185 │
└─────────────────────┴───────────┴──────────┘


-- Force background compaction
optimize table UAct_version final;


-- After compaction, the correct result is obtained even though the writes were out of order.
select * from UAct_version;
┌──────────────UserID─┬─PageViews─┬─Duration─┬─Sign─┬─Version─┐
│ 4324182021466249494 │         6 │      185 │    1 │       2 │
└─────────────────────┴───────────┴──────────┴──────┴─────────┘

SummingMergeTree

ClickHouse's SummingMergeTree engine pre-aggregates (sums) rows by primary key. During background compaction, multiple rows with the same primary key are aggregated and replaced with a single row, which both reduces storage and speeds up aggregate queries. Three points deserve attention:

  • ClickHouse only sums rows by primary key during background compaction, whose timing is unpredictable, so at any moment some data may already be aggregated while other data is not. Aggregate queries must therefore still include a GROUP BY clause.
  • During pre-aggregation, ClickHouse sums every column except the primary key columns: columns that can be summed (e.g. numeric types) are summed directly; those that cannot (e.g. strings) get an arbitrarily chosen value.
  • It is generally recommended to use SummingMergeTree together with MergeTree: the MergeTree table stores the detailed data, while the SummingMergeTree table stores pre-aggregated results to speed up queries.

See example:

-- Create the table
CREATE TABLE summtt
(
    key UInt32,
    value UInt32
)
ENGINE = SummingMergeTree()
ORDER BY key;

-- Insert data
INSERT INTO summtt VALUES (1,1),(1,2),(2,1);

-- Before compaction, rows with the same primary key coexist
select * from summtt;
┌─key─┬─value─┐
│   1 │     1 │
│   1 │     2 │
│   2 │     1 │
└─────┴───────┘

-- Use GROUP BY to perform aggregate calculations
SELECT key, sum(value) FROM summtt GROUP BY key
┌─key─┬─sum(value)─┐
│   2 │          1 │
│   1 │          3 │
└─────┴────────────┘

-- Force compaction
optimize table summtt final;

-- Query after compaction: rows with the same primary key have been aggregated
select * from summtt;
┌─key─┬─value─┐
│   1 │     3 │
│   2 │     1 │
└─────┴───────┘


-- In practice, a grouped query is still required
SELECT key, sum(value) FROM summtt GROUP BY key
┌─key─┬─sum(value)─┐
│   2 │          1 │
│   1 │          3 │
└─────┴────────────┘
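The combination recommended earlier (a MergeTree table for detail rows plus a SummingMergeTree rollup) is often wired together with a materialized view. A minimal sketch, with illustrative table names:

```sql
-- Detail table stores the raw rows
CREATE TABLE events
(
    key UInt32,
    value UInt32
) ENGINE = MergeTree()
ORDER BY key;

-- The materialized view maintains a pre-aggregated rollup on every insert
CREATE MATERIALIZED VIEW events_rollup
ENGINE = SummingMergeTree()
ORDER BY key
AS SELECT key, sum(value) AS value
FROM events
GROUP BY key;

INSERT INTO events VALUES (1, 1), (1, 2), (2, 1);

-- GROUP BY is still required when reading the rollup,
-- because unmerged parts may hold partial sums
SELECT key, sum(value) AS value FROM events_rollup GROUP BY key;
```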

AggregatingMergeTree

AggregatingMergeTree is another pre-aggregation engine used to speed up aggregate queries. The difference from SummingMergeTree is that SummingMergeTree sums the non-primary-key columns, while AggregatingMergeTree lets you specify a different aggregate function per column.

AggregatingMergeTree's syntax is somewhat more complex and is used together with materialized views or the AggregateFunction column type. On insert you must use the -State combinator forms of the aggregate functions, and on select the -Merge combinator forms.

See example 1, using a materialized view:

-- Detail table
CREATE TABLE visits
(
    UserID UInt64,
    CounterID UInt8,
    StartDate Date,
    Sign Int8
)
ENGINE = CollapsingMergeTree(Sign)
ORDER BY UserID;

-- Create a materialized view over the detail table that pre-aggregates it.
-- Note: the pre-aggregation functions are sumState and uniqState, i.e. the write-side <agg>-State combinators.
CREATE MATERIALIZED VIEW visits_agg_view
ENGINE = AggregatingMergeTree() PARTITION BY toYYYYMM(StartDate) ORDER BY (CounterID, StartDate)
AS SELECT
    CounterID,
    StartDate,
    sumState(Sign) AS Visits,
    uniqState(UserID) AS Users
FROM visits
GROUP BY CounterID, StartDate;

-- Insert detail data
INSERT INTO visits VALUES(0, 0, '2019-11-11', 1);
INSERT INTO visits VALUES(1, 1, '2019-11-12', 1);

-- Final aggregation over the materialized view.
-- Note: the aggregate functions are sumMerge and uniqMerge, i.e. the query-side <agg>-Merge combinators.
SELECT
    StartDate,
    sumMerge(Visits) AS Visits,
    uniqMerge(Users) AS Users
FROM visits_agg_view
GROUP BY StartDate
ORDER BY StartDate;

-- The plain sum and uniq functions can no longer be used;
-- the following SQL fails with: Illegal type AggregateFunction(sum, Int8) of argument
SELECT
    StartDate,
    sum(Visits),
    uniq(Users)
FROM visits_agg_view
GROUP BY StartDate
ORDER BY StartDate;

Example 2:

-- Detail table
CREATE TABLE detail_table
(CounterID UInt8,
    StartDate Date,
    UserID UInt64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(StartDate)
ORDER BY (CounterID, StartDate);

-- Insert detail data
INSERT INTO detail_table VALUES(0, '2019-11-11', 1);
INSERT INTO detail_table VALUES(1, '2019-11-12', 1);

-- Create a pre-aggregation table.
-- Note: the UserID column has type AggregateFunction(uniq, UInt64)
CREATE TABLE agg_table
(CounterID UInt8,
    StartDate Date,
    UserID AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(StartDate)
ORDER BY (CounterID, StartDate);

-- Read from the detail table and insert into the aggregate table.
-- Note: the SELECT uses uniqState, i.e. the write-side <agg>-State combinator
INSERT INTO agg_table
SELECT CounterID, StartDate, uniqState(UserID)
FROM detail_table
GROUP BY CounterID, StartDate;

-- A plain INSERT cannot write raw values into an AggregatingMergeTree table;
-- the following SQL fails with: Cannot convert UInt64 to AggregateFunction(uniq, UInt64)
INSERT INTO agg_table VALUES(1, '2019-11-12', 1);

-- Query the aggregate table.
-- Note: the SELECT uses uniqMerge, i.e. the query-side <agg>-Merge combinator
SELECT uniqMerge(UserID) AS state
FROM agg_table
GROUP BY CounterID, StartDate;

Summary

This article gave an overview of ClickHouse table engines, then walked through the MergeTree series engines in detail and verified their behavior with examples. I hope it helps. References: https://programmer.help/blogs/how-to-choose-clickhouse-table-engine.html; official documentation: https://clickhouse.com/docs/en/engines/table-engines#mergetree