Optimizing ClickHouse Star Schema Query Performance

Most data environments contain two kinds of data objects: event-based objects and entity-based objects (which carry attributes or characteristics). The former are usually organized as time-series tables; the latter as ID-keyed, row-oriented tables. This layout is known as a normalized star schema: it is storage-efficient but comparatively slow to read. Denormalizing the structure where necessary improves read performance.

Star Schema

Let’s explore how ClickHouse queries star schema data with a JOIN. Suppose there are two tables: log (time-series events) and users (entities with attributes):

The log event table references the users table through user_id. The log table holds 25 million rows, with the following structure:

CREATE TABLE log (dt DateTime, user_id UInt64, session_id UInt64, platform String, label String, url String)
ENGINE = MergeTree ORDER BY (dt, user_id, session_id)

Note that the entity table typically needs to be modified over time (rows are added, deleted, or updated), while the event table is static history (data is only appended and never changes). For simplicity, the users table is structured as follows:

CREATE TABLE users
(`id` UInt64, `name` String, `category` String, `age` UInt8 )
ENGINE = MergeTree ORDER BY (category, id)
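
Since entity data changes over time, keep in mind that updates to a MergeTree table go through mutations, which are heavyweight background operations. A minimal sketch (the values are arbitrary):

ALTER TABLE users UPDATE age = 30 WHERE id = 42;
-- mutations run asynchronously; progress can be checked in system.mutations
SELECT is_done FROM system.mutations WHERE table = 'users' ORDER BY create_time DESC LIMIT 1;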

Insert demo data:

insert into users
select number,
       randomPrintableASCII(5) as name,
       array('aa','bb','cc')[number % 3 + 1] as category,
       number % 60 + 18 as age
from numbers(2500000);
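
A quick sanity check on the generated data; with the expressions above there should be three categories of roughly 833 thousand users each, with ages in the 18 to 77 range:

SELECT category, count() AS users, min(age) AS min_age, max(age) AS max_age
FROM users
GROUP BY category
ORDER BY category;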

Run the following insert into the log table 10 times in a row to generate 25 million rows of test data:

insert into log
select now() - number as dt,
       number as user_id,
       number + 1 as session_id,
       array('win','mac','linux')[number % 3 + 1] as platform,
       randomPrintableASCII(5) as label,
       randomPrintableASCII(10) as url
from numbers(2500000);

select count(*) from log;

Star join

Let’s try a practical example: counting events per day for users of a given category and age range:

SELECT date(dt) AS day, count(*) FROM log l
JOIN users u ON u.id = l.user_id
WHERE
  u.category = 'aa' AND u.age > 50 AND dt > NOW() - INTERVAL 1 YEAR
GROUP BY day ORDER BY day DESC LIMIT 5

The log table has 25 million rows and the users table has 2.5 million. The query results are as follows:

┌────────day─┬─count()─┐
│ 2023-04-28 │   86546 │
│ 2023-04-27 │  129600 │
│ 2023-04-26 │  129600 │
│ 2023-04-25 │  129600 │
│ 2023-04-24 │  129600 │
└────────────┴─────────┘

5 rows in set. Elapsed: 1.564 sec. Processed 27.50 million rows, 350.00 MB (17.59 million rows/s., 223.83 MB/s.)

You can examine how the query executes with EXPLAIN. By default ClickHouse uses a hash join: the right-hand users table is read in full to build an in-memory hash table, and log is then streamed through it:

EXPLAIN
SELECT date(dt) AS day, count(*) FROM log l
JOIN users u ON u.id = l.user_id
WHERE
  u.category = 'aa' AND u.age > 50 AND dt > NOW() - INTERVAL 1 YEAR
GROUP BY day ORDER BY day DESC LIMIT 5

Using IN instead of JOIN

In most scenarios you can replace the JOIN with an IN subquery, but make sure the subquery does not return too many rows: the set it produces must fit in ClickHouse's available memory. In our example, filtering users returns about 15% of the table:

SELECT count(*)
FROM users
WHERE (category = 'aa') AND (age > 50)

Query id: f54846e0-65c3-44d2-8c05-1927adb28ef4

┌─count()─┐
│  374997 │
└─────────┘

1 row in set. Elapsed: 0.014 sec. Processed 835.58 thousand rows, 10.03 MB (60.45 million rows/s., 725.45 MB/s.)
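
If you are not sure how large the IN set will get, ClickHouse can enforce a hard cap on it through the max_rows_in_set and set_overflow_mode settings; the threshold below is illustrative:

SELECT count(*)
FROM log
WHERE user_id IN (
    SELECT id FROM users WHERE category = 'aa' AND age > 50
)
-- fail fast instead of exhausting memory if the set grows beyond 10M rows
SETTINGS max_rows_in_set = 10000000, set_overflow_mode = 'throw'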

The following query uses IN instead of JOIN:

SELECT
    date(dt) AS day,
    count(*)
FROM log
WHERE (user_id IN (
    SELECT id
    FROM users
    WHERE (category = 'aa') AND (age > 50)
)) AND (dt > (NOW() - toIntervalYear(1)))
GROUP BY day
ORDER BY day DESC
LIMIT 5

Query id: fb871262-d196-4043-9acb-65476a1ee8be

┌────────day─┬─count()─┐
│ 2023-04-28 │   86546 │
│ 2023-04-27 │  129600 │
│ 2023-04-26 │  129600 │
│ 2023-04-25 │  129600 │
│ 2023-04-24 │  129600 │
└────────────┴─────────┘

5 rows in set. Elapsed: 0.166 sec. Processed 25.84 million rows, 316.71 MB (155.68 million rows/s., 1.91 GB/s.)

Both the amount of data processed and the execution time improve significantly. Although this example works well, IN is not guaranteed to give a large speedup in every case; when it does not, consider denormalization.

Denormalized data

Denormalization means materializing the joined tables as a single wide table. The following creates and populates such a table:

CREATE TABLE log_users (
  `dt` DateTime, `user_id` UInt64, `session_id` UInt64,
  `platform` String, `label` String, `url` String,
  `user_name` String, `user_category` String, `user_age` UInt8
)
ENGINE=MergeTree
ORDER BY (dt, user_id, session_id, user_category, platform, label)

As you can see, the sorting key now contains more columns to cover more query patterns. Populate the table:

INSERT INTO log_users
SELECT l.*, u.name, u.category, u.age
FROM log l JOIN users u ON (l.user_id = u.id)
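
Since every log row has a matching user (both tables were generated from the same numbers range), the denormalized table should hold the same 25 million rows as log:

SELECT count(*) FROM log_users;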

Execute the equivalent query against the denormalized table; performance improves further:

SELECT
    date(dt) AS day,
    count(*)
FROM log_users
WHERE (user_category = 'aa') AND (user_age > 50) AND (dt > (now() - toIntervalMonth(1)))
GROUP BY day
ORDER BY day DESC
LIMIT 5

Query id: 2b7575bc-1244-4042-907f-4f6ee8929ade

┌────────day─┬─count()─┐
│ 2023-04-28 │   86546 │
│ 2023-04-27 │  129600 │
│ 2023-04-26 │  129600 │
│ 2023-04-25 │  129600 │
│ 2023-04-24 │  129600 │
└────────────┴─────────┘

5 rows in set. Elapsed: 0.039 sec. Processed 5.25 million rows, 83.99 MB (134.61 million rows/s., 2.15 GB/s.)

Clearly, putting all fields in a single table is a powerful optimization tool: it opens up better tuning of sorting keys and projections. But don't forget that it costs more disk space.
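
For example, a projection can maintain an alternative sort order inside the same table, so queries filtering on user_category no longer depend on the main sorting key. A minimal sketch; the projection name and column order are illustrative:

ALTER TABLE log_users ADD PROJECTION by_user_category
(
    SELECT * ORDER BY user_category, user_age, dt
);
-- rewrite existing parts so the projection also covers old data
ALTER TABLE log_users MATERIALIZE PROJECTION by_user_category;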

Real-time denormalization

With the previous approach you must wait for the one-off backfill to finish. A materialized view can populate the denormalized data in real time instead:

CREATE MATERIALIZED VIEW log_users_rt (
  `dt` DateTime, `user_id` UInt64, `session_id` UInt64,
  `platform` String, `label` String, `url` String,
  `user_name` String, `user_category` String, `user_age` UInt8
)
ENGINE = MergeTree ORDER BY (dt, user_id, session_id) AS
SELECT l.*, u.name AS user_name, u.category AS user_category,
       u.age AS user_age
FROM log AS l INNER JOIN users AS u ON l.user_id = u.id

This defines a materialized view that makes ClickHouse join every block of rows inserted into log with the users table and write the result into log_users_rt. Note that the view only captures rows inserted after it is created; existing log data is not backfilled. Also, every INSERT into log now triggers a join against users, which hurts insert performance:

insert into log values(now(), 12,15,'aa', 'aa', 'http://baidu.com'),(now(), 12,15, 'bb', 'bb', 'http://sina.com')

INSERT INTO log FORMAT Values

Query id: 8c3c8034-7c9e-460d-9f32-887754aea6ea

Ok.

2 rows in set. Elapsed: 0.533 sec.

Inserts now take longer, so it is best to insert in large batches, at longer intervals.
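
If the client cannot batch on its own, recent ClickHouse versions can buffer small inserts on the server side through asynchronous inserts; a sketch, assuming a version with async_insert support:

INSERT INTO log SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES (now(), 13, 16, 'linux', 'cc', 'http://example.com')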

Using MySQL external tables

Even better, the IN subquery can read from a MySQL table engine, so you can manage the entity data with MySQL features such as indexes, updates, and deletes.

Assuming a users table with the same structure already exists in the MySQL database db, create the MySQL engine table in ClickHouse as follows:

CREATE TABLE users_mysql
(`id` UInt64, `name` String, `category` String, `age` UInt8 )
ENGINE = MySQL('127.0.0.1:3306', 'db', 'users', 'usr', 'pwd')
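
A quick check that the external table is reachable; note that every read of users_mysql goes over the wire to MySQL:

SELECT count(*) FROM users_mysql;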

Now execute the query using the MySQL external table:

SELECT date(dt) AS day, count(*) FROM log
WHERE (user_id IN (
  SELECT id FROM users_mysql WHERE category = 'aa' AND age > 50
)) AND (dt > (NOW() - toIntervalYear(1)))
GROUP BY day ORDER BY day DESC LIMIT 5

Note that the nested query is executed inside MySQL, so it is best to add a matching index there to improve performance:

CREATE INDEX c_a ON users(category, age)

It is better not to JOIN against the MySQL table, because ClickHouse would first have to load all of its data from MySQL; the external table works best for filtering only.

Summary

IN queries perform better than JOINs in most ClickHouse scenarios, and denormalization can improve performance further. Materialized views provide real-time denormalization at insert time, at the cost of extra disk space and slower inserts.