Explore ClickHouse – Using Projection to Speed Up Querying

Outline

  • download file
    • Check files
    • Move files
  • Create table
    • View Files
    • Use client to connect to server
    • Create table
    • Import Data
    • Check data
      • Check the number of data rows
      • Check the disk occupied
  • Inquire
    • Add PROJECTION
    • Query after optimization
  • References

Before testing Projection, we need to create a table and import a large amount of data.

We can use the directive directly to get the content from the file pointed to by the URL and import it into the table. But we are worried about network instability, so we download the file first.

Download file

wget wget http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv .

Check files

wc -l pp-complete.csv

28497127pp-complete.csv

ll pp-complete.csv

-rw-rw-r– 1 fangliang fangliang 4982107267 Aug 29 05:13 pp-complete.csv

That is, this file has about 28.5 million lines and occupies more than 4 G disk.

Move files

su root
cp pp-complete.csv /var/lib/clickhouse/user_files/
exit

Create table

View files

Use the following command to view the file contents

head -10 pp-complete.csv
"{<!-- -->F887F88E-7D15-4415-804E-52EAC2F10958}","70000","1995-07-07 00:00","MK15 9HP\ ","D","N","F","31","","ALDRICH DRIVE","WILLEN","MILTON KEYNES" ,"MILTON KEYNES","MILTON KEYNES","A","A"
"{<!-- -->40FD4DF2-5362-407C-92BC-566E2CCE89E9}","44500","1995-02-03 00:00","SR6 0AQ",\ "T","N","F","50","","HOWICK PARK","SUNDERLAND","SUNDERLAND","SUNDERLAND ","TYNE AND WEAR","A","A"
"{<!-- -->7A99F89E-7D81-4E45-ABD5-566E49A045EA}","56500","1995-01-13 00:00","CO6 1SQ",\ "T","N","F","19","","BRICK KILN CLOSE","COGGESHALL","COLCHESTER"," BRAINTREE","ESSEX","A","A"
"{<!-- -->28225260-E61C-4E57-8B56-566E5285B1C1}","58000","1995-07-28 00:00","B90 4TG",\ "T","N","F","37","","RAINSBROOK DRIVE","SHIRLEY","SOLIHULL","SOLIHULL ","WEST MIDLANDS","A","A"
"{<!-- -->444D34D7-9BA6-43A7-B695-4F48980E0176}","51000","1995-06-28 00:00","DY5 1SA",\ "S","N","F","59","","MERRY HILL","BRIERLEY HILL","BRIERLEY HILL",\ "DUDLEY","WEST MIDLANDS","A","A"
"{<!-- -->AE76CAF1-F8CC-43F9-8F63-4F48A2857D41}","17000","1995-03-10 00:00","S65 1QJ",\ "T","N","L","22","","DENMAN STREET","ROTHERHAM","ROTHERHAM","ROTHERHAM ","SOUTH YORKSHIRE","A","A"
"{<!-- -->709FB471-3690-4945-A9D6-4F48CE65AAB6}","58000","1995-04-28 00:00","PE7 3AL",\ "D","Y","F","4","","BROOK LANE","FARCET","PETERBOROUGH","PETERBOROUGH ","CAMBRIDGESHIRE","A","A"
"{<!-- -->5FA8692E-537B-4278-8C67-5A060540506D}","19500","1995-01-27 00:00","SK10 2QW",\ "T","N","L","38","","GARDEN STREET","MACCLESFIELD","MACCLESFIELD","MACCLESFIELD ","CHESHIRE","A","A"
"{<!-- -->E78710AD-ED1A-4B11-AB99-5A0614D519AD}","20000","1995-01-16 00:00","SA6 5AY",\ "D","N","F","592","","CLYDACH ROAD","YNYSTAWE","SWANSEA","SWANSEA ","SWANSEA","A","A"
"{<!-- -->1DFBF83E-53A7-4813-A37C-5A06247A09A8}","137500","1995-03-31 00:00","NR2 2NQ",\ "D","N","F","26","","LIME TREE ROAD","NORWICH","NORWICH"," NORWICH","NORFOLK","A","A"

Use the client to connect to the server

clickhouse-client

Create table

CREATE TABLE uk_price_paid ( price UInt32, date Date, postcode1 LowCardinality(String), postcode2 LowCardinality(String), type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached\ ' = 3, 'flat' = 4, 'other' = 0), is_new UInt8, duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0 ), addr1 String, addr2 String, street LowCardinality(String), locality LowCardinality(String), town LowCardinality(String), district LowCardinality(String), county LowCardinality(String) ) ENGINE = MergeTree ORDER BY (postcode1, postcode2, addr1, addr2);

Import data

INSERT INTO uk_price_paid WITH splitByChar(' ', postcode) AS p SELECT toUInt32(price_string) AS price, parseDateTimeBestEffortUS(time) AS date, p[1] AS postcode1, p[2] AS postcode2, transform(a , ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', ' detached', 'flat', 'other']) AS type, b = 'Y' AS is_new, transform(c, ['F', 'L', 'U '], ['freehold', 'leasehold', 'unknown']) AS duration, addr1, addr2, street, locality, town, district, county FROM file( 'pp-complete.csv ', 'CSV', 'uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String' );


The entire processing speed is about 210 thousand rows/s, 36.5MB/s.

INSERT INTO uk_price_paid WITH splitByChar(’ ‘, postcode) AS p
SELECT
toUInt32(price_string) AS price,
parseDateTimeBestEffortUS(time) AS date,
p[1] AS postcode1,
p[2] AS postcode2,
transform(a, [‘T’, ‘S’, ‘D’, ‘F’, ‘O’], [‘terraced’, ‘semi-detached’, ‘detached’, ‘flat’, ‘other’]) AS type,
b = Y’ AS is_new,
transform(c, [F’, L’, U’], [freehold’, leasehold’, unknown’]) AS duration,
addr1,
addr2,
street,
locality,
town,
district,
county
FROM file(‘pp-complete.csv’, ‘CSV’, ‘uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String’)
Query id: 32a2a670-8417-470d-ab26-6368dd1725e5
Ok.
0 rows in set. Elapsed: 140.063 sec. Processed 28.50 million rows, 4.98 GB (203.46 thousand rows/s., 35.57 MB/s.)

Check data

Check the number of data rows

SELECT count() From uk_price_paid;

SELECT count()
FROM uk_price_paid
Query id: 2d05b3f1-c683-4f2d-bcaf-e05b777eb3f8
┌──count()───┐
│ 28497127 │
└──────────┘
1 row in set. Elapsed: 0.005 sec.

There are 28,497,127 lines of data in total, which is the same as the number of lines in the file.

Check the disk occupied

SELECT formatReadableSize(total_bytes) FROM system.tables WHERE name = 'uk_price_paid';

SELECT formatReadableSize(total_bytes)
FROM system.tables
WHERE name = uk_price_paid’
Query id: 7cca5694-6d15-4f38-8f8d-ef8331a4caa3
┌─formatReadableSize(total_bytes)─┐
│ 308.18 MiB │
└──────────────────────┘
1 row in set. Elapsed: 0.007 sec.

Compared with the previous file size of 4G, it has been reduced by 9/10. This ratio is quite large.

Query

SELECT toYear(date), district, town, avg(price), sum(price), count() FROM uk_price_paid GROUP BY toYear(date), district, town;

80441 rows in set. Elapsed: 2.114 sec. Processed 28.50 million rows, 284.78 MB (13.48 million rows/s., 134.71 MB/s.)

Add PROJECTION

Use the following command to create a PROJECTION for toYear(date), district, town, so that the data inserted later will be automatically optimized.

ALTER TABLE uk_price_paid ADD PROJECTION projection_by_year_district_town(SELECT toYear(date), district, town, avg(price), sum(price), count() GROUP BY toYear(date), district, town);

ALTER TABLE uk_price_paid
ADD PROJECTION projection_by_year_district_town
(
SELECT
toYear(date),
district,
town,
avg(price),
sum(price),
count()
GROUP BY
toYear(date),
district,
town
)
Query id: 3c5ca13e-4805-412c-845a-ab18c411261c
Ok.
0 rows in set. Elapsed: 0.007 sec.

Then use the following command to modify the existing data

ALTER TABLE uk_price_paid MATERIALIZE PROJECTION projection_by_year_district_town SETTINGS mutations_sync = 1;

ALTER TABLE uk_price_paid
MATERIALIZE PROJECTION projection_by_year_district_town
SETTINGS mutations_sync = 1
Query id: 7bd22c05-c74c-4972-be6d-174eaf99c498
Ok.
0 rows in set. Elapsed: 0.183 sec.

Query after optimization

80441 rows in set. Elapsed: 0.170 sec. Processed 92.93 thousand rows, 5.76 MB (548.06 thousand rows/s., 33.98 MB/s.)

You can see that the time is also reduced to 1/10 of the unoptimized time.

Reference materials

  • https://clickhouse.com/docs/en/getting-started/example-datasets/uk-price-paid