Apache Doris is an easy-to-use, high-performance and unified analytics database

Introduction to Doris

https://github.com/apache/doris
Apache Doris is a high-performance, real-time analytical database based on MPP architecture. It is well known for its extremely fast and easy-to-use features. It only needs sub-second response time to return query results under massive data. It can not only support high concurrency The point query scenario can also support complex analysis scenarios with high throughput. Based on this, Apache Doris can better meet usage scenarios such as report analysis, ad hoc query, unified data warehouse construction, data lake federation query acceleration, etc. Users can build user behavior analysis, AB experiment platform, log retrieval analysis, user Portrait analysis, order analysis and other applications.

Apache Doris was first born in the Palo project of Baidu’s advertising report business. It was officially open-sourced in 2017. In July 2018, Baidu donated it to the Apache Foundation for incubation. Afterwards, under the guidance of Apache mentors, members of the incubator project management committee incubated and operate. At present, the Apache Doris community has gathered more than 400 contributors from hundreds of companies in different industries, and the number of active contributors per month exceeds 100. In June 2022, Apache Doris successfully graduated from the Apache Incubator and officially became an Apache Top-Level Project (TLP)

Apache Doris now has a wide range of user groups in China and around the world. Up to now, Apache Doris has been used in the production environment of more than 2,000 companies around the world. Among the top 50 Internet companies in China’s market capitalization or valuation, More than 80% of them have been using Apache Doris for a long time, including Baidu, Meituan, Xiaomi, JD.com, ByteDance, Tencent, Netease, Kuaishou, Weibo, Shell, etc. At the same time, it also has rich applications in some traditional industries such as finance, energy, manufacturing, and telecommunications.

Usage scenario

As shown in the figure below, after various data integration and processing, data sources are usually stored in the real-time data warehouse Doris and offline lake warehouses (in Hive, Iceberg, Hudi). Apache Doris is widely used in the following scenarios.
? Report Analysis
Dashboards
Reports for In-house Analysts and Managers
User or customer-oriented high-concurrency report analysis (Customer Facing Analytics). For example, site analysis for website owners and advertising reports for advertisers usually require tens of thousands of QPS concurrently, and query latency requires millisecond-level response. The well-known e-commerce company JD.com uses Apache Doris in advertising reports, writes 10 billion rows of data every day, queries concurrently with tens of thousands of QPS, and the 99th percentile query delay is 150ms.
? Ad-hoc Query: Self-service analysis for analysts, the query mode is not fixed and requires high throughput. Based on Doris, Xiaomi has built a growth analysis platform (Growing Analytics, GA), which uses user behavior data to analyze business growth. The average query delay is 10s, and the query delay of the 95th percentile is within 30s. The daily SQL query volume is tens of thousands strip.
? Unified data warehouse construction: A platform meets the needs of unified data warehouse construction and simplifies the cumbersome big data software stack. Haidilao’s unified data warehouse based on Doris has replaced the old architecture composed of Spark, Hive, Kudu, Hbase, and Phoenix, and the architecture has been greatly simplified.
? Data lake federated query: Federated analysis of data located in Hive, Iceberg, and Hudi through the appearance method, while avoiding data copying, the query performance is greatly improved.

Technical overview

The overall architecture of Doris is shown in the figure below. The Doris architecture is very simple, with only two types of processes
Frontend (FE) is mainly responsible for the access of user requests, query analysis planning, metadata management, and node management.
Backend (BE) is mainly responsible for data storage and query plan execution.
These two types of processes are scalable horizontally, and a single cluster can support hundreds of machines and tens of petabytes of storage capacity. And these two types of processes ensure high availability of services and high reliability of data through consistency protocols. This highly integrated architecture design greatly reduces the operation and maintenance cost of a distributed system.

In terms of use interface, Doris adopts the MySQL protocol, which is highly compatible with MySQL syntax and supports standard SQL. Users can access Doris through various client tools and support seamless connection with BI tools.
In terms of storage engine, Doris uses columnar storage to encode, compress and read data by column, which can achieve a very high compression ratio and reduce the scanning of a large amount of irrelevant data, so that it can be used more effectively IO and CPU resources.

Doris also supports a richer index structure to reduce data scanning:
? Sorted Compound Key Index, you can specify up to three columns to form a compound sort key. Through this index, data can be cut effectively, so as to better support high-concurrency report scenarios
? Z-order Index: Use the Z-order index to efficiently perform range queries on any combination of fields in the data model
? Min/Max: Efficient filtering of equivalent and range queries of numeric types
? Bloom Filter: Equivalent filtering and pruning of high-cardinality columns is very effective
? Invert Index: Ability to quickly retrieve any field
In terms of storage models, Doris supports a variety of storage models, which are optimized for different scenarios:
? Aggregate Key model: the Value column of the same Key is merged, and the performance is greatly improved through early aggregation
? Unique Key model: the key is unique, the data of the same key is overwritten, and the row-level data update is realized
? Duplicate Key model: detailed data model, satisfying the detailed storage of fact tables
Doris also supports strongly consistent materialized views. The update and selection of materialized views are performed automatically in the system without manual selection by users, thus greatly reducing the cost of materialized view maintenance.

In terms of query engine, Doris adopts the MPP model, which executes in parallel between nodes and within nodes, and also supports distributed Shuffle Join of multiple large tables, so that it can better deal with complex queries.

Doris query engine is a vectorized query engine. All memory structures can be laid out in columns, which can greatly reduce virtual function calls, improve Cache hit rate, and efficiently use SIMD instructions. In the wide table aggregation scenario, the performance is 5-10 times that of the non-vectorized engine.

Doris adopts the Adaptive Query Execution technology, which can dynamically adjust the execution plan according to Runtime Statistics. For example, the Runtime Filter technology can generate a Filter at runtime and push it to the Probe side, and can automatically penetrate the Filter to the Probe The Scan node at the bottom of the side, thereby greatly reducing the amount of Probe data and speeding up the Join performance. Doris’ Runtime Filter supports In/Min/Max/Bloom Filter.
In terms of optimizer, Doris uses an optimization strategy combining CBO and RBO. RBO supports constant folding, subquery rewriting, predicate pushdown, etc. CBO supports Join Reorder. At present, CBO is still in the process of continuous optimization, mainly focusing on more accurate statistical information collection and derivation, and more accurate cost model estimation.

Installation and deployment

Official Documentation: https://doris.apache.org/zh-CN/docs/dev/install/construct-docker/run-docker-cluster

Software environment

software version
Docker 20.0 and above
docker-compose 2.10 and above
Configuration type Hardware information Maximum running cluster size
Minimum Configuration 2C 4G 1FE 1BE
Recommended Configuration 4C 16G 3FE 3BE

Preliminary environment preparation

The following command needs to be executed on the host machine

sysctl -w vm.max_map_count=2000000

Docker Compose

Different platforms need to use different Image images. This article takes the X86_64 platform as an example.

Network Mode Description
There are two network modes applicable to Doris Docker.
1. The HOST mode is suitable for deploying across multiple nodes. This mode is suitable for deploying 1FE 1BE on each node.
2. The subnet bridge mode is suitable for deploying multiple Doris processes on a single node. This mode is suitable for single-node deployment (recommended). If you want to deploy multiple nodes, you need to deploy more components (not recommended).
For ease of presentation, this chapter only demonstrates scripts written in Subnet Bridge mode.

Interface Description
Starting from Apache Doris 1.2.1 Docker Image, the list of process mirroring interfaces is as follows:

process name interface name interface definition interface example
FE| BE\ BROKER FE_SERVERS FE node main information
FE FE_ID FE node ID 1
BE BE_ADDR BE node main information 172.20.80.5:9050
BE NODE_ROLE BE node type computation
BROKER BROKER_ADDR BROKER node main information 172.20.80.6:8000

Note that the above interface must fill in the information, otherwise the process cannot be started.

FE_SERVERS interface rule is: FE_NAME:FE_HOST:FE_EDIT_LOG_PORT[,FE_NAME:FE_HOST:FE_EDIT_LOG_PORT]
FE_ID interface rule is: an integer from 1 to 9, where FE No. 1 is the Master node.
BE_ADDR interface rule is: BE_HOST:BE_HEARTBEAT_SERVICE_PORT
NODE_ROLE interface rules are: computation or empty, where empty or other values indicate that the node type is a mix type
BROKER_ADDR interface rule is: BROKER_HOST:BROKER_IPC_PORT

Script template

Docker run command
Create a subnet bridge

docker network create --driver bridge --subnet=172.20.80.0/24 doris-network

1FE & 1BE Command Template
3FE & amp; 3BE Run command templates are available for download if required.

docker run -itd \
--name=fe\
--env FE_SERVERS="fe1:172.20.80.2:9010" \
--env FE_ID=1 \
-p 8030:8030\
-p 9030:9030\
-v /data/fe/doris-meta:/opt/apache-doris/fe/doris-meta \
-v /data/fe/conf:/opt/apache-doris/fe/conf \
-v /data/fe/log:/opt/apache-doris/fe/log \
--network=doris-network \
--ip=172.20.80.2 \
apache/doris:1.2.1-fe-x86_64

docker run -itd \
--name=be\
--env FE_SERVERS="fe1:172.20.80.2:9010" \
--env BE_ADDR="172.20.80.3:9050" \
-p 8040:8040\
-v /data/be/storage:/opt/apache-doris/be/storage \
-v /data/be/conf:/opt/apache-doris/be/conf \
-v /data/be/log:/opt/apache-doris/be/log \
--network=doris-network \
--ip=172.20.80.3 \
apache/doris:1.2.1-be-x86_64

Docker Compose script
1FE & 1BE template

version: '3'
services:
  docker-fe:
    image: "apache/doris:1.2.1-fe-x86_64"
    container_name: "doris-fe"
    hostname: "fe"
    environment:
      - FE_SERVERS=fe1:172.20.80.2:9010
      - FE_ID=1
    ports:
      -8030:8030
      -9030:9030
    volumes:
      - /data/fe/doris-meta:/opt/apache-doris/fe/doris-meta
      - /data/fe/conf:/opt/apache-doris/fe/conf
      - /data/fe/log:/opt/apache-doris/fe/log
    networks:
      doris_net:
        ipv4_address: 172.20.80.2
  docker-be:
    image: "apache/doris:1.2.1-be-x86_64"
    container_name: "doris-be"
    hostname: "be"
    depends_on:
      - docker-fe
    environment:
      - FE_SERVERS=fe1:172.20.80.2:9010
      - BE_ADDR=172.20.80.3:9050
    ports:
      -8040:8040
    volumes:
      - /data/be/storage:/opt/apache-doris/be/storage
      - /data/be/conf:/opt/apache-doris/be/conf
      - /data/be/script:/docker-entrypoint-initdb.d
      - /data/be/log:/opt/apache-doris/be/log
    networks:
      doris_net:
        ipv4_address: 172.20.80.3
networks:
  doris_net:
    ipam:
      config:
        - subnet: 172.20.80.0/16

3FE & amp; 3BE Docker Compose script templates can be downloaded here if needed.

Deploy Doris Docker

Choose one of the two deployment methods:
1. Execute the docker run command to create a cluster
2. Save the docker-compose.yaml script, and execute the docker-compose up -d command in the same directory to create a cluster
Exceptions
Due to the different ways of implementing containers internally on MacOS, it may not be possible to directly modify the value of max_map_count on the host during deployment. You need to create the following containers first:

docker run -it --privileged --pid=host --name=change_count debian nsenter -t 1 -m -u -n -i sh

The container was created successfully executing the following command:

sysctl -w vm.max_map_count=2000000

Then exit to create a Doris Docker cluster.

View FE running status

You can check whether Doris has started successfully with the following command

curl http://127.0.0.1:8030/api/bootstrap

Here the IP and port are FE’s IP and http_port (8030 by default), if you execute it on the FE node, just run the above command directly.
If the returned result contains the words “msg”:”success”, it means the startup is successful.
You can also check through the Web UI provided by Doris FE, enter the address in the browser

http://fe_ip:8030

You can see the interface below, indicating that FE has started successfully! ! !

Notice:
Here we use the built-in default user root of Doris to log in, and the password is empty
This is a Doris management interface. Only users with administrative rights can log in, and ordinary users cannot log in.


Success is made up of a lot of disappointments.