Velox: Meta's unified compute engine foundation

0 Background

I recently entered the field of compute engines, and I have come to appreciate how important it is to look up and track the industry even while you have your head down pushing forward. Compute and storage still differ enormously; at heart it is the difference between statelessness and state. The compute technology stack focuses on main memory and above and is CPU-oriented; the storage stack focuses on main memory and below and is disk-oriented. The CPU technology ecosystem changes very fast, and both the underlying hardware and the computing ideas built on it track the industry's cutting edge. Compute technology evolves rapidly, unlike storage-media innovation, which moves on a scale of decades (HDD -> SSD -> NVMe SSD -> NVM has taken that long) and is constrained by materials science and physics, so that along the axis of distance from the CPU we can only hunt for performance improvements (NVM) and capacity improvements (remote storage). The development of compute hardware, by contrast, is in full swing: beyond the market demand driven by large-model systems, there is active experimentation in combining compute hardware with software and in efficient data movement between hardware and main memory / CPU caches, all aimed at improving efficiency.

Against this background, individuals and companies alike need to keep following the industry's progress; whoever misses the current train will fall more than one step behind others.

Wes McKinney, founder of projects such as Pandas, Apache Arrow, and Ibis, has spent nearly a decade exploring and thinking about the computing field, from the days when Pandas first ran into all kinds of problems to the rise of Arrow today. Arrow solves the problem of efficient data transfer in analytics scenarios between different file formats, servers, disks, and memory, and its unified in-memory format has made it a universal, efficient platform for data storage, processing, and interchange. By now Arrow's role far exceeds the scope of the problems Wes originally set out to solve in projects like Pandas, which has been hugely inspiring both to Wes and to the industry.

Arrow takes modularity, interoperability, and composability to the extreme, and these properties have brought new ideas to the development of compute software. With Arrow, a great deal of wheel-reinventing manpower is saved: there is no need to build your own data management/storage layer, because calling Arrow's language libraries and the corresponding memory/disk management interfaces is enough to perform efficient ETL over data. These design ideas transfer directly to compute engines for AP, or even TP, because capabilities such as the type system, expression evaluation, and physical-plan operator execution are required by almost every compute engine. A unified but internally modular compute engine could serve the computing needs of many different upper-layer businesses, letting a brand-new business be built in minimal time with performance on par with the most advanced analytical systems in the industry, while saving enormous amounts of duplicated effort. If such a system existed, small innovations in the computing field would mean that resources flow into the ultimate optimization of computation itself rather than into mutually interfering duplicated work. In the long run, this benefits everyone.

Velox, as well as the related Arrow DataFusion, was developed against this general background; both are evolving rapidly and delivering benefits at a pace visible to the naked eye.

1 Problem to be solved

The core problems Velox needs to solve in Meta are as follows:

  1. Dozens of components inside the company serve analytical workloads and keep reinventing the same wheel, and moving data between components and reconciling their computation results consumes enormous manpower. Yet in analytics scenarios, beyond the SQL front end (parser), the optimizer, the runtime, and IO scheduling, most of the processing is identical: the type system, expression evaluation, physical operators, in-memory format, and resource management during execution are all the same.
  2. Because of the above, no single component can push data-processing performance to its limit, resulting in excessive resource consumption. Large companies, after all, need sensible ways to cut costs and improve efficiency.

To address these core issues, the Velox general-purpose compute platform (implemented at its core as a C++ library) provides the following:

  1. Performance: a large number of runtime optimizations, such as fully exploiting SIMD, lazy evaluation, adaptive predicate reordering and predicate pushdown, common subexpression elimination, and code generation.
  2. Consistency: the same input data produces consistent computation results regardless of which platform feeds it in; after all, it is one processing system. Time-consuming data migration, recomputation, and result reconciliation between systems are no longer needed; each system only needs to be ported quickly on top of Velox.
  3. Engineering efficiency: every function and every related optimization is developed and maintained once, with no repeated wheel-reinventing, which greatly improves engineering efficiency.

Currently, Velox has been integrated, or is being integrated, with more than ten data systems at Meta and elsewhere, such as Presto, Spark, PyTorch, XStream (stream processing), F3 (feature engineering), FBETL (data extraction), XSQL (distributed transaction processing), Scribe (message-bus infrastructure), and Saber (high-QPS external serving).

Take Presto, Meta's SQL compute engine, as an example. Coordinator nodes handle SQL parsing, optimization, and resource management; worker nodes receive the physical execution plan and actually run it. Most execution time is spent in the workers' data processing, and because Presto is implemented in Java there is also considerable overhead from Java processes, the JVM, and GC. With Velox, the Prestissimo project was implemented quickly (its core is very simple: it lets Velox understand the coordinator's plan and communicate with the coordinator) to replace the worker nodes: it accepts the physical plan from the coordinator nodes and executes it, with fully efficient Velox worker-to-worker data exchange, a unified engine, and a unified processing model internally. In real production environments, Prestissimo can deliver the same computing power with one third of the worker-node machine resources.

Similar work brings Velox to Spark: Spruce deserializes Spark's compute tasks into physical plans that Velox can recognize and executes them in Velox, which again translates directly into performance gains.

The aforementioned stream-processing and F3 feature-engineering systems serving machine learning can realize gains even faster, and different systems are free to use whichever of Velox's decoupled capabilities they need; for stream processing, the type system plus the compute functions may already be enough.

2 Implementation

2.1 Type System

Anyone who has worked on a compute engine knows how hard it is to implement a type system that is complete, correct, maintainable, and extensible. A standards-following database like PG has more than 400 built-in types, to say nothing of the endless stream of UDTs (user-defined types). The most complicated part is how the decimal in Numeric can guarantee correctness while supporting sufficiently high precision. Getting the whole type system right takes at least several person-years, and an elegant type system like Arrow's is further away still.

The underlying type system of Velox therefore carries a lot of Arrow's influence; after all, besides Meta's developers, Velox is also backed by Wes and other members of the Arrow core team. Uniformly encoding and decoding the data fed in by upper-layer systems, and performing the corresponding type conversion and evaluation inside Velox, is the core of the whole compute engine and also its most abstract part.

2.2 Vector memory format

Velox's in-memory format is an extension of Arrow's memory format. It stores different kinds of data (arrays, strings, null values) in memory in a unified, efficient way and provides efficient encoding and decoding methods. Velox's RowVector plays the role of Arrow's RecordBatch in the implementation, but Velox adds some interesting extensions on top.

  • LazyVector implements lazy evaluation: a value is only materialized the first time it is used. In some scenarios this reduces or limits IO operations, for example when reading sparse columns, where the ValueHook callback filters out large amounts of data that never needs to be read.
  • DecodedVector serves a practical business need: it flattens very large multi-dimensional vectors, adds indexing and encoding over the flattened data, and then exposes a unified access interface for efficient access.
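A minimal illustration of the LazyVector idea (the class and method names below are invented for this sketch, not the Velox API): the column is materialized by a loader callback on first access, so columns that are never referenced never trigger IO.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Toy lazy column: materialization is deferred to first use.
class LazyColumn {
 public:
  explicit LazyColumn(std::function<std::vector<int64_t>()> loader)
      : loader_(std::move(loader)) {}

  // First access runs the loader; later accesses reuse the cached values.
  const std::vector<int64_t>& values() {
    if (!values_) {
      values_ = loader_();  // simulated IO happens only here
      ++loadCount_;
    }
    return *values_;
  }

  int loadCount() const { return loadCount_; }  // how often "IO" actually ran

 private:
  std::function<std::vector<int64_t>()> loader_;
  std::optional<std::vector<int64_t>> values_;
  int loadCount_ = 0;
};
```

If a filter upstream discards the whole batch, `values()` is never called and the load cost is never paid, which is the point of the real LazyVector.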

At the same time, Velox diverges from Arrow in some type details to suit Meta's needs:

1. The string design adds a StringView metadata area. Arrow represents string data uniformly as a memory buffer plus a size:

 const uint8_t* data_;
 int64_t size_;

Velox adds metadata fields on top of the buffer:

 uint32_t size_;
 char prefix_[4];
 union {
   char inlined[8];
   const char* data;
 } value_;

The four bytes of the prefix_ field hold the string's prefix, which speeds up filtering and sorting and makes better use of SIMD's efficient comparisons. Short strings that fit into prefix_ plus value_ (up to 12 bytes) are stored completely inline; for longer strings, operations such as trim() and substr() only need to update the pointer.
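A sketch of this 16-byte header (a simplification of Velox's StringView; the constructor and accessors are invented for illustration). Strings of up to 12 bytes live entirely inline; longer strings keep a 4-byte prefix plus a pointer into the original buffer, so many comparisons can be decided from the prefix without chasing the pointer.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Simplified 16-byte string header in the spirit of Velox's StringView.
struct StringView {
  uint32_t size_;
  char prefix_[4];
  union {
    char inlined[8];
    const char* data;
  } value_;

  StringView(const char* s, uint32_t len) : size_(len) {
    std::memset(prefix_, 0, 4);
    std::memcpy(prefix_, s, len < 4 ? len : 4);
    if (len <= 12) {
      std::memset(value_.inlined, 0, 8);
      if (len > 4) std::memcpy(value_.inlined, s + 4, len - 4);
    } else {
      value_.data = s;  // no copy: points into the caller's buffer
    }
  }

  bool isInline() const { return size_ <= 12; }

  std::string str() const {
    if (isInline()) {
      std::string out(prefix_, size_ < 4 ? size_ : 4);
      if (size_ > 4) out.append(value_.inlined, size_ - 4);
      return out;
    }
    return std::string(value_.data, size_);
  }
};

static_assert(sizeof(StringView) == 16, "header fits in 16 bytes on 64-bit");
```

Because a long string is referenced rather than copied, a substr()-style operation only needs to produce a new header with an adjusted pointer, size, and prefix.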

2. RowVector's internal data supports out-of-order writes. That is, if RowVector fields must be updated inside branch statements such as if/switch, then as long as out-of-order writes are supported, SIMD can perform the updates in batches instead of a one-by-one for loop. Velox can support this because the size of each element in the column being updated is known: for fixed-width types it is constant, and for variable-length types Velox also keeps size + offset buffers, so an out-of-order write simply targets each element's offset + size.
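The offset + size scheme for variable-length data can be sketched as follows (an invented helper type, not a Velox API): because the layout is planned up front, each element can be written into the shared buffer in any order, e.g. from the two arms of an if/else, with no sequential append loop.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Toy variable-length buffer with pre-planned offsets, allowing
// out-of-order element writes.
struct VarBuffer {
  std::vector<char> data;
  std::vector<size_t> offsets;
  std::vector<size_t> sizes;

  // Pre-plan the layout for the given element sizes.
  explicit VarBuffer(const std::vector<size_t>& elementSizes) {
    size_t total = 0;
    for (size_t s : elementSizes) {
      offsets.push_back(total);
      sizes.push_back(s);
      total += s;
    }
    data.resize(total);
  }

  // Safe to call for any index, in any order; value must match the
  // planned size for that slot.
  void write(size_t i, const std::string& value) {
    std::memcpy(data.data() + offsets[i], value.data(), sizes[i]);
  }

  std::string read(size_t i) const {
    return std::string(data.data() + offsets[i], sizes[i]);
  }
};
```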

3. Support for more encodings. Velox supports Run-Length Encoding and Constant Encoding. The former provides efficient lossless compression; the latter serves literal and partition-key scenarios: if every value in a column is identical, only a single encoded value needs to be stored, which saves a great deal of space.
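A toy illustration of these two encodings (invented names, not Velox's vector classes): run-length encoding stores one value per run, and constant encoding is just the degenerate single-run case, so a column of a million identical partition-key values stores one value and one length.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy run-length-encoded column: one value and one length per run.
struct RleColumn {
  std::vector<int64_t> values;   // one entry per run
  std::vector<size_t> lengths;   // run length for each entry

  static RleColumn encode(const std::vector<int64_t>& flat) {
    RleColumn out;
    for (int64_t v : flat) {
      if (!out.values.empty() && out.values.back() == v) {
        ++out.lengths.back();
      } else {
        out.values.push_back(v);
        out.lengths.push_back(1);
      }
    }
    return out;
  }

  // Logical (decoded) number of elements.
  size_t size() const {
    size_t n = 0;
    for (size_t len : lengths) n += len;
    return n;
  }

  // Decode a single position by walking the runs.
  int64_t at(size_t i) const {
    for (size_t run = 0; run < lengths.size(); ++run) {
      if (i < lengths[run]) return values[run];
      i -= lengths[run];
    }
    return 0;  // out of range; a real implementation would throw
  }
};
```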

Velox is also discussing these type-level implementation differences with the Arrow community and working to merge them upstream.

2.3 Expression evaluation

Application scenarios:

  1. Can be used by the FilterProject operator for filtering and projection operations.
  2. TableScan and IO connectors can use it to evaluate whether a predicate pushdown is warranted.
  3. It can be used independently, such as data preprocessing in machine learning.

An input expression is represented as an expression tree. Each node can be one of the following kinds (this is fairly universal across compute engines, including arrow-acero and duckdb; the general abstraction is much the same):

  • Input column reference. For example, Var in PG, or FieldRef in Arrow.
  • Constant. A Const/Literal value.
  • Function expression. For example, T_FuncExpr in PG, containing the function and its argument expressions.
  • CAST expression. Converts the output of an expression into a specified type.
  • Lambda function. A user-defined function type.

An input expression tree looks similar to the following figure.

The evaluation process of Velox expressions mainly consists of two steps:

In PG this process is optimized with JIT, since most traditional executors interpret expressions; Velox uses compiled execution so that it can apply a number of runtime optimizations:

1. Compile. One or more expression lists are compiled into new executable expressions. This step can apply many runtime optimizations, including:

  • Common subexpression elimination. For the input expression strpos(upper(a), 'FOO') > 0 OR strpos(upper(a), 'BAR') > 0, the two subexpressions on either side of the OR share the common expression upper(a); it is deduplicated at compile time so that it is guaranteed to execute only once.
  • Constant folding. Deterministic subexpressions over constants are replaced with Const/Literal values. For example, the expression upper(a) = upper('Foo') is replaced with upper(a) = 'FOO'.
  • Adaptive conjunct reordering. Velox dynamically tracks how the conjuncts of an input expression perform: whichever conjunct filters out the most values in the least time is evaluated first. It also flattens OR/AND conjuncts; for example, AND(AND(AND(a,b),c),d) is flattened to AND(a,b,c,d).
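The flattening step above can be sketched on a toy expression tree (the Expr type and helper names here are invented, not Velox's internal representation). Once nested ANDs collapse into one node with a flat child list, the engine is free to reorder the conjuncts by measured selectivity.

```cpp
#include <memory>
#include <string>
#include <vector>

// Minimal expression-tree node: either an operator like "AND" or a leaf.
struct Expr {
  std::string name;
  std::vector<std::shared_ptr<Expr>> children;
};

// Recursively collect the leaves of same-operator nests into one flat list.
void flatten(const std::shared_ptr<Expr>& node,
             const std::string& op,
             std::vector<std::shared_ptr<Expr>>& out) {
  if (node->name == op) {
    for (auto& child : node->children) flatten(child, op, out);
  } else {
    out.push_back(node);
  }
}

// AND(AND(AND(a,b),c),d) -> AND(a,b,c,d)
std::shared_ptr<Expr> flattenConjuncts(const std::shared_ptr<Expr>& root) {
  if (root->name != "AND") return root;
  auto flat = std::make_shared<Expr>();
  flat->name = "AND";
  flatten(root, "AND", flat->children);
  return flat;
}
```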

2. Evaluate. Run the compiled expression and compute the actual output. During execution, SIMD instructions accelerate the evaluation of certain functions and scenarios.

Beyond these two basic steps, Velox can use compiler features to support code generation across the whole expression-evaluation process, although this feature is not yet in production. Its applicable scenarios resemble those of JIT: it only pays off for relatively expensive plans, because the performance gain of code generation essentially trades CPU + memory for execution time, and it relies entirely on the compiler's optimization ability. After all, compiler engineers presumably know more about writing CPU-friendly code than database kernel engineers do.

The paper also devotes a full section to Velox's function system. Its basic design is much the same as arrow::compute: a complete function-management system is needed to hide the differences between data types while still executing efficiently and correctly. There are many functions to support, and some are generic, such as less and equal, which can be instantiated automatically for the user's input types while taking advantage of SIMD.

The overall framework design closely resembles arrow::Expression::Call; the internal dispatch becomes more intricate once you go down into arrow::Kernel. The whole system is built with C++ template metaprogramming, and this part is worth studying in depth.

2.4 Operator

This part is the core of the physical execution plan, covering basic operator nodes such as filter, project, tablescan, aggregation, and join. Each plan node may be converted into one or more Velox operators for actual execution.
For example, a Project following a Filter is converted into a single FilterProject operator, while a HashJoin node is converted into two operators, HashBuild and HashProbe.

Once converted to operators, plan execution can be parallelized. The whole plan executes in push mode: starting from Exchange/TableScan nodes, the data read is pushed to downstream nodes as input, with RowVector as the unit of granularity:

In the figure above, for example, the HashJoin physical plan is converted into multiple pipeline tasks executed concurrently, with two threads for the probe side and three threads for the build side.

The build side and probe side share data through a JoinBridge: the probe side looks up the rows it reads in the hash table that the build side has published into the JoinBridge.
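A minimal sketch of this build/probe split (the JoinBridge struct and the buildSide/probeSide helpers here are invented simplifications of the mechanism described above; real Velox runs the two sides as separate pipeline operators):

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// The bridge the build side publishes its hash table through.
struct JoinBridge {
  std::unordered_multimap<int64_t, int64_t> table;  // key -> build payload
};

// Build side: insert (key, payload) rows into the shared table.
void buildSide(JoinBridge& bridge,
               const std::vector<std::pair<int64_t, int64_t>>& rows) {
  for (auto& [key, payload] : rows) bridge.table.emplace(key, payload);
}

// Probe side: for one batch of keys, emit (probeKey, buildPayload) matches.
std::vector<std::pair<int64_t, int64_t>> probeSide(
    const JoinBridge& bridge, const std::vector<int64_t>& keys) {
  std::vector<std::pair<int64_t, int64_t>> out;
  for (int64_t key : keys) {
    auto [begin, end] = bridge.table.equal_range(key);
    for (auto it = begin; it != end; ++it) out.emplace_back(key, it->second);
  }
  return out;
}
```

In the real engine the probe operators block until the bridge signals that the build is complete, and multiple probe threads then read the same table concurrently.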

Velox also applies SIMD in operator optimization. For example, Filter and Project are usually fused with TableScan, filtering column data before it leaves the scan, which eliminates most rows early; here the filter can use the AVX2 SIMD instruction set to check multiple values at once. The same AVX2 optimization applies to hash-join probing. For the hash table itself, arrow::Acero uses a swisstable implementation, which exploits SIMD instructions for hash matching more effectively; the paper does not go into the implementation details of Velox's hash table.

2.5 Memory Management

This section mainly introduces Velox’s own memory management and IO optimization.

Memory management is divided into two parts:

  1. Memory allocation and release during plan execution. Small allocations go through C++'s new; large allocations are managed by Velox's own memory pool, obtained via mmap + madvise. The pool also provides allocation tracking, so the memory consumption of each operator during plan execution is clearly visible.
  2. The spill mechanism when an operator runs out of memory. This is another advantage of self-managed memory: the engine tracks each operator's memory usage and can decide, under memory pressure, whether to spill or to reclaim memory from other nodes for the current one. Memory pressure is a real scenario when serving ultra-large-scale data analysis, especially for operators like sort/join: a sort must order all of its input, and when the sorted result does not fit in memory it has to be staged to disk and reloaded for later use. Any serious database should support this capability.
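The per-operator accounting behind both points can be sketched as follows (an invented interface, not Velox's MemoryPool API): every allocation is attributed to an operator, so the engine can see who holds memory, and a rejected allocation is the signal to trigger spilling for the largest consumer.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <unordered_map>
#include <utility>

// Toy memory pool with a global budget and per-operator attribution.
class TrackingPool {
 public:
  explicit TrackingPool(size_t limitBytes) : limit_(limitBytes) {}

  // Returns nullptr when the allocation would exceed the pool budget;
  // the caller would then trigger spilling for the largest consumer.
  void* allocate(const std::string& op, size_t bytes) {
    if (used_ + bytes > limit_) return nullptr;
    void* p = std::malloc(bytes);
    if (!p) return nullptr;
    used_ += bytes;
    perOperator_[op] += bytes;
    sizes_[p] = {op, bytes};
    return p;
  }

  void free(void* p) {
    auto it = sizes_.find(p);
    if (it == sizes_.end()) return;
    used_ -= it->second.second;
    perOperator_[it->second.first] -= it->second.second;
    std::free(p);
    sizes_.erase(it);
  }

  size_t usedBy(const std::string& op) const {
    auto it = perOperator_.find(op);
    return it == perOperator_.end() ? 0 : it->second;
  }

  size_t used() const { return used_; }

 private:
  size_t limit_;
  size_t used_ = 0;
  std::unordered_map<std::string, size_t> perOperator_;
  std::unordered_map<void*, std::pair<std::string, size_t>> sizes_;
};
```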

IO management means prefetching data from remote storage as aggressively as possible, narrowing the orders-of-magnitude gap between disk and memory access so that the plan's operator pipeline is not stalled by IO performance differences (everyone waiting for TableScan to feed data). This process also uses a self-managed data cache and a local SSD cache for data read from remote storage, speeding up access to frequently used data.

3 Performance benefits

Finally, a look at the performance gains.

The following data compares the Prestissimo (Velox) engine with the Presto Java engine on four TPC-H queries: Q1, Q6, Q13, and Q19. The first two are CPU-intensive aggregation queries; the last two are IO-intensive hash-join + scan queries.

Test setup: an 80-node cluster, each node with 64 GB RAM + 2x2 TB SSD, on a 3 TB TPC-H dataset in ORC format.


The results show that Prestissimo's overall improvement over Presto's Java engine is substantial.

In this test scenario the performance bottleneck is not on the Velox side. For Q1 and Q6 the bottleneck is Presto's coordinator, which must interact frequently with many worker nodes to process metadata. For Q13 and Q19 the bottleneck is data redistribution, which calls for optimizations in data encoding to reduce the amount of data transmitted when the hash join redistributes data across nodes.

Another test started two clusters, Prestissimo and presto-java, applied the same workload to both, and then gradually shrank the Prestissimo cluster. It turned out that Velox-based Prestissimo could serve the same workload with one third of presto-java's servers.

Summary

The platformization of compute engines has become a trend. Meta, Databricks, Snowflake, Voltron Data, and Google are already collaborating, top-level projects such as arrow, duckdb, and velox have been born, and they are open source. As for the benefits seen so far, whether it is velox inside Meta or datafusion inside various Rust databases, clear gains show up in the short term.

Building your own company's compute platform on such a project yields a well-performing compute engine with little manpower in a very short time, which is great news for startups. In the analytical market, storage already has complete support in the form of parquet/orc + arrow, and transactions are not a strong requirement in analytics scenarios (MVCC may well be enough), which means storage is no longer where the gap lies. Projects like Velox have put everyone on the same starting line in compute (performance)... In the end, what can domestic analytical databases, data warehouses, streaming systems, and cloud-native databases actually compete on in the market?

Extreme stability, the ultimate user experience, a wealth of user scenarios (a super data platform connecting data from all kinds of systems, much like the many connectors Velox supports), and "extremely strong business"?