18 | How to develop a big data SQL engine yourself?

Starting today we enter the third module of this column, where we look at big data development in practice. When you learn a technology purely as a learner, passively absorbing it is always difficult. But if you look at it from a developer's perspective, many things suddenly become clear: you understand the principles, and sometimes you don't even need to study the details, because you can deduce the various implementation choices from the principles themselves.

On the surface, knowledge always looks fragmented and messy. If you only memorize scattered knowledge points, your knowledge stays limited and your ability to handle unfamiliar problems improves slowly. This is why some experts seem to know everything and can explain any technology clearly: it is not that they have learned and mastered every technology, but that when a question comes up, they reason from first principles and quickly deduce the conclusion.

I once interviewed an intern from Jiaotong University who had probably learned only the basics of MapReduce. I asked her how to implement a database join operation with MapReduce, something she clearly had not studied. She said, "Let me think," stared at the table for two or three seconds, and then gave an answer that was essentially Hive's actual implementation mechanism. Her answer told me she was a master. A master does not have to be senior or experienced; anyone who grasps the core essence of a technology, can analyze and deduce quickly, and can extend their knowledge and skills into unfamiliar territory is a master.

This is also the purpose of this column: to explain the core principles of big data technology, share some efficient ways of thinking, and help you build your own technical knowledge system.

In this module, I will describe, from the perspective of a big data developer, the issues that big data development has to deal with and the corresponding solutions. I hope you can put yourself in the developer's role, get past the tangled surface of knowledge points, master the core principles and ways of thinking, integrate the various technologies, and, through practice, eventually become a true master yourself.

Earlier in this column, I mentioned that the three dreams of programmers are to develop a database, an operating system, and a compiler. Today, through the design and development of a big data warehouse engine that supports standard SQL syntax, we will look at how to develop a big data SQL engine ourselves.

Before getting into today's content, let's review Hive, the big data warehouse discussed earlier. As a successful big data warehouse, Hive converts SQL statements into MapReduce execution processes and lowers the threshold of big data applications to the point where ordinary data analysts and engineers can get started quickly. If you have forgotten this part, you can go back to Issue 11 of the column to review it.

However, Hive has its own problems. Because it defines its own Hive QL syntax, analysts who are already familiar with traditional data warehouses such as Oracle still find it difficult to pick up. In particular, many companies have used traditional data warehouses for data analysis for a long time and have accumulated large numbers of SQL statements, polished over years of modification into something huge and complex. I once saw a statistical report SQL from a bank that filled two A4 pages when printed out. Just understanding such a SQL statement takes a long time; converting it into Hive QL is even more laborious, to say nothing of the bugs the conversion may introduce.

In 2012 I was on the big data team at Intel's Asia Pacific R&D Center. The team decided to develop a big data warehouse engine that could support standard database SQL, so that SQL which ran well on Oracle could run directly on Hadoop without being rewritten into Hive QL. This became the Panthera project.

Before developing Panthera, we analyzed Hive's main processing flow, which roughly breaks down into three steps:

1. The syntax parser converts the input Hive QL into a Hive abstract syntax tree (Hive AST).

2. The semantic analyzer converts the Hive AST into a MapReduce execution plan.

3. The generated MapReduce execution plan, together with Hive's execution function code, is submitted to Hadoop for execution.

Panthera's design idea is to keep Hive's semantic analyzer unchanged and replace Hive's syntax parser, so that standard SQL statements are converted into Hive abstract syntax trees that Hive's semantic analyzer can process. Expressed graphically, the part in the red box replaces the corresponding part of the original Hive in the black box.

We redeveloped the components in the red box. The light blue component is an open source SQL syntax parser that we used to parse standard SQL into a standard SQL abstract syntax tree (SQL AST). The dark blue component behind it, a SQL abstract syntax tree analyzer and converter that transforms the SQL AST into a Hive AST, was developed by the team itself.
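
To make this division of labor concrete, the whole pipeline can be sketched in a few lines of Java. This is only an illustration: every type and method name below is an assumption made for explanation, not a real API of Hive or Panthera.

 class SqlAst {}         // standard SQL abstract syntax tree
 class HiveAst {}        // Hive abstract syntax tree
 class MapReducePlan {}  // MapReduce execution plan

 interface SqlParser {              // the open source standard SQL parser (light blue)
   SqlAst parse(String standardSql);
 }

 interface SqlAstConverter {        // Panthera's own analyzer/converter (dark blue)
   HiveAst toHiveAst(SqlAst sqlAst);
 }

 interface HiveSemanticAnalyzer {   // reused from Hive, unchanged
   MapReducePlan analyze(HiveAst hiveAst);
 }

 class PantheraPipeline {
   private final SqlParser parser;
   private final SqlAstConverter converter;
   private final HiveSemanticAnalyzer analyzer;

   PantheraPipeline(SqlParser parser, SqlAstConverter converter,
       HiveSemanticAnalyzer analyzer) {
     this.parser = parser;
     this.converter = converter;
     this.analyzer = analyzer;
   }

   MapReducePlan compile(String standardSql) {
     SqlAst sqlAst = parser.parse(standardSql);      // standard SQL -> SQL AST
     HiveAst hiveAst = converter.toHiveAst(sqlAst);  // SQL AST -> Hive AST
     return analyzer.analyze(hiveAst);               // Hive AST -> MapReduce plan
   }
 }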

So what is the difference between standard SQL and Hive QL?

There are two main differences between standard SQL and Hive QL. The first is in syntax: Hive QL's syntax differs slightly from standard SQL's. The second is that Hive QL supports far fewer syntax elements than standard SQL; for example, Hive cannot run the SQL statements of TPC-H, the main benchmark in the data warehouse field. In particular, Hive does not support complex nested subqueries, which are nearly ubiquitous in data warehouse analysis. In the following SQL, for example, the where condition contains another SQL statement introduced by exists.

select o_orderpriority, count(*) as order_count
from orders
where o_orderdate >= date '[DATE]'
  and o_orderdate < date '[DATE]' + interval '3' month
  and exists (
    select * from lineitem
    where l_orderkey = o_orderkey and l_commitdate < l_receiptdate
  )
group by o_orderpriority
order by o_orderpriority;

Therefore, the key difficulty in developing a SQL engine that supports standard SQL syntax becomes how to eliminate complex nested subqueries, that is, how to rewrite the SQL so that the where condition no longer contains a select.

The theoretical basis of SQL is relational algebra, which has only five primary operations: union, difference, product, selection, and projection. Every SQL statement can ultimately be expressed as a combination of these five operations, and in particular a nested subquery can be equivalently converted into a join operation.
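
To see the idea in its simplest form, consider an uncorrelated in subquery, which can be rewritten as a left semi join. This is a minimal sketch of the principle, not Panthera's actual output:

select s_grade from staff
where s_city in (select p_city from proj);

This is equivalent to:

select s_grade from staff
left semi join proj on (staff.s_city = proj.p_city);

The left semi join keeps a staff row if at least one matching proj row exists, which is exactly what the in condition tests. Panthera applies the same idea to far harder cases, such as correlated not in subqueries.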

For example, take this SQL:

select s_grade from staff where s_city not in (select p_city from proj where s_empname=p_pname)

This SQL statement has a not in subquery nested in the where condition. It can be equivalently converted using a left outer join and a left semi join. Below is the equivalent SQL produced by Panthera's automatic conversion; it no longer contains any nested subquery.

select panthera_10.panthera_1 as s_grade
from (
  select panthera_1, panthera_4, panthera_6, s_empname, s_city
  from (
    select s_grade as panthera_1, s_city as panthera_4, s_empname as panthera_6,
           s_empname as s_empname, s_city as s_city
    from staff
  ) panthera_14
  left outer join (
    select panthera_16.panthera_7 as panthera_7, panthera_16.panthera_8 as panthera_8,
           panthera_16.panthera_9 as panthera_9, panthera_16.panthera_12 as panthera_12,
           panthera_16.panthera_13 as panthera_13
    from (
      select panthera_0.panthera_1 as panthera_7, panthera_0.panthera_4 as panthera_8,
             panthera_0.panthera_6 as panthera_9, panthera_0.s_empname as panthera_12,
             panthera_0.s_city as panthera_13
      from (
        select s_grade as panthera_1, s_city as panthera_4, s_empname as panthera_6,
               s_empname, s_city
        from staff
      ) panthera_0
      left semi join (
        select p_city as panthera_3, p_pname as panthera_5
        from proj
      ) panthera_2
      on (panthera_0.panthera_4 = panthera_2.panthera_3)
         and (panthera_0.panthera_6 = panthera_2.panthera_5)
      where true
    ) panthera_16
    group by panthera_16.panthera_7, panthera_16.panthera_8, panthera_16.panthera_9,
             panthera_16.panthera_12, panthera_16.panthera_13
  ) panthera_15
  on ((((panthera_14.panthera_1 <=> panthera_15.panthera_7)
        and (panthera_14.panthera_4 <=> panthera_15.panthera_8))
        and (panthera_14.panthera_6 <=> panthera_15.panthera_9))
        and (panthera_14.s_empname <=> panthera_15.panthera_12))
     and (panthera_14.s_city <=> panthera_15.panthera_13)
  where ((((panthera_15.panthera_7 is null)
        and (panthera_15.panthera_8 is null))
        and (panthera_15.panthera_9 is null))
        and (panthera_15.panthera_12 is null))
     and (panthera_15.panthera_13 is null)
) panthera_10;
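
Since the generated SQL is hard to read, here is a hand-simplified version of the same rewrite shape. This simplification is mine, not Panthera's output, and it ignores the NULL edge cases that the generated SQL handles with the null-safe <=> operator and the extra group by:

select s.s_grade
from staff s
left outer join
  (select distinct t.s_city, t.s_empname
   from staff t
   left semi join proj p on (t.s_city = p.p_city) and (t.s_empname = p.p_pname)) m
on (s.s_city = m.s_city) and (s.s_empname = m.s_empname)
where m.s_city is null;

The inner left semi join finds the staff rows whose (s_city, s_empname) pair does match a proj row; the left outer join plus the is null filter then keeps exactly the staff rows that do not match, which is the meaning of not in.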

The syntax trees of the original SQL and of Panthera's converted SQL, rendered with a visualization tool, are shown below.

This is the original SQL abstract syntax tree.

This is the abstract syntax tree after the equivalent conversion. It is too dense to read in detail, but it gives you a feel for the scale.

So how do you implement such complex syntax conversion in code? The Panthera project used a combination of several classic design patterns. Each syntax point is encapsulated in its own class, and each class usually has only a few dozen lines of code, which keeps the whole program simple and clean. When testing uncovered an unsupported syntax point, we only needed to add a new class for it, so team collaboration and code maintenance were easy.

The syntax-equivalence transformation classes are composed using the decorator pattern: every time Panthera adds a new syntax transformation capability, we only need to develop a new Transformer class and add it to the constructor chain below.

 private static SqlASTTransformer tf =
      new RedundantSelectGroupItemTransformer(
      new DistinctTransformer(
      new GroupElementNormalizeTransformer(
      new PrepareQueryInfoTransformer(
      new OrderByTransformer(
      new OrderByFunctionTransformer(
      new MinusIntersectTransformer(
      new PrepareQueryInfoTransformer(
      new UnionTransformer(
      new Leftsemi2LeftJoinTransformer(
      new CountAsteriskPositionTransformer(
      new FilterInwardTransformer(
      //use leftJoin method to handle not exists for correlated
      new CrossJoinTransformer(
      new PrepareQueryInfoTransformer(
      new SubQUnnestTransformer(
      new PrepareFilterBlockTransformer(
      new PrepareQueryInfoTransformer(
      new TopLevelUnionTransformer(
      new FilterBlockAdjustTransformer(
      new PrepareFilterBlockTransformer(
      new ExpandAsteriskTransformer(
      new PrepareQueryInfoTransformer(
      new CrossJoinTransformer(
      new PrepareQueryInfoTransformer(
      new ConditionStructTransformer(
      new MultipleTableSelectTransformer(
      new WhereConditionOptimizationTransformer(
      new PrepareQueryInfoTransformer(
      new InTransformer(
      new TopLevelUnionTransformer(
      new MinusIntersectTransformer(
      new NaturalJoinTransformer(
      new OrderByNotInSelectListTransformer(
      new RowNumTransformer(
      new BetweenTransformer(
      new UsingTransformer(
      new SchemaDotTableTransformer(
      new NothingTransformer())))))))))))))))))))))))))))))))))));
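
Each Transformer in this chain wraps the one passed to its constructor, which is the essence of the decorator pattern here. The base class implied by that chain looks roughly like the sketch below. The field and method names are inferred from the snippets in this article rather than copied from Panthera's source, so treat them as assumptions; CommonTree is ANTLR's syntax tree node type, and TranslateContext and SqlXlateException are Panthera's own classes.

 // A sketch of the decorator structure implied by the constructor chain above.
 abstract class SqlASTTransformer {
   // The wrapped (inner) transformer that this one decorates.
   protected final SqlASTTransformer tf;

   SqlASTTransformer(SqlASTTransformer tf) {
     this.tf = tf;
   }

   // Entry point for running the chain on a syntax tree.
   public void transformAST(CommonTree tree, TranslateContext context)
       throws SqlXlateException {
     transform(tree, context);
   }

   // Each concrete transformer first runs the chain it wraps
   // (tf.transformAST) and then applies its own rewrite -- exactly the
   // shape of the BetweenTransformer shown next.
   protected abstract void transform(CommonTree tree, TranslateContext context)
       throws SqlXlateException;
 }

 // The innermost element of the chain does nothing, so every other
 // transformer can delegate unconditionally.
 class NothingTransformer extends SqlASTTransformer {
   NothingTransformer() {
     super(null);
   }

   @Override
   protected void transform(CommonTree tree, TranslateContext context) {
     // End of the chain: no further transformation.
   }
 }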

Inside each concrete Transformer class, the composite pattern is used to traverse the abstract syntax tree (AST). Below is the traversal for the Between syntax node. Note how traversing the tree with the composite pattern keeps the recursion trivial: the recursive structure is already embodied in the tree itself, so each node only needs to visit its children and then handle its own syntax type.

 @Override
  protected void transform(CommonTree tree, TranslateContext context) throws SqlXlateException {
    tf.transformAST(tree, context);
    trans(tree, context);
  }

  void trans(CommonTree tree, TranslateContext context) {
    // deep firstly
    for (int i = 0; i < tree.getChildCount(); i++) {
      trans((CommonTree) (tree.getChild(i)), context);
    }
    if (tree.getType() == PantheraExpParser.SQL92_RESERVED_BETWEEN) {
      transBetween(false, tree, context);
    }
    if (tree.getType() == PantheraExpParser.NOT_BETWEEN) {
      transBetween(true, tree, context);
    }
  }

The equivalently converted AST is then further converted into a Hive-format abstract syntax tree, which can be handed to Hive's semantic analyzer for processing, thereby achieving support for standard SQL.

At the time, in order to demonstrate Hive's suitability for data warehouse workloads, Facebook engineers had manually converted the TPC-H test SQL into Hive QL. We ran comparison tests between this hand-written Hive QL and Panthera's automatically converted SQL. Each was faster on some queries, and overall the two were roughly on par, which shows that the efficiency of Panthera's automatic syntax analysis and conversion is quite good.

The results of the comparison test between Panthera (ASE) and Facebook's hand-converted Hive QL are shown below.

In fact, the standard SQL syntax set contains a great many syntax points. After nearly two years of hard work, racking our brains over all kinds of equivalent relational algebra transformations, we still had not covered the entire standard SQL syntax.

While developing Panthera, I read a great many websites and documents about SQL and databases, and discovered that beyond the mainstream databases we are familiar with there are many obscure ones, along with a large number of SQL syntax parsers, test data and script suites, and all kinds of peripheral tools. Products like MySQL and Oracle that we see every day are just the tip of the iceberg of the database ecosystem. Many excellent databases have lost out in the competition and remain unknown, and the papers and tools behind them are almost unheard of outside the industry.

This realization affected me deeply. I have always hoped that we Chinese would develop our own operating systems, databases, and programming languages, and I have watched many people throw themselves into the effort. Yet over the years most of these efforts quietly came to nothing, and a few ended as laughingstocks, fodder for after-dinner conversation. I used to wonder why.

After developing Panthera, I came to think that although our country has many people working in software development, most of them do top-level application development; too few people develop and research the underlying technologies, and too little comes out of that work. Without a surrounding ecosystem and sufficient competing products, trying to directly build our own influential operating system, database, or programming language is like trying to plant a few towering trees in the desert.

Fortunately, we are seeing more and more Chinese names appear in underlying technology products with global influence. The grass is already growing quietly; given time, great trees will appear.

Today we talked about how a SQL engine is designed. You may never have the chance to develop a SQL engine in your own job, but understanding this foundational knowledge and these design techniques will help you use databases well and build more flexible and robust systems.