Extending SQL parsing syntax in Calcite

How to use JavaCC in Calcite

Calcite uses JavaCC by default to generate lexical analyzers and syntax analyzers.

1) Use JavaCC parser

In Calcite, JavaCC dependencies have been encapsulated into the calcite-core module. If you use Maven as a dependency management tool, you only need to add the corresponding calcite-core module coordinates.

<dependency>
    <groupId>org.apache.calcite</groupId>
    <artifactId>calcite-core</artifactId>
    <version>1.26.0</version>
</dependency>

In the code, you can directly use Calcite’s SqlParser interface to call the corresponding syntax parsing process to parse the relevant SQL statements.

Analysis process:

//SQL statement
String sql = "select * from t_user where id = 1";

// Parse configuration
SqlParser.Config mysqlConfig = SqlParser.config().withLex(Lex.MYSQL);

//Create parser
SqlParser parser = SqlParser.create(sql, mysqlConfig);

// Parse SQL statement
SqlNode sqlNode = parser.parseQuery();

System.out.println(sqlNode.toString());

2) Custom syntax

Sometimes it is necessary to expand some new syntax operations. Taking the data warehouse operation – Load as an example, we will introduce how to customize the syntax.

During the Load operation, data is imported from one data source to another. The syntax template used in the Load operation is as follows.

LOAD sourceType:obj TO targetType:obj
(fromCol toCol (,fromCol toCol)*)
[SEPARATOR '\t']

Among them, sourceType and targetType represent the data source type, obj represents the data object of these data sources, (fromCol toCol) represents the field name mapping, the first line in the file is the header, and the separator defaults to the tab character.

Load statement example:

LOAD hdfs:'/data/user.txt' TO mysql:'db.t_user' (name name,age age) SEPARATOR ',';

When it comes to actual implementation, there are two options.

One is to directly modify the source code of Calcite, add the corresponding syntax logic inside its own template file (Parser.jj), and then recompile.

However, the disadvantage of this method is very obvious, that is, it is too intrusive to the source code of Calcite itself.

Another method uses a template engine to extend the syntax file. The template engine can extract the extended syntax outside the template file to achieve the purpose of program decoupling.

At the implementation level, Calcite uses FreeMarker, which is a template engine. According to the template syntax defined by FreeMarker, placeholders in the template can be replaced by the Java API setting values it provides.

The following shows Calcite adding the file structure related to grammar logic through the template engine. Its source code defines the grammar file Parser.jj as a template, uses the .ftl file in the includes directory as an extension file, and finally uses config.fmpp to unify it. configuration.

The specific operation of adding grammar can be divided into three steps:

Write new JavaCC syntax files;

Modify the config.fmpp file and configure custom syntax;

Compile template files and grammar files.

1. Write a new JavaCC grammar file

There is no need to modify the Parser.jj file. You only need to modify the .ftl file in the includes directory. For the Load operation mentioned above, you only need to add the syntax corresponding to Load in the parserImpls.ftl file.

Before writing the syntax file, you must first determine the final result from the perspective of code and object-oriented thinking, which is the SqlNode node you hope to get in the end.

After abstracting the content of the Load statement and encapsulating it, we get SqlLoad, which inherits SqlCall and represents an operation. The data source and target source in the Load operation have the same structure, so SqlLoadSource is encapsulated, and the field mapping can be encapsulated with a list. SqlColMapping only contains a Heap column mapping, SqlNodeList represents the node list.

Extended SqlLoad code implementation:

// Extend the code implementation of SqlLoad
public class SqlLoad extends SqlCall {
    // source information
    private SqlLoadSource source;
    // Endpoint information
    private SqlLoadSource target;
    // Column mapping relationship
    private SqlNodeList colMapping;
    // delimiter
    private String separator;

    // Construction method
    public SqlLoad(SqlParserPos pos) {
        super(pos);
    }
\t\t
// Extended constructor
    public SqlLoad(SqlParserPos pos,
                   SqlLoadSource source,
                   SqlLoadSource target,
                   SqlNodeList colMapping,
                   String separator) {
        super(pos);
        this.source = source;
        this.target = target;
        this.colMapping = colMapping;
        this.separator = separator;
    }
}

Since the Load operation involves two data sources, the data sources also need to be defined.

Definition class of data source in Load statement:

/**
 * Define the data source information in the Load statement
 */
@Data
@AllArgsConstructor
public class SqlLoadSource {
    private SqlIdentifier type;
    private String obj;
}

The field mapping relationships appearing in the Load statement also need to be defined.

Define the field mapping relationship in the Load statement:

// Define the field mapping relationship in the Load statement
public class SqlColMapping extends SqlCall {
    // operation type
    protected static final SqlOperator OPERATOR =
            new SqlSpecialOperator("SqlColMapping", SqlKind.OTHER);
    private SqlIdentifier fromCol;
    private SqlIdentifier toCol;
    
    public SqlColMapping(SqlParserPos pos) {
        super(pos);
}
\t\t
    // Construction method
    public SqlColMapping(SqlParserPos pos,
                         SqlIdentifier fromCol,
                         SqlIdentifier toCol) {
        super(pos);
        this.fromCol = fromCol;
        this.toCol = toCol;
    }
}

In order to output SQL statements, the unparse method also needs to be rewritten.

unparse method definition:

/**
 * Define unparse method
 */
@Override
public void unparse(SqlWriter writer, int leftPrec, int rightPrec) {
    writer.keyword("LOAD");
source.getType().unparse(writer, leftPrec, rightPrec);
\t\t
    writer.keyword(":");
    writer.print("'" + source.getObj() + "' ");
    writer.keyword("TO");
    target.getType().unparse(writer, leftPrec, rightPrec);
    
    writer.keyword(":");
    writer.print("'" + target.getObj() + "' ");
    
    final SqlWriter.Frame frame = writer.startList("(", ")");
    for (SqlNode n : colMapping.getList()) {
        writer.newlineAndIndent();
        writer.sep(",", false);
        n.unparse(writer, leftPrec, rightPrec);
    }
    
    writer.endList(frame);
    writer.keyword("SEPARATOR");
    writer.print("'" + separator + "'");
}

After the required SqlNode node class is defined, you can start writing the grammar file. The Load grammar has no redundant branch structure. Only column mapping uses loops, and there may be multiple columns.

Code example for adding syntax logic in parserImpls.ftl file:

//Node definition, returns the node we defined
SqlNode SqlLoad():
{
    SqlParserPos pos; // Parse positioning
    SqlIdentifier sourceType; // The source type is represented by an identifier node
    String sourceObj; //The source path is represented as a string, such as "/path/xxx"
    SqlIdentifier targetType;
    String targetObj;
    SqlParserPos mapPos;
    SqlNodeList colMapping;
    SqlColMapping colMap;
    String separator = "\t";
}
{
// LOAD syntax has no redundant branch structure, "go down one line", obtain the content of the corresponding position and save it to a variable
<LOAD>
    {
        pos = getPos();
    }
    
sourceType = CompoundIdentifier()

<COLON> // Colons and parentheses are already defined in Calcite's native parsing file, and we can also use them
    sourceObj = StringLiteralValue()
<TO>
    targetType = CompoundIdentifier()
<COLON>
    targetObj = StringLiteralValue()
    {
        mapPos = getPos();
    }
<LPAREN>
    {
        colMapping = new SqlNodeList(mapPos);
        colMapping.add(readOneColMapping());
    }
    (
<COMMA>
        {
            colMapping.add(readOneColMapping());
        }
    )*
    
<RPAREN>
[<SEPARATOR> separator=StringLiteralValue()]

// Finally construct the SqlLoad object and return
    {
        return new SqlLoad(pos, new SqlLoadSource(sourceType, sourceObj),
               new SqlLoadSource(targetType, targetObj), colMapping, separator);
    }
}

//Extract the content function of the string node
JAVACODE String StringLiteralValue() {
    SqlNode sqlNode = StringLiteral();
    return ((NlsString) SqlLiteral.value(sqlNode)).getValue();
}

SqlNode readOneColMapping():
{
    SqlIdentifier fromCol;
    SqlIdentifier toCol;
    SqlParserPos pos;
}
{
    { pos = getPos();}
    fromCol = SimpleIdentifier()
toCol = SimpleIdentifier()
    {
        return new SqlColMapping(pos, fromCol, toCol);
    }
}

2. Modify the config.fmpp file and configure custom syntax

You need to copy the config.fmpp file in the Calcite source code to the src/main/codegen directory of the project, and then modify the content inside to declare the extended part.

Example definition of config.fmpp file:

data: {
    parser: {
        # Generated parser package path
        package: "cn.com.ptpress.cdm.parser.extend",
        #Parser name
        class: "CdmSqlParserImpl",
#Introduced dependency classes
        imports: [
            "cn.com.ptpress.cdm.parser.load.SqlLoad",
            "cn.com.ptpress.cdm.parser.load.SqlLoadSource"
            "cn.com.ptpress.cdm.parser.load.SqlColMapping"
        ]
        # New keyword
        keywords: [
            "LOAD",
            "SEPARATOR"
        ]
        #New grammar parsing method
        statementParserMethods: [
            "SqlLoad()"
        ]
        #Extended syntax files included
        implementationFiles: [
            "parserImpls.ftl"
        ]
    }
}
# Directory of expansion files
freemarkerLinks: {
    includes: includes/
}

3. Compile template files and grammar files

In this process, the template Parser.jj file needs to be compiled into the real Parser.jj file, and then the grammar parsing code is generated based on the Parser.jj file.

Use the Maven plug-in to complete this task. The specific operation can be divided into two stages: initialization and compilation.

In the initialization phase, the codegen directory is added to the compiled resources through the resources plug-in, and then the Parser.jj file in the calcite-core package is extracted into the build directory through the dependency plug-in.

Configuration method of plug-ins required for compilation:

<plugin>
    <artifactId>maven-resources-plugin</artifactId>
    <executions>
<execution>
            <phase>initialize</phase>
            <goals>
                <goal>copy-resources</goal>
            </goals>
        </execution>
    </executions>
    
    <configuration>
        <outputDirectory>${basedir}/target/codegen</outputDirectory>
        <resources>
            <resource>
                <directory>src/main/codegen</directory>
                <filtering>false</filtering>
            </resource>
        </resources>
    </configuration>
</plugin>

<plugin>
    <!--Extract the parser grammar template from calcite-core.jar and put it in the directory where the FreeMarker template is located-->
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-dependency-plugin</artifactId>
<version>2.8</version>
    <executions>
        <execution>
            <id>unpack-parser-template</id>
            <phase>initialize</phase>
            <goals>
                <goal>unpack</goal>
            </goals>
            <configuration>
                <artifactItems>
                    <artifactItem>
                        <groupId>org.apache.calcite</groupId>
                        <artifactId>calcite-core</artifactId>
                        <version>1.26.0</version>
                        <type>jar</type>
                        <overWrite>true</overWrite>
                        <outputDirectory>${project.build.directory}/</outputDirectory>
                        <includes>**/Parser.jj</includes>
                    </artifactItem>
                </artifactItems>
            </configuration>
        </execution>
</executions>
</plugin>

These 2 plug-ins can be tested through the “mvn initialize” command.

After running successfully, you can see that there is a codegen directory in the target directory, and there is an additional Parser.jj file that has not been written.

Then comes the compilation phase, using the plug-in provided by the FreeMarker template, compiling the Parser.jj template according to config.fmpp, declaring the config.fmpp file path template and output directory, and running the plug-in in the generate-resources phase of Maven.

How FreeMarker is configured in the pom.xml file:

<plugin>
    <configuration>
        <cfgFile>${project.build.directory}/codegen/config.fmpp</cfgFile>
        <outputDirectory>target/generated-sources</outputDirectory>
        <templateDirectory>
            ${project.build.directory}/codegen/templates
        </templateDirectory>
    </configuration>
    <groupId>com.googlecode.fmpp-maven-plugin</groupId>
    <artifactId>fmpp-maven-plugin</artifactId>
    <version>1.0</version>
    <dependencies>
        <dependency>
            <groupId>org.freemarker</groupId>
            <artifactId>freemarker</artifactId>
            <version>2.3.28</version>
        </dependency>
    </dependencies>
    <executions>
<execution>
            <id>generate-fmpp-sources</id>
            <phase>generate-sources</phase>
            <goals>
                <goal>generate</goal>
            </goals>
        </execution>
    </executions>
</plugin>

Run the “mvn generate-resources” command to generate the real Parser.jj file.

The last step is to compile the grammar file, which can be completed using the JavaCC plug-in.

JavaCC plug-in configuration method:

<plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>javacc-maven-plugin</artifactId>
    <version>2.6</version>
    <executions>
        <execution>
            <phase>generate-sources</phase>
            <id>javacc</id>
            <goals>
                <goal>javacc</goal>
            </goals>
            <configuration>
                <sourceDirectory>
                    ${basedir}/target/generated-sources/
                </sourceDirectory>
                <includes>
                    <include>**/Parser.jj</include>
</includes>
                <lookAhead>2</lookAhead>
                <isStatic>false</isStatic>
                <outputDirectory>${basedir}/src/main/java</outputDirectory>
            </configuration>
        </execution>
    </executions>
</plugin>

Pay attention to the I/O directory here and place the generated code directly in the project.

It seems that several commands are used in each stage above. In fact, only one Maven command is needed to complete all steps, namely “mvn generate-resources”. This command includes the above 2 operations and all 4 plug-ins will be executed.

After compilation is completed, you can test the new grammar, configure the generated parser class in the test code, and then write a simple Load statement.

4. Sample code for testing Load statement

String sql = "LOAD hdfs:'/data/user.txt' TO mysql:'db.t_user' (c1 c2,c3 c4) SEPARATOR ','";

// Parse configuration
SqlParser.Config mysqlConfig = SqlParser.config()
        //Use parser class
        .withParserFactory(CdmSqlParserImpl.FACTORY)
        .withLex(Lex.MYSQL);

SqlParser parser = SqlParser.create(sql, mysqlConfig);

SqlNode sqlNode = parser.parseQuery();

System.out.println(sqlNode.toString());

The output result is exactly what the overridden unparse method outputs.

Results output by unparse method:

LOAD 'hdfs': '/data/user.txt' TO 'mysql': 'db.t_user'
('c1' 'c2', 'c3' 'c4')
SEPARATOR ','