Generating a lexical/syntactic analyzer for the SysY2022 grammar with the flex/bison tools

This article is based on open-source code written by outstanding alumni who took part in previous editions of the competition; it is shared purely for learning purposes (if anything here infringes on your rights, please contact me and I will remove it). My development environment is Ubuntu 16.04. One point worth mentioning up front: the SysY grammar is specified in EBNF, while grammar rules in bison are written in BNF, so part of the work is rewriting the EBNF rules as BNF.

Introduction to SysY2022

SysY is the programming language to be implemented in the Compilation System Design Competition; it is essentially a subset of the C language with small extensions. See https://compiler.educg.net/ for the full specification.

1. Create test.l file

First part:

Definitions part: the code enclosed in %{ %} declares the variables and header files used by the rules section, and is copied verbatim into the generated C file lex.yy.c. Below that come named regular-expression definitions, which work much like C macros: the rules section can refer to them by name. %x declares an exclusive start condition (state): once the scanner enters such a state, only patterns marked with that state can match (INITIAL is the default state; %s would declare an inclusive state, in which the unmarked rules stay active as well).

%{
#include <stdio.h>
#include "tiny.tab.h"
extern int yylineno; //global line counter (defined by flex, incremented in the rules below)
%}

intnum [1-9][0-9]*|0[0-7]*|(0x|0X)[0-9a-fA-F]+
floatnum [0-9]+[Ee][0-9]+
id [a-zA-Z_][a-zA-Z0-9_]*

/* COMMENT: state for multi-line comments; LINECOMMENT: state for single-line comments */
%x COMMENT
%x LINECOMMENT

Second part:

Rules (pattern matching) part: the sections of the file are separated by %%. The lexical analyzer reads the source program as an input stream; whenever the input matches one of the defined patterns, the corresponding lexical unit (TOKEN) is returned to the syntax analyzer, i.e. the source program is broken up into tokens and handed to the parser. The token codes are defined in tiny.tab.h, which, as described later, is generated from tiny.y.

%%
"/*" {BEGIN(COMMENT);}
<COMMENT>"*/" {BEGIN(INITIAL);}
<COMMENT>([^*]|\n)+|.
<COMMENT><<EOF>> {printf("Unterminated comment\n"); return 0;}
"//" {BEGIN(LINECOMMENT);}
<LINECOMMENT>.*
<LINECOMMENT>\n {BEGIN(INITIAL);}
<LINECOMMENT><<EOF>> {BEGIN(INITIAL); return 0;}

[ \t] { /* ignore whitespace */ }
\n { yylineno++; /* a newline was matched: increment the line counter */ }
"int" {printf("<INT>");return (INT);} //Add side effects, output for us to debug.
"float" {return (FLOAT);}
"const" {return (CONST);}
"void" {return (VOID);}
"break" {return (BREAK);}
"continue" {return (CONTINUE);}
"return" {printf("<RETURN>");return (RETURN);}
"if" {printf("<IF>");return (IF);}
"else" {printf("<ELSE>");return (ELSE);}
"while" {return (WHILE);}
{intnum} {printf("<INTNUM>");return (INTNUM);}
{floatnum} {return (FLOATNUM);}
"<" {return (LT);}
"<=" {return (LE);}
">" {return (GT);}
">=" {return (GE);}
"==" {return (EQ);}
"!=" {return (NE);}
"=" {printf("<ASSIGN>");return (ASSIGN);}
" + " {return (ADD);}
"-" {return (SUB);}
"*" {return (MUL);}
"/" {return (DIV);}
"%" {return (MOD);}
"!" {return (NOT);}
" & amp; & amp;" {return (AND);}
"||" {return (OR);}
";" {return (SEMI);}
":" {return (COLON);}
"," {return (COMMA);}
"(" {return (L);}
")" {return (R);}
"{" {return (OB);}
"}" {return (CB);}
"[" {return (LB);}
"]" {return (RB);}
{id} {printf("<Ident>");return (Ident);}
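
By the way, it can be useful to end the rules with a catch-all pattern for characters that match nothing above (flex's default action is to silently echo them to stdout). A small sketch of such a rule:

. {printf("line %d: unexpected character '%s'\n", yylineno, yytext); /* catch-all for stray characters */}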

The third part:

Helper function section:

%%
int yywrap(){ //called at the end of the input; returning 1 stops scanning, returning 0 continues with a new yyin (e.g. another file)
    return 1;
}
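
For example, if you ever wanted to scan several source files in one run, yywrap could be written along the following lines instead of the trivial version above (a rough sketch; the file names are made up for illustration):

static const char* more_files[] = {"a.sy", "b.sy"}; /* hypothetical additional inputs */
static int next_file = 0;

int yywrap(){
    if(next_file < 2){
        FILE* f = fopen(more_files[next_file++], "r");
        if(f != NULL){
            yyin = f;   /* the scanner keeps reading from the new yyin */
            return 0;   /* 0 = there is more input to scan */
        }
    }
    return 1;           /* 1 = no more input, stop */
}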

This completes the lexical analyzer; next we build the syntax analyzer.

2. Create tiny.y file

The syntax analyzer takes the token stream returned by the lexical analyzer as its input. If the token stream does not conform to the grammar, the syntax analyzer reports a syntax error. The format of a bison file is very similar to that of a flex file.

First part

1. Declarations: bison directives starting with % (such as the %token, %union and %start used below) configure the generated parser. Note that %option is a flex directive and belongs in test.l, not in the .y file.

2. The C prologue enclosed in %{ %} works the same way as in flex: it is plain C code declaring the header files to include and some global variables.

%{
    #include <stdio.h>
    #include <ctype.h>
    #define YYDEBUG 1 //enable bison's debugging support
    int yylex(); //provided by the lexical analyzer; each call returns the next TOKEN
    int yyerror(char* s);
    extern int yylineno; //line counter maintained in test.l
%}

3. Token declaration part: first the union type of yylval is defined with %union, then %token declares each token type. This part ends up in tiny.tab.h so that flex can include it. A token may also be given a literal string alias (its "matching pattern"); when no alias is specified, the grammar rules below must use the token name rather than the literal character. For example, if we do not give LT an alias, the rule has to be written as:

RelExp : AddExp {$$=$1;}
       | RelExp LT AddExp

and could not be written as:

RelExp : AddExp {$$=$1;}
       | RelExp "<" AddExp

Use %start CompUnit to define the start symbol.

Implementation:

%union{
    int const_Int_Val;
    float const_Float_Val;
    char* const_String_Val;
}
%token ASSIGN "="
%token LT
%token LE
%token GT
%token GE
%token EQ
%token NE
%token ADD "+"
%token SUB "-"
%token MUL "*"
%token DIV "/"
%token MOD "%"
%token NOT "!"
%token SEMI ";"
%token COLON ":"
%token COMMA ","
%token OB "{"
%token CB "}"
%token LB "["
%token RB "]"
%token L "("
%token R ")"
%token CONST
%token IF
%token ELSE
%token WHILE
%token BREAK
%token RETURN
%token CONTINUE
%token AND
%token OR
%token <const_String_Val> Ident
%token INT
%token FLOAT
%token VOID
%token <const_Int_Val> INTNUM
%token <const_Float_Val> FLOATNUM

%start CompUnit
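
One more detail: once %union has real members, bison also wants a %type declaration for every nonterminal whose action uses $$ or $n, otherwise it reports errors such as "$$ of 'Exp' has no declared type" for the {$$=$1;} actions used below. A minimal sketch of such declarations, assuming purely for illustration that the expression nonterminals pass an integer value around through const_Int_Val (a real compiler would more likely store a pointer to an AST node here):

%type <const_Int_Val> Exp Cond ConstExp AddExp MulExp UnaryExp PrimaryExp
%type <const_Int_Val> RelExp EqExp LAndExp LOrExp LVal Number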

Second part

Located between the two %% markers: this is the BNF grammar-rules part. The leftmost symbol of each rule is a nonterminal, and what follows the colon are the productions for that nonterminal. If a nonterminal has several productions they are separated by a vertical bar |. Each production can be followed by an action enclosed in { }, written as C code; when no action is given, {$$ = $1;} is executed by default.

Since the SysY grammar is given in EBNF, its repetition construct { ... } has to be rewritten before it can be fed to bison; we rewrite it with left recursion, which bison handles efficiently.
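
For instance, the SysY EBNF rule for constant declarations reads roughly as follows, where the braces mean "repeated zero or more times":

ConstDecl → 'const' BType ConstDef { ',' ConstDef } ';'

Rewritten in BNF with a left-recursive helper nonterminal, it becomes: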

ConstDecl: CONST BType ConstDefGroup ";" ;
ConstDefGroup: ConstDef | ConstDefGroup "," ConstDef ;

Implementation:

%%
CompUnit: FuncDef {printf("Funcdef successful!");} //debug output: reducing to CompUnit means this part of the input parsed successfully
        | CompUnit FuncDef
        | Decl {printf("Decl successful!");}
        | CompUnit Decl ;
Decl: ConstDecl
     | VarDecl ;
ConstDecl: CONST BType ConstDefGroup ";" ;
ConstDefGroup: ConstDef
          | ConstDefGroup "," ConstDef ;

BType: INT
     | FLOAT
     | VOID;
ConstDef: Ident ConstExpGroup ASSIGN ConstInitVal {}| Ident ASSIGN ConstInitVal{};
ConstExpGroup: "[" ConstExp "]" {} | ConstExpGroup "[" ConstExp "]" {};
ConstInitVal : ConstExp | "{" "}" | "{" ConstInitValGroup "}";
ConstInitValGroup: ConstInitVal {}| ConstInitValGroup "," ConstInitVal {};
VarDecl : BType VarDefGroup ";" {};
VarDefGroup: VarDef {}| VarDefGroup "," VarDef{};
VarDef : Ident {}| Ident ASSIGN InitVal {}| Ident ArrayDef {}| Ident ArrayDef ASSIGN InitVal {};
ArrayDef: "[" ConstExp "]" | ArrayDef "[" ConstExp "]";
InitVal : Exp | "{" "}" | "{" InitValGroup "}";
InitValGroup: InitVal | InitValGroup "," InitVal;
FuncDef : BType Ident "(" ")" Block | BType Ident "(" FuncFParams ")" Block;
FuncFParams : FuncFParam | FuncFParams "," FuncFParam {};
FuncFParam: BType Ident | BType Ident "[" "]" | BType Ident "[" "]" FuncFParamArray;
FuncFParamArray: "[" Exp "]" | FuncFParamArray "[" Exp "]";
Block : "{" "}" {}| "{" BlockGroup "}" {};
BlockGroup: BlockItem {}| BlockGroup BlockItem{};
BlockItem : Decl | Stmt;
Stmt : LVal "=" Exp ";"
| ";" {}
| Exp ";"
| Block
| IF "(" Cond ")" Stmt
| IF "(" Cond ")" Stmt ELSE Stmt
| WHILE "(" Cond ")" Stmt
| BREAK ";"
| CONTINUE ";"
| RETURN ";"
| RETURN Exp ";" ;
Exp : AddExp {$$=$1;};
Cond : LOrExp {$$=$1;};
LVal : Ident | Ident ArrayList;
ArrayList:"[" Exp "]"
         | ArrayList "[" Exp "]";
Number : INTNUM
       | FLOATNUM;
PrimaryExp : "(" Exp ")" {$$=$2;}
           | LVal {$$=$1;}
           | Number {$$=$1;};
UnaryExp : PrimaryExp {$$=$1;}
         | Ident "(" ")"
         | Ident "(" FuncParamsGroup ")"
         | UnaryOp UnaryExp;
UnaryOp : ADD
        | SUB
        | NOT ;
FuncParamsGroup: Exp | FuncParamsGroup "," Exp;
MulExp : UnaryExp {$$=$1;}
       | MulExp MUL UnaryExp
       | MulExp DIV UnaryExp
       | MulExp MOD UnaryExp ;
AddExp : MulExp {$$=$1;}
       | AddExp ADD MulExp
       | AddExp SUB MulExp ;
RelExp : AddExp {$$=$1;}
       | RelExp LT AddExp
       |RelExp GT AddExp
       |RelExp LE AddExp
       |RelExp GE AddExp ;
EqExp : RelExp {$$=$1;}| EqExp EQ RelExp | EqExp NE RelExp ;
LAndExp : EqExp {$$=$1;}| LAndExp AND EqExp ;
LOrExp : LAndExp {$$=$1;}| LOrExp OR LAndExp ;
ConstExp : AddExp {$$=$1;};
%%

The third part

The third part contains the main function, which simply calls yyparse. yyparse is the entry point of the parser generated by bison: it repeatedly calls yylex to obtain tokens and matches them against the grammar rules, until the token stream ends or a syntax error is found.

int main(int argc, char** argv){
    extern FILE* yyin;
    if(argc == 2){
        if((yyin = fopen(argv[1], "r")) == NULL){
            printf("Can't open file %s\n", argv[1]);
            return 1;
        }
    }
    yyparse();
    if(argc == 2){
        fclose(yyin);
    }
    return 0;
}
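
The prologue above declares yyerror but never defines it; linking with -ly (see the gcc command below) pulls in a default implementation that simply prints the message it is given. If you want a slightly more helpful report, a minimal sketch that also prints the current line number (using the yylineno counter maintained in test.l) can be placed next to main:

int yyerror(char* s){
    extern int yylineno;
    fprintf(stderr, "line %d: %s\n", yylineno, s);
    return 0;
}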

At this point the syntax analyzer is finished as well. Now we compile everything:

Run bison -d tiny.y on the command line to generate the two files tiny.tab.h and tiny.tab.c.

Run flex test.l to generate lex.yy.c.

Run gcc lex.yy.c tiny.tab.c -lfl -ly -o parser to produce the executable parser.

Run parser

Run ./parser XXX, where XXX is the name of the source program. We write a main.c file to test it.

main.c

int main(){
    int b[4] = {1,2,3,4}; //definition of an array with an initializer list
    if(a[2]>a[1]){ //a is never declared, but we are only checking syntax here
return 1;
    }else{
return 2;
    }
    return 0;
}

Run ./parser main.c.

Seeing "Funcdef successful!" in the output means the program contains no syntax errors.

Next we write a main2.c that intentionally contains syntax errors:

int main(int a){
    if(a = 3){ //this should be the comparison a == 3; an assignment is not a valid Cond here
        return 2 //the ';' is deliberately missing
    }
    return;
}

Running ./parser main2.c reports a syntax error, as expected.