Write your own compiler: building a lexical analyzer in C

In the previous section we completed an input system written in C. In this section we build on it to complete a lexical analyzer in C. The basic design of the analyzer is:
1. The input system from the previous section reads strings from a file.
2. The state machine code that GoLex generated for us recognizes the strings read in step 1.
3. Template code written in C drives steps 1 and 2.
Let's look at the concrete steps. First, we declare the input system's functions from the previous section in a header file. Add a header file l.h to the CLex project with the following content:

#ifndef __L_H
#define __L_H
extern int ii_newfile(char* name);
extern unsigned char* ii_text();
extern int ii_flush(int force);
extern int ii_length();
extern int ii_lineno();
extern unsigned char* ii_ptext();
extern int ii_plength();
extern int ii_plineno();
extern unsigned char* ii_mark_start();
extern unsigned char* ii_mark_end();
extern unsigned char* ii_move_start();
extern unsigned char* ii_to_mark();
extern unsigned char* ii_mark_prev();
extern int ii_advance();
extern int ii_fillbuf(unsigned char* starting_at);
extern int ii_look(int n);
extern int ii_pushback(int n);
extern void ii_term();
extern void ii_unterm();
extern int ii_input();
extern void ii_unput(int c);
extern int ii_lookahead(int n);
extern int ii_flushbuf();
#endif
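
To make the roles of the look, advance, and mark functions easier to follow, here is a minimal sketch of the same idea over a single in-memory string. It is not the input system from the previous section: the MiniInput type and all mini_ names are hypothetical, and file reading, buffer refilling, and line counting are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical stand-in for the input system: a single in-memory
 * string, with no file reading, buffer refilling, or line counting. */
typedef struct {
    const char *buf;  /* the whole input */
    size_t len;       /* number of characters in buf */
    size_t next;      /* index of the next character to read */
    size_t start;     /* start of the current lexeme */
    size_t end;       /* end of the current lexeme (one past its last character) */
} MiniInput;

void mini_open(MiniInput *in, const char *text) {
    assert(in != NULL && text != NULL);
    in->buf = text;
    in->len = strlen(text);
    in->next = in->start = in->end = 0;
}

/* Peek at the n-th upcoming character (n = 1 is the next one) without
 * consuming it. Returns -1 past the end of the input. */
int mini_look(const MiniInput *in, size_t n) {
    size_t i = in->next + n - 1;
    return (n >= 1 && i < in->len) ? (unsigned char) in->buf[i] : -1;
}

/* Consume and return the next character, or -1 at the end of the input. */
int mini_advance(MiniInput *in) {
    if (in->next >= in->len) return -1;
    return (unsigned char) in->buf[in->next++];
}

/* Begin a new lexeme at the current read position. */
void mini_mark_start(MiniInput *in) { in->start = in->end = in->next; }

/* Record the current read position as the end of the lexeme. */
void mini_mark_end(MiniInput *in) { in->end = in->next; }

/* Rewind the read position to the recorded end of the lexeme. */
void mini_to_mark(MiniInput *in) { in->next = in->end; }

/* Length of the lexeme between the two marks. */
size_t mini_length(const MiniInput *in) { return in->end - in->start; }
```

With the input "3.14+x", marking the start, advancing four times, and marking the end gives a lexeme of length 4. If the reader then advances past '+', calling mini_to_mark puts the read position back on '+', which is the kind of rewind yylex relies on after it reads too far.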

Next, generate the C code for the state machine with GoLex. The code in main.go is as follows (we explained and debugged this code in the previous sections):

func main() {
	lexReader, _ := nfa.NewLexReader("input.lex", "output.py")
	lexReader.Head()
	parser, _ := nfa.NewRegParser(lexReader)
	start := parser.Parse()
	parser.PrintNFA(start)

	nfaConverter := nfa.NewNfaDfaConverter()
	nfaConverter.MakeDTran(start)
	nfaConverter.PrintDfaTransition()

	nfaConverter.MinimizeDFA()
	fmt.Println("---------new DFA transition table ----")
	nfaConverter.PrintMinimizeDFATran()

	//nfaConverter.DoPairCompression()
	nfaConverter.DoSquash()
	nfaConverter.PrintDriver()
}

Running the code above generates the file lex.yy.c locally. Copy all of the code in that file into main.c in the CLex project. Next, we need “glue” code that drives the input system to read data and then calls the generated state machine code to recognize strings. This function is named yylex. Still in main.c, add the yylex function as follows:

int yylex() {
    static int yystate = -1;
    int yylastaccept;
    int yyprev;
    int yynstate;
    int yylook; // lookahead character
    int yyanchor = 0;

    if(yystate == -1) {
        // Read data into the buffer
        ii_advance();
        // ii_advance moves the Next pointer forward by one character, so we move it back before reading any characters.
        ii_pushback(1);
    }

    yystate = 0;
    yyprev = 0;
    yylastaccept = 0;
    ii_unterm();
    ii_mark_start();
    while(1) {
        /*
         * We use a greedy strategy here. If the string recognized so far has already reached an accepting state
         * but more characters can still be read, we record the current accepting state and keep recognizing
         * the following characters until the end of the file is reached or a character makes recognition fail.
         * At that point we go back to the last accepting state and process it. This gives us the longest
         * string that reaches an accepting state.
         */
        while(1) {
            yylook = ii_look(1);
            if (yylook != EOF) {
                yynstate = yy_next(yystate, yylook);
                break;
            } else {
                if (yylastaccept) {
                    /*
                     * All file data has been read and we have already reached an accepting state,
                     * so set the next state to the failure state
                     */
                    yynstate = YYF;
                    break;
                }
                else if(yywrap()) {
                    yytext = "";
                    yyleng = 0;
                    return 0;
                }
                else {
                    ii_advance();
                    ii_pushback(1);
                }
            }
        }// inner while

        if (yynstate != YYF) {
            // Move to the next valid state
            printf("Transation from state %d ", yystate);
            printf(" to state %d on <%c>\n", yynstate, yylook);

            if (ii_advance() < 0) {
                // Buffer is full
                printf("Line %d, lexeme too long. Discarding extra characters.\n", ii_lineno());
                ii_flush(1);
            }

            yyanchor = Yyaccept[yynstate];
            if (yyanchor) {
                yyprev = yystate;
                yylastaccept = yynstate;
                ii_mark_end(); // Mark the end of the string recognized so far
            }

            yystate = yynstate;
        } else {
            // Moved to the failure state: the current character cannot extend the match
            if (!yylastaccept) {
                // Ignore illegal characters
                printf("Ignoring bad input\n");
                ii_advance();
            } else {
                // Return to the last accepting state
                ii_to_mark();
                if (yyanchor & 2) {
                    // End-of-line anchor: push the trailing newline back into the buffer
                    ii_pushback(1);
                }
                if (yyanchor & 1) {
                    // Start-of-line anchor: skip the newline at the beginning of the string
                    ii_move_start();
                }
                ii_term();
                // Get the recognized string, its length, and its line number
                yytext = (char*) ii_text();
                yyleng = ii_length();
                yylineno = ii_lineno();

                printf("Accepting state%d, ", yylastaccept);
                printf("line %d: <%s>\n", yylineno, yytext);

                switch (yylastaccept) {
                    /*
                     * The code for each accepting state is executed here. Later, this code
                     * will be generated by GoLex
                     */
                    case 3:
                    case 4:
                        printf("%s is a float number", yytext);
                        return FCON;
                    default:
                        printf("internal error, yylex: unknown accept state %d.\n", yylastaccept);
                        break;
                }
            }

            ii_unterm();
            yylastaccept = 0;
            yystate = yyprev;

        }

    }//outer while
}
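
The longest-match strategy that yylex uses can also be seen in isolation. The sketch below is an illustration only: it replaces the GoLex-generated tables with a small hand-written DFA for the hypothetical pattern [0-9]+(\.[0-9]+)?, and it scans a plain C string instead of going through the input system. The names next_state, is_accepting, longest_match, and DEAD are assumptions made for this example. DEAD plays the role of YYF, and last_accept_end plays the role of the mark set by ii_mark_end.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hand-written DFA for the pattern [0-9]+(\.[0-9]+)?
 * State 0: start, 1: integer part (accepting),
 * state 2: just after the dot (not accepting), 3: fraction (accepting). */
#define DEAD (-1)

static int next_state(int state, char c) {
    int is_digit = (c >= '0' && c <= '9');
    switch (state) {
        case 0:  return is_digit ? 1 : DEAD;
        case 1:  return is_digit ? 1 : (c == '.' ? 2 : DEAD);
        case 2:  return is_digit ? 3 : DEAD;
        case 3:  return is_digit ? 3 : DEAD;
        default: return DEAD;
    }
}

static int is_accepting(int state) {
    return state == 1 || state == 3;
}

/* Returns the length of the longest prefix of s that the DFA accepts,
 * or 0 if no prefix is accepted. */
size_t longest_match(const char *s) {
    assert(s != NULL);
    int state = 0;
    size_t pos = 0;
    size_t last_accept_end = 0; /* end of the last accepted prefix seen so far */

    while (s[pos] != '\0') {
        int next = next_state(state, s[pos]);
        if (next == DEAD) {
            break; /* this character cannot extend the match */
        }
        state = next;
        pos++;
        if (is_accepting(state)) {
            last_accept_end = pos; /* remember it, but keep reading */
        }
    }
    /* Falling back to last_accept_end discards anything read past the last accept. */
    return last_accept_end;
}
```

For the input "3.x" the scanner reads '3', accepts, reads '.', and then fails on 'x'. Because the state after the dot is not accepting, the result falls back to length 1, which mirrors what ii_to_mark does in yylex.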

Finally, we call the yylex function in the main function to drive the recognition process. The content of the main function is as follows:

int main() {
    int fd = ii_newfile("/Users/my/Documents/CLex/num.txt");
    if (fd == -1) {
        printf("value of errno: %d\n", errno);
        return 1; // the input file could not be opened, so there is nothing to scan
    }
    yylex();
    return 0;
}

After completing the code above, we compile the C code into an executable. Note that the code uses the input system's ii_newfile function to read a file named num.txt, which contains the string to be recognized. The file path could be passed in as a program argument, but for simplicity we hard-code it. Create the file num.txt locally, enter the numeric string 3.14 in it, and save it. Running the compiled program produces the following output:

Transation from state 0 to state 4 on <3>
Transation from state 4 to state 3 on <.>
Transation from state 3 to state 3 on <1>
Transation from state 3 to state 3 on <4>
Accepting state3, line 1: <
3.14>
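
The main function above hard-codes the input path. If you would rather pass the path as a program argument, one possible approach is a small helper like the one below. This is only a sketch: input_path is a hypothetical name that is not part of CLex, and it simply prefers argv[1] when one is given and otherwise falls back to a default path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the first command-line argument when one is
 * supplied and non-empty, otherwise return the fallback path. */
const char *input_path(int argc, char *argv[], const char *fallback) {
    assert(fallback != NULL);
    if (argc > 1 && argv != NULL && argv[1] != NULL && argv[1][0] != '\0') {
        return argv[1];
    }
    return fallback;
}
```

With this helper, main would be declared as int main(int argc, char *argv[]) and would pass the result of input_path(argc, argv, "num.txt") to ii_newfile. Since ii_newfile is declared with a char* parameter, a cast to char* would be needed at the call site.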

Here we can see that the C code correctly recognizes the string in the given file as a floating-point number, and that it prints the state machine's transition for each character it reads. From this we can conclude that the design of the C code is basically correct. Our goal in the next section is to replace all of the current “manual” steps with programs; for example, the step of pasting the code generated by GoLex will be done by code. Once GoLex both generates the code and pastes it into place, it becomes the equivalent of Flex, the well-known lexer generator in the compiler toolchain. For more details, search for Coding Disney on Bilibili. The code can be downloaded here:
Link: https://pan.baidu.com/s/1MHkg0qNV8QIEqtC_y0XjnA Extraction code: 1r4x