String matching algorithm (BF+KMP)

Article directory

  • string matching algorithm
  • 1. BF algorithm
    • 1. BF code implementation:
  • 2. KMP algorithm
    • 1. Definition
    • 2. The principle of KMP algorithm:
    • 3. Implementation of KMP algorithm
    • 4. Optimization of KMP algorithm

String matching algorithm

1. BF algorithm

The Brute Force algorithm is an ordinary pattern matching algorithm. The idea of the BF algorithm is to match the first character of the target string S with the first character of the pattern string T. If they are equal, continue to compare the second character of S. character and the second character of T; if they are not equal, compare the second character of S and the first character of T, and compare them in turn until the final matching result is obtained. The BF algorithm is a brute force algorithm, and the time complexity of the BF algorithm is O(m*n).

1. BF code implementation:

/*
* Program name: kmp.c, this program demonstrates the matching pattern algorithm of strings, including:
* 1) Ordinary pattern matching algorithm, namely BF algorithm;
* 2) Find the next array of the KMP algorithm;
* 3) Find the nextval array of the KMP algorithm;
* 4) Implementation of KMP algorithm.
*/

#include <stdio.h>
#include <string.h>

//BF algorithm
int index_bf(char sstr[],char tstr[]);

int main()
{
        printf("Please enter a matching target string:");
        char sstr[101]; memset(sstr,0,sizeof(sstr));
        fgets(sstr,100,stdin);
        sstr[strlen(sstr)-1]=0;
        printf("Please enter the matching pattern string:");
        char tstr[101]; memset(tstr,0,sizeof(tstr));
            fgets(tstr,100,stdin);
        tstr[strlen(tstr)-1]=0;
        
        printf("Target string: %s, length is %d\\
", sstr, strlen(sstr));
        printf("pattern string: %s, length is %d\\
", tstr, strlen(tstr));
        printf("BF algorithm checks that the position of the pattern string in the target string is: %d\\
", index_bf(sstr,tstr));
        
        return 0;
}

// Use the BF algorithm to find the position where the pattern string tstr appears in the target string sstr, and the starting position of the string starts from 0.
// As long as the first pattern string tstr is found in the target string sstr, the function returns.
// Successfully returns the array subscript of the first occurrence of the pattern string tstr in the target string sstr, and returns -1 on failure.
int index_bf(char sstr[],char tstr[])
{
        if( sstr==0 || tstr==0 ) return -1;
        int pos1, pos2; //Two pointers to traverse the string
             pos1=pos2=0;

        // "hello world"
        // "ll0"
        while( pos1<strlen(sstr) & amp; & amp; pos2<strlen(tstr) )
        {

                // Compare the characters of the target string and the pattern string one by one
                //If they are equal, the pattern string and the target string are moved backward at the same time
                if( sstr[pos1] == tstr[pos2] )
                {
                        pos1++;
                        pos2++;
                }
                //If not equal, the target string backtracks to the start position + 1 where the match failed this time, and the pattern string backtracks to the first character
                else {
                        pos1=pos1-pos2 + 1;
                        pos2=0;
                }
        }

        //The match is successful, return the first position of the pattern string in the target string
        if( pos2==strlen(tstr) ) return pos1-strlen(tstr);
        return -1;
}

Test instance:

2. KMP algorithm

1. Definition

Its core is to use the information after the matching fails to minimize the number of matches between the pattern string and the main string to achieve the purpose of fast matching. The specific implementation is realized through a next() function, and the function itself contains partial matching information of the pattern string. The time complexity of the kmp algorithm is O(m + n).

2. The principle of KMP algorithm:

(1) During the matching process, the pointer of the target string does not need to be traced back, only the pointer of the pattern string is traced back;
(2) If the first n characters of the pattern string and the target string are successfully matched, when a character that fails to match is encountered, the backtracking position of the pattern string pointer is determined by the content of the pattern string (backtracking to the pattern string before the matching failure position The length of the longest common prefix and suffix of the content + 1), and then continue to compare.

3. Implementation of KMP algorithm

(1) According to the content of the pattern string, generate an array (next array) of the backtracking position of the pointer of the pattern string when a character that fails to match is encountered.
The function getnext() in the following code realizes the production of the next array;

(2) Slightly modify the BF algorithm, when the matching fails, the position pointer of the pattern string traces back to the position indicated by the next array.

Code:

/*
* Program name: kmp.c, this program demonstrates the matching pattern algorithm of strings, including:
* 1) Ordinary pattern matching algorithm, namely BF algorithm;
* 2) Find the next array of the KMP algorithm;
* 3) Find the nextval array of the KMP algorithm;
* 4) Implementation of KMP algorithm.
*/
//KMP algorithm
 // Calculate the next array according to the pattern string tstr, and the starting position of the string starts from 0.
void getnext(char tstr[],int next[]);
int index_kmp(char sstr[],char tstr[]);


int main()
{<!-- -->
        printf("Please enter a matching target string:");
        char sstr[101]; memset(sstr,0,sizeof(sstr));
        strcpy(sstr,"hello world");
        fgets(sstr,100,stdin);
        sstr[strlen(sstr)-1]=0;
        printf("Please enter the matching pattern string:");
        char tstr[101]; memset(tstr,0,sizeof(tstr));
        strcpy(tstr,"llo");
        fgets(tstr,100,stdin);
        tstr[strlen(tstr)-1]=0;
       printf("target string %s, length is %d\\
", sstr, strlen(sstr));
        printf("pattern string: %s, length is %d\\
", tstr, strlen(tstr));
/*
        int next[strlen(tstr)];
        getnext(tstr,next);
        printf("pattern string next array value:");
        int ii;
        for(ii=0;ii<strlen(tstr);ii++)
        {
                printf("= ",next[ii]);
        }
        printf("\\
");
*/

        printf("KMP algorithm checks that the position of the pattern string in the target string is: %d\\
", index_kmp(sstr,tstr));
        //printf("BF algorithm checks that the position of the pattern string in the target string is: %d\\
",index_bf(sstr,tstr));
        return 0;
}

// Calculate the next array according to the pattern string tstr, and the starting position of the string starts from 0.
void getnext(char tstr[],int next[])
{<!-- -->
        if( tstr==0 || next==0 ) return ;

        int len=strlen(tstr); //Get the character length of the pattern string
        //The first position is fixedly filled with -1 and the second position is fixedly filled with 0
        if(len ==0)return;
        if(len == 1){<!-- -->next[0]=-1; return ;}
        if(len == 2){<!-- -->next[0]=-1; next[1]=0; return ;}

        next[0]=-1;next[1]=0;
        //Calculate the next pointer from the third position (the length of the longest common prefix and suffix plus 1)
        int ii;
        char str1[1001];
        char str2[1001];
        int maxlen=0;
        for(ii=2; ii<len; ii + + )
 {<!-- -->
                int jj;
                for(jj=1; jj<ii; jj ++ )
                {<!-- -->
                        memset(str1,0,sizeof(str1));
                        memset(str2,0,sizeof(str2));

                        strncpy(str1,tstr,jj); //Get the prefix
                        strncpy(str2,tstr + ii-jj,jj); //Get the suffix
                        //printf("prefix=%s, suffix=%s\t", str1, str2);
                        //The prefix is equal to the suffix, record its length
                        if( strcmp(str1,str2)==0 )maxlen=jj;
                }
                //printf("\\
");
                next[ii]=maxlen;
                maxlen=0;
        }
        return;
}

// Use the kmp algorithm to find the position where the pattern string tstr appears in the target string sstr, and the starting position of the string starts from 0.
// // As long as the first pattern string tstr is found in the target string sstr, the function returns.
// // Successfully returns the array subscript where the pattern string tstr appears for the first time in the target string sstr, and returns -1 if it fails.
int index_kmp(char sstr[], char tstr[])
{<!-- -->
        if( sstr==0 || tstr==0 ) return -1;
        int pos1, pos2; //Two pointers to traverse the string
        pos1=pos2=0;

        // Get the next array.
        int next[strlen(tstr)];
        getnext(tstr,next);
\t\t
//int nextval[strlen(tstr)];
//getnextval(tstr,next,nextval);
\t\t
        int slen = strlen(sstr);
        int tlen=strlen(tstr);
        // "hello world"
        // "ll0"
        //while( pos1<(strlen(sstr)) & amp; & amp; (pos2<(strlen(tstr))) ) This line of code does not work, the second loop will jump out directly, the reason is unknown
        while( pos1<slen & amp; & amp; pos2<tlen )
        {<!-- -->
         // Compare the characters of the target string and the pattern string one by one
                //If they are equal, the pattern string and the target string are moved backward at the same time
                if( pos2==-1 || sstr[pos1]==tstr[pos2] )
                {<!-- -->
                        pos1++;
                        pos2++;
                }
                else{<!-- -->
                        pos2=next[pos2];
                        pos2=nextval[pos2];
                }
        }
        //The match is successful, return the first position of the pattern string in the target string
        if( pos2 == strlen(tstr) )
        {<!-- --> return (pos1-strlen(tstr));}
        return -1;
}

Test instance:

4. Optimization of KMP algorithm

Use an example to illustrate the necessity of optimization:
Target String: aaaabaaaacaaaadaaaaa
Pattern string: aaaaa
The pattern string next array is: -1 0 1 2 3

Where the first mismatch occurs:

At this time, the pattern string backtracks to the position of next[4], and its value is still a

Where the second mismatch occurs:


At this time, the pattern string continues to backtrack, backtracking to the position of next[3], its value is still a, and so on, until the backtracking to next[0], the target string will move back one position;

So it is necessary to optimize the kmp algorithm,
When backtracking, if it is found that the element value of the backtracking position is equal to the value of the current matching failure position, then directly set the value of nextval of the position as the value of nextval of the position to be backtracked;
Use a nextval array to store the backtracking position of each position;

Code:

// Calculate the nextval array according to the pattern string tstr and the next array, and the starting position of the string starts from 0.
// void getnextval(char *tstr,int *next,int *nextval)
void getnextval(char tstr[],int next[],int nextval[])
{<!-- -->
  if ( (tstr==0) || (next==0) || (nextval==0) ) return; // Determine the null pointer.

  int tlen=strlen(tstr); // The length of the pattern string.

  nextval[0]=-1; // Position 0 is directly filled with -1.

  int ii;

  // Start scanning from the first position.
  for (ii=1;ii<tlen;ii + + )
  {<!-- -->
  //The element value at the current position is equal to the element value at the backtracking position
    if (tstr[ii]==tstr[next[ii]])nextval[ii]=nextval[next[ii]];
    else
         nextval[ii]=next[ii];
  }
}