Basic algorithm (continued 2) string matching algorithm

Table of Contents

Article directory

1. Naive Algorithm

2. KMP algorithm (Knuth-Morris-pratt-Algorithm)

3. Rabin-karp algorithm

4. Ooyer-Moore algorithm

5. Sunday algorithm

6. Aho-Corasick algorithm

1. Naive Algorithm

The naive string matching algorithm is a very intuitive string matching algorithm. The basic idea is to compare each character of the string to be found with each character of the target string one by one. This is why it is called “naive” s reason. Since it involves two levels of nested loops, its time complexity is O(mn), where m and n are the lengths of the target string and the string to be found respectively.

Naive matching algorithm C++ code example:

#include <iostream>
#include <string>

void search(std::string pat, std::string txt)
{
    int M = pat.length();
    int N = txt.length();

    for (int i = 0; i <= N - M; i + + )
    {
        int j;

        for (j = 0; j < M; j + + )
            if (txt[i + j] != pat[j])
                break;

        if(j==M)
            std::cout << "Pattern found at index " << i << std::endl;
    }
}

int main()
{
    std::string txt = "ABABDABACDABABCABAB";
    std::string pat = "ABABCABAB";
    search(pat, txt);
    return 0;
} 

Code implementation steps:

  1. Create a program that contains a main string and a target string, and define a search function.

  2. In the search function, iterate over the main string, moving one step at a time. For each substring in the main string, if the substring is equal to the target string, print out the position of the substring in the main string.

  3. The main function calls this search function, passing in the main string and target string as parameters.

2. KMP algorithm (Knuth-Morris-pratt- Algorithm)

The KMP algorithm (Knuth-Morris-Pratt algorithm) is an improved string matching algorithm. Compared with the naive matching algorithm, when the matching fails, by pre-calculating the partial matching table of the target string, backtracking can be avoided to improve the matching efficiency. The time complexity of the KMP algorithm is O(m + n), where m and n represent the lengths of the target string and the string to be found respectively.

KMP algorithm C++ code example:

#include <iostream>
#include <string>
#include <vector>

std::vector<int> computeLPSArray(std::string pat) {
    int M = pat.length();
    std::vector<int> lps(M);
    int len = 0;
    lps[0] = 0;
    int i = 1;
    while (i < M) {
        if (pat[i] == pat[len]) {
            len + + ;
            lps[i] = len;
            i + + ;
        } else {
            if (len != 0) {
                len = lps[len - 1];
            } else {
                lps[i] = 0;
                i + + ;
            }
        }
    }
    return lps;
}

void KMPSearch(std::string pat, std::string txt) {
    int M = pat.length();
    int N = txt.length();
    std::vector<int> lps = computeLPSArray(pat);
    int i = 0;
    int j = 0;
    while (i < N) {
        if (pat[j] == txt[i]) {
            j + + ;
            i + + ;
        }
        if (j == M) {
            std::cout << "Pattern found at index " << (i - j) << std::endl;
            j = lps[j - 1];
        } else if (i < N & amp; & amp; pat[j] != txt[i]) {
            if (j != 0)
                j = lps[j - 1];
            else
                i + + ;
        }
    }
}

int main() {
    std::string txt = "ABABDABACDABABCABAB";
    std::string pat = "ABABCABAB";
    KMPSearch(pat, txt);
    return 0;
}

Code implementation steps:

1. Create a function `computeLPSArray()` that calculates a partial matching array (also called the longest common suffix array). This function calculates the partial match value of the target string. In the function, initialize the length of the target string and set the first element of the partial match table to 0. Then by traversing the target string, using two pointers to find the largest matching suffix value, and filling in the partial matching array.

2. In the KMPSearch function, first calculate the partial matching array of the target string. Then iterate over the string to be found and use two pointers i and j to point to the main string and pattern string respectively. Compare the current characters, and if they match, both pointers will move backward at the same time. Otherwise, adjust the position of the pattern string pointer j according to the partial matching table. When the match is successful, the output pattern string is at the beginning of the main string.

3. In the main function, call the KMPSearch function and pass in the string to be found and the target string as parameters.

3. Rabin-karp algorithm

The Rabin-Karp algorithm is a hash-based string matching algorithm. Use a hash function to calculate the hash value of the search string and the target string, and perform a character-level comparison only when the hash values match. The average time complexity of the Rabin-Karp algorithm under normal circumstances is O(m + n), where m represents the length of the target string and n represents the length of the string to be found. In the worst case, the time complexity may be O(mn).

Rabin-Karp algorithm C++ code example:

#include <iostream>
#include <string>

#define prime 101

long long create_hash_value(const std::string & amp;str, int end) {
    long long hash_val = 0;
    for (int i = 0; i <= end; i + + ) {
        hash_val + = str[i] * pow(prime, i);
    }
    return hash_val;
}

long long recalculate_hash(const std::string & amp;str, int old_index, int new_index, long long old_hash, int pattern_length) {
    long long new_hash = old_hash - str[old_index];
    new_hash /= prime;
    new_hash + = str[new_index] * pow(prime, pattern_length - 1);
    return new_hash;
}

bool checkEqual(const std::string & amp;str1, int start1, const std::string & amp;str2, int start2, int length) {
    for (int i = 0; i < length; i + + ) {
        if (str1[start1 + i] != str2[start2 + i]) {
            return false;
        }
    }
    return true;
}

void RabinKarpSearch(const std::string & amp;pattern, const std::string & amp;text) {
    int m = pattern.length();
    int n = text.length();
    long long pattern_hash = create_hash_value(pattern, m - 1);
    long long text_hash = create_hash_value(text, m - 1);

    for (int i = 1; i <= n - m + 1; i + + ) {
        if (pattern_hash == text_hash & amp; & amp; checkEqual(pattern, 0, text, i - 1, m)) {
            std::cout << "Pattern found at index " << i - 1 << std::endl;
        }
        if (i < n - m + 1) {
            text_hash = recalculate_hash(text, i - 1, i + m - 1, text_hash, m);
        }
    }
}

int main() {
    std::string text = "ABABDABACDABABCABAB";
    std::string pattern = "ABABCABAB";
    RabinKarpSearch(pattern, text);
    return 0;
}

Code implementation steps:

1. Define a function `create_hash_value()` for calculating the hash value of a string, providing the string and the end character position as parameters. Calculate the hash value using a loop over the string.

2. Define a `recalculate_hash()` function to calculate the rolling hash value. Provide the original string, old hash value, old character index, new character index and pattern string length as parameters.

3. Define an auxiliary function `checkEqual()` to perform a character-by-character comparison to determine whether the strings are equal if the hash values match.

4. Use the `RabinKarpSearch()` function to implement the Rabin-Karp algorithm. First calculate the hash value of the pattern string and the hash value of the first m characters of the main string. Iterate over the main string and output the matching position when the hashes match and the strings are equal. Next, the hash value of the next window is calculated until the entire main string is traversed.

5. In the main function, call the `RabinKarpSearch()` function and pass in the string to be found and the target string as parameters.

4. Ooyer-Moore algorithm

The Boyer-Moore algorithm is an efficient string matching algorithm that abandons front-to-back matching and switches to back-to-front matching. When characters do not match, the pattern string is shifted according to the “bad character rules” and “good suffix rules” to reduce the number of redundant comparisons and thereby improve matching efficiency. The preprocessing time complexity of the Boyer-Moore algorithm is O(m) (m represents the length of the target string). The matching time complexity is O(n/m) in the best case and O(n) in the worst case. ) (n is the length of the string to be found).

A stripped-down C++ implementation of the Boyer-Moore algorithm (reduced to a version that only uses the bad character rule):

#include<bits/stdc + + .h>
#define NO_OF_CHARS 256
 
int max(int a, int b) { return (a > b)? a: b; }
 
void badCharHeuristic(char *str, int size,
                        int badchar[NO_OF_CHARS])
{
    int i;
    for (i = 0; i < NO_OF_CHARS; i + + )
         badchar[i] = -1;
    for (i = 0; i < size; i + + )
         badchar[(int) str[i]] = i;
}

void search(char *txt, char *pat)
{
    int m = strlen(pat);
    int n = strlen(txt);
    int badchar[NO_OF_CHARS];
    badCharHeuristic(pat, m, badchar);
    int s = 0;
    while(s <= (n - m))
    {
        int j = m-1;
        while(j >= 0 & amp; & amp; pat[j] == txt[s + j])
            j--;
        if (j < 0)
        {
            printf("\\
 pattern occurs at shift = %d", s);
            s + = (s + m < n)? m-badchar[txt[s + m]] : 1;
        }
 
        else
            s + = max(1, j - badchar[txt[s + j]]);
    }
}
 
int main()
{
    char txt[] = "ABAAABCD";
    char pat[] = "ABC";
    search(txt, pat);
    return 0;
}

Code implementation steps:

1. Define the `max` function, which is used to determine and return the maximum value of two numbers in subsequent steps.

2. Define the `badCharHeuristic` function to preprocess and create the bad character table required for bad character rules. Create and initialize an array of size character set and set all positions in the array to -1. Then iterate over the target string and store its position in the target string (position relative to the end of the string) into an array based on the ASCII value of each character as an index.

3. Create the `search` function to implement the Boyer-Moore algorithm. First calculate the length of the target string and the string to be found, and then call the `badCharHeuristic` function to create a bad character table. Use the variable s to track the module

5. Sunday algorithm

The Sunday algorithm is a string matching algorithm, somewhat similar to the Boyer-Moore algorithm, but simpler than it. The Sunday algorithm can also complete string matching within a time complexity of O(n) (n represents the length of the string to be found). The preprocessing time complexity is O(m + σ), where m represents the length of the target string. σ represents the character set size. In the Sunday algorithm, a structure called “offset table” is mainly used to determine the number of jump steps of the pattern string.

Sunday algorithm C++ code example:

#include<iostream>
#include<string>
#include<vector>
#define NO_OF_CHARS 256
using namespace std;

vector<int> makeOffsetTable(string pat) {
    int m = pat.length();
    vector<int> table(NO_OF_CHARS, m + 1);

    for (int i = 0; i < m; + + i)
        table[(int)pat[i]] = m - i;

    return table;
}

void SundaySearch(string txt, string pat) {
    vector<int> table = makeOffsetTable(pat);
    int m = pat.length();
    int n = txt.length();
    int i = 0;

    while (i <= n - m) {
        if (txt.substr(i, m) == pat) {
            cout << "Pattern found at index " << i << endl;
        }
        i + = table[txt[i + m]];
    }
}

int main() {
    string txt = "ABCDABCDABEE";
    string pat = "ABCDABE";
    SundaySearch(txt, pat);
    return 0;
}

Code implementation steps:

1. First define the function `makeOffsetTable` to create the offset table required by the Sunday algorithm. This creates an array of the size of the character set and initializes each position to the length of the pattern string plus one. Then iterate through the pattern string and save the “offset value” of each character in the pattern string to the array position corresponding to its ASCII value.

2. Define the main function `SundaySearch`, and first construct the offset table based on the pattern string. Then iterate through the main string for matching, matching the substring of the length of the pattern string each time, and if the match is successful, print the matching position. Then, the starting position of the next round of matching is determined based on the offset value in the offset table corresponding to the character located one digit after the current matching string in the main string.

3. In the main function, call the `SundaySearch()` function and pass in the string to be found and the target string as parameters.

6. Aho-Corasick algorithm

Aho-Corasick algorithm is a string matching algorithm, mainly used to find multiple pattern strings in the main string. The main advantage of this algorithm is that it can complete the search for all pattern strings in O(n + m + z) time, where n is the main string length, m is the total length of all pattern strings, and z is the number of matching pattern strings. It mainly uses the concepts of dictionary tree (Trie) data structure and state machine to improve the efficiency of matching.

Aho-Corasick algorithm C++ code example:

#include<bits/stdc + + .h>
using namespace std;
const int ALPHABET_SIZE = 26;

struct AcNode
{
    AcNode *children[ALPHABET_SIZE];
    bool isEndOfWord;
    AcNode *failure;
    int occurrences;
};

AcNode *getNode(void)
{
    AcNode *pNode = new AcNode;
    pNode->isEndOfWord = false;
    for (int i = 0; i < ALPHABET_SIZE; i + + )
        pNode->children[i] = NULL;
    return pNode;
}
  
void insert(AcNode *root, string key)
{
    AcNode *pCrawl = root;
    for (int i = 0; i < key.length(); i + + )
    {
        int index = key[i] - 'a';
        if (!pCrawl->children[index])
            pCrawl->children[index] = getNode();
        pCrawl = pCrawl->children[index];
    }
    pCrawl->isEndOfWord = true;
}

void buildFailureLinks(AcNode *root)
{
    queue<AcNode*> q;
    if(root)
    {
        root->failure = root;
        q.push(root);
    }

    while (!q.empty())
    {
        AcNode *node = q.front();
        q.pop();

        for (int i = 0; i < ALPHABET_SIZE; + + i)
        {
            AcNode *child = node->children[i];
            if (!child)
                continue;
            AcNode *best = node->failure;
            while (best != root & amp; & amp; !best->children[i])
                best = best->failure;
            if (best->children[i])
                best = best->children[i];
            child->failure = best;
            q.push(child);
        }
    }
}

void search(AcNode *root, string key)
{
    AcNode *crawl = root;
    for (int i = 0; i < key.length(); i + + )
    {
        while (crawl != root & amp; & amp; !crawl->children[key[i] - 'a'])
            crawl = crawl->failure;
        if (crawl->children[key[i] - 'a'])
            crawl = crawl->children[key[i] - 'a'];
        AcNode *temp = crawl;
        while (temp != root & amp; & amp; temp->isEndOfWord)
        {
            cout << "Found pattern" << endl;
            temp = temp->failure;
        }
    }
}

int main()
{
    string patterns[] = {"he", "she", "yours", "hers"};
    AcNode *root = getNode();
    for(int i = 0; i < 4; + + i)
        insert(root, patterns[i]);
    buildFailureLinks(root);
    search(root, "yourshehersherssheshehersyours");
    return 0;
}

Code implementation steps:

1. Define the structure AcNode, in which the children array is used to store each child node, isEndOfWord identifies whether it is the end point of a word, and failure points to the node that should be jumped when there is a mismatch.

2. Define the getNode function to create a new node.

3. Create the insert function, which is used to insert a pattern string into the dictionary tree.

4. Create a buildFailureLinks function, which is used to build a jump node when each node is mismatched. Use breadth-first search to compare whether there is a direct jump among the children of the mismatched node of each node.

5. Create a search function to search for the pattern string in the main string in the dictionary tree. When a mismatch occurs, turn to the mismatch node to continue searching.

6. Create the root node of the dictionary tree in the main function, insert the pattern string, build a jump node, and start searching.

The knowledge points of the article match the official knowledge files, and you can further learn related knowledge. Algorithm skill tree Home page Overview 56965 people are learning the system