KMP string – explain the next array in simple terms

Contents (analysis section)
1. Prerequisite knowledge (text string and pattern string concepts, the brute-force idea)
2. Encounter next (intuitive understanding; drawing a picture as you read is recommended)
3. Showdown with next (detailed explanation of the next array; note that it is the next array of a single string)
4. hello, kmp! (the core of the KMP algorithm; read this part only after mastering next, otherwise study the previous parts first until you fully understand them)
5. Why O(n + m)? (time complexity analysis)
Code (detailed comments)

Analysis:

Prerequisite knowledge

In the string matching problem, two strings, text and pattern, are given (in this problem, text is S and pattern is P), and we must determine whether pattern is a substring of text. Generally, text is called the text string and pattern is called the pattern string. The brute-force solution is:
Enumerate the starting position i in text, then compare pattern character by character starting from that position. If every character matches, the match succeeds; otherwise, as soon as one character differs, move the starting position to i + 1 and start matching pattern from the beginning again. Let m be the length of the text string and n the length of the pattern string. The time complexity is O(nm). Obviously, this is unacceptable when both n and m reach the order of 10^5.
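The brute-force idea above can be sketched as follows. This is only an illustration, not code from the original solution: the function name bruteForceFind and the use of std::string and std::vector are assumptions made for the example.

```cpp
#include <string>
#include <vector>

// Brute-force search (hypothetical helper, for illustration only).
// Tries every start position i in text and compares pattern character by
// character; the worst case needs about n * m comparisons.
std::vector<int> bruteForceFind(const std::string& text, const std::string& pattern)
{
    std::vector<int> starts;
    const int m = static_cast<int>(text.size());    // length of the text string
    const int n = static_cast<int>(pattern.size()); // length of the pattern string
    if (n == 0 || n > m) return starts;

    for (int i = 0; i + n <= m; i++)
    {
        int k = 0;
        // Compare pattern[0...n-1] with text[i...i+n-1], stopping at the first difference
        while (k < n && text[i + k] == pattern[k]) k++;
        if (k == n) starts.push_back(i); // every character matched
    }
    return starts;
}
```

For text = "aaaa" and pattern = "aa", this sketch returns the start positions 0, 1 and 2. Note that after each failure the comparison restarts from pattern[0], which is exactly the waste KMP removes.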
So what should we do? Here is a small spoiler of how KMP optimizes this. We will not go too deep; the goal is just an intuitive picture of the way forward. Suppose that for text and pattern there is an i (much smaller than n and m) such that text[i + 1] != pattern[i + 1], so the current match fails. Instead of using the brute-force solution above, we can ask:
Can we find a value, call it length, such that pattern[0…length] == pattern[i - length…i] == text[i - length…i], so that these three parts are equal? Then pattern does not need to start matching from the beginning: a prefix of pattern equals a suffix of the matched part of text, so we can shift pattern immediately, because this part would certainly match again if we compared it. We then compare pattern[length + 1] with text[i + 1]. If that succeeds, we keep going. If it fails, we will deal with that later.
OK, let's stop here. It doesn't matter if the little spoiler above doesn't make sense yet. Next, we will work out step by step how to implement this.

Encounter next

Facing a complex algorithm like KMP, we must take it apart and conquer it piece by piece. Suppose there is a string s (subscripts start from 0); then its substring ending at position i is s[0…i]. For this substring, the prefix and suffix of length k + 1 are s[0…k] and s[i-k…i] respectively. We construct an int array (the name doesn't matter; just call it next), where next[i] is the largest k such that the prefix s[0…k] equals the suffix s[i-k…i] within the substring s[0…i]. (Note that the equal prefix and suffix cannot be s[0…i] itself. This is very important and will be used in the perceptual analysis later.) If there is no equal prefix and suffix, let next[i] = -1. Obviously, next[i] is the subscript of the last character of the prefix in the longest equal prefix and suffix.
Next, an example shows how the next array evolves, with s = "ababaab". For each next[i], two readings are given. It is recommended to look at the first and then figure out the second.

The first method directly draws the longest equal prefix and suffix of the substring s[0…i]:
The second method gives the suffix in the upper part and the prefix in the lower part, and then frames the longest equal prefix and suffix.

We conduct a perceptual analysis of these two methods (if the above two figures can be fully understood, this part can be skipped):

  1. i = 0: The substring s[0…i] is "a". No equal prefix and suffix can be found (neither the prefix nor the suffix can be the substring s[0…i] itself), so next[0] = -1.
  2. i = 1: The substring s[0…i] is "ab". No equal prefix and suffix can be found (neither the prefix nor the suffix can be the substring s[0…i] itself), so next[1] = -1.
  3. i = 2: The substring s[0…i] is "aba". The largest k that makes the prefix and suffix equal is 0: the suffix s[i-k…i] is "a" and the prefix s[0…k] is also "a". When k = 1, the suffix s[i-k…i] is "ba" and the prefix s[0…k] is "ab"; they are not equal, so next[2] = 0.
  4. i = 3: The substring s[0…i] is "abab". The largest k that makes the prefix and suffix equal is 1: the suffix s[i-k…i] is "ab" and the prefix s[0…k] is also "ab". When k = 2, the suffix s[i-k…i] is "bab" and the prefix s[0…k] is "aba"; they are not equal, so next[3] = 1.
  5. i = 4: The substring s[0…i] is "ababa". The largest k that makes the prefix and suffix equal is 2: the suffix s[i-k…i] is "aba" and the prefix s[0…k] is also "aba". When k = 3, the suffix s[i-k…i] is "baba" and the prefix s[0…k] is "abab"; they are not equal, so next[4] = 2.
  6. i = 5: The substring s[0…i] is "ababaa". The largest k that makes the prefix and suffix equal is 0: the suffix s[i-k…i] is "a" and the prefix s[0…k] is also "a". When k = 1, the suffix s[i-k…i] is "aa" and the prefix s[0…k] is "ab"; they are not equal, so next[5] = 0.
  7. i = 6: The substring s[0…i] is "ababaab". The largest k that makes the prefix and suffix equal is 1: the suffix s[i-k…i] is "ab" and the prefix s[0…k] is also "ab". When k = 2, the suffix s[i-k…i] is "aab" and the prefix s[0…k] is "aba"; they are not equal, so next[6] = 1.
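To check the values in this walkthrough mechanically, the definition of next can be applied directly: for each i, try every k from the largest down and keep the first one where the prefix and suffix agree. The sketch below is deliberately slow and is not the method used later in this article; the name naiveNext and the use of std::string::compare are choices made only for this illustration.

```cpp
#include <string>
#include <vector>

// Computes next[] straight from its definition (hypothetical helper).
// For each i, k runs from i - 1 down to 0, so the prefix s[0...k] can never
// be the whole substring s[0...i]. The first k with s[0...k] == s[i-k...i]
// is the largest one; if none exists, next[i] stays -1.
std::vector<int> naiveNext(const std::string& s)
{
    const int len = static_cast<int>(s.size());
    std::vector<int> next(len, -1);
    for (int i = 0; i < len; i++)
    {
        for (int k = i - 1; k >= 0; k--)
        {
            // Prefix s[0...k] has length k + 1; the suffix of that length starts at i - k
            if (s.compare(0, k + 1, s, i - k, k + 1) == 0)
            {
                next[i] = k;
                break;
            }
        }
    }
    return next;
}
```

Calling naiveNext("ababaab") yields -1, -1, 0, 1, 2, 0, 1, which agrees with the seven cases analyzed above.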

Here it is emphasized again: next[i] is the subscript of the last character of the prefix in the longest equal prefix and suffix of the substring s[0…i].
This sentence may be confusing at first reading, but please be sure to understand it. We call it the Supreme Concept, and it is the key to solving the problem.

Showdown with next
So how do we compute next? Brute force is possible, but it requires too many passes. Instead, we use a recurrence to compute the next array efficiently. That is, we assume that next[0] ~ next[i-1] have already been obtained and use them to calculate next[i].

Let's still use s = "ababaab", the string we just analyzed perceptually, as an example. Suppose we already have next[0] = -1, next[1] = -1, next[2] = 0, next[3] = 1, and now want to compute next[4]. As shown in the figure below, with next[3] = 1 obtained, the longest equal prefix and suffix is "ab". When computing next[4], since s[4] == s[next[3] + 1] (why next[3] here? Think about the Supreme Concept), the longest equal prefix and suffix "ab" extends to "aba", so next[4] = next[3] + 1, and we let j point to next[4].


Then compute next[5] on this basis. As shown in the figure below, with next[4] = 2 obtained, the longest equal prefix and suffix is "aba". When computing next[5], since s[5] != s[next[4] + 1], the current equal prefix and suffix cannot be extended, so next[5] cannot be obtained directly as next[4] + 1. Since the equal prefix and suffix cannot be that long, we might as well shorten it a bit! We hope to find a j such that s[5] == s[j + 1] holds and, at the same time, the wavy-line part (~) in the figure, s[0…j], is a suffix of s[0…2] = "aba"; that s[0…j] is a prefix of s[0…2] is obvious. To make the equal prefix and suffix as long as possible, j should be as large as possible.


In fact, the ~ part we need in the figure above, s[0…j], is both a prefix of s[0…2] = "aba" and a suffix of s[0…2] = "aba", and we want it to be as long as possible, so s[0…j] is the longest equal prefix and suffix of s[0…2]. That is, we only need to let j = next[2] and then check whether s[5] == s[j + 1] holds. If it holds, s[0…j + 1] is the longest equal prefix and suffix of s[0…5], so let next[5] = j + 1. If it does not hold, keep letting j = next[j] until j returns to -1 or s[5] == s[j + 1] holds along the way.

As shown in the figure above, j rolls back from 2 to next[2] = 0, and we find that s[5] == s[j + 1] does not hold, so j continues to roll back from 0 to next[0] = -1. Since j has reached -1, it stops rolling back. At this point s[i] == s[j + 1] holds, which means s[0…j + 1] is the longest equal prefix and suffix of s[0…5], so let next[5] = j + 1 = -1 + 1 = 0, and let j point to next[5].

The following summarizes the solution process of the next array and gives the code:

  1. Initialize the next array and let j = next[0] = -1.
  2. Let i traverse the range 1 ~ len - 1; for each i, execute steps 3 and 4 to compute next[i].
  3. Keep letting j = next[j] until j rolls back to -1 or s[i] == s[j + 1] holds.
  4. If s[i] == s[j + 1], then next[i] = j + 1; otherwise next[i] = j.

next[0] = -1;
for (int i = 1, j = -1; i < len; i++)
{
    while (j != -1 && s[i] != s[j + 1]) // prefix and suffix matching failed
    {
        // Keep rolling j back until j reaches -1 or s[i] == s[j + 1]
        j = next[j];
    }
    if (s[i] == s[j + 1]) // match succeeds
    {
        j++; // the longest equal prefix and suffix becomes longer
    }
    next[i] = j; // let next[i] = j
}

Please also be clear: so far we have only built the next array of a single string!
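The recurrence summarized above is a fragment that relies on s, len and next being declared elsewhere. One possible self-contained packaging is sketched below. The function name buildNext and the switch from raw arrays to std::string and std::vector are assumptions made for this example, not part of the original code.

```cpp
#include <string>
#include <vector>

// Self-contained version of the recurrence for a single string
// (buildNext is a hypothetical name chosen for this sketch).
std::vector<int> buildNext(const std::string& s)
{
    const int len = static_cast<int>(s.size());
    std::vector<int> next(len, -1); // next[0] = -1 by definition
    for (int i = 1, j = -1; i < len; i++)
    {
        // Roll j back until it reaches -1 or s[i] can extend the prefix s[0...j]
        while (j != -1 && s[i] != s[j + 1]) j = next[j];
        // Extend the longest equal prefix and suffix by one character
        if (s[i] == s[j + 1]) j++;
        next[i] = j;
    }
    return next;
}
```

For s = "ababaab" this returns -1, -1, 0, 1, 2, 0, 1, the same values obtained in the step-by-step analysis.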

hello, kmp!

On this basis, we enter KMP. With the next array in hand, the KMP algorithm follows the same template. Given a text string text and a pattern string pattern, determine whether pattern is a substring of text.
Take text = "abababaabc" and pattern = "ababaab" as an example. Let i point to the character of text currently being compared, and let j point to the last character of pattern that has been matched. As long as text[i] == pattern[j + 1] holds, pattern[j + 1] is also matched, and i and j are each incremented by 1 to continue the comparison until j reaches n - 1 (n is the length of pattern), which means pattern is a substring of text. In this example, i points to text[4] and j points to pattern[3], meaning pattern[0…3] has been matched. Now text[i] == pattern[j + 1] holds, so pattern[4] is matched and i and j are each incremented by 1.


Then continue matching. Now i points to text[5] and j points to pattern[4], meaning pattern[0…4] has been matched. We check whether text[i] == pattern[j + 1] holds: if it did, pattern[0…5] would be matched, and i and j could each be incremented by 1 to match the next character. But here text[5] != pattern[4 + 1], so the match fails. What do we do now? Discard the successful match of pattern[0…4], let j go back to -1 and start over? That is the brute-force solution. Let's see how KMP handles it.

To avoid letting j fall straight back to -1, we look for a j' as close as possible to the current j (here j is 4) such that text[i] == pattern[j' + 1] can hold and pattern[0…j'] still matches the corresponding part of text, that is, pattern[0…j'] is a suffix of pattern[0…j]. This recalls the similar problem we met earlier when computing the next array. The answer is that pattern[0…j'] is the longest equal prefix and suffix of pattern[0…j]. In other words, just keep letting j = next[j] until j falls back to -1 or text[i] == pattern[j + 1] holds, then continue matching. This gives the meaning of the next array: next[j] is the position j should fall back to when position j + 1 fails to match. In the example just now, when text[5] does not match pattern[4 + 1], let j = next[4] = 2; then text[i] == pattern[j + 1] holds, so matching continues until j == 6 is also matched, which means pattern is a substring of text.

The general idea of the kmp algorithm is as follows:

  1. Initialize j = -1, indicating the last bit of the pattern that is currently matched.
  2. Let i traverse the text string text, and for each i, execute 3 and 4 to try to match text[i] and pattern[j + 1].
  3. Keep letting j = next[j] until j rolls back to -1 or text[i] == pattern[j + 1] holds.
  4. If text[i] == pattern[j + 1], then let j++. If j reaches pattern_len - 1, pattern is a substring of text.

// Computing the next array is omitted here
int j = -1; // nothing has been matched yet
for (int i = 0; i < text_len; i++)
{
    while (j != -1 && text[i] != pattern[j + 1])
    {
        j = next[j];
    }
    // text[i] matches pattern[j + 1], so advance j
    if (text[i] == pattern[j + 1])
    {
        j++;
    }
    if (j == pattern_len - 1)
    {
        // pattern is a substring of text; handle it as the problem requires
        // (for example, report the match and stop, or let j = next[j] to keep searching)
    }
}

Looking back at the analysis above, we can see that the process of computing the next array is actually the pattern string matching against itself.

Consider how to report the starting subscripts of pattern in text:
When j = n - 1 (n is the length of pattern), the pattern has matched completely, and i - j can be output: i is the position in text where the match ends and j = n - 1, so i - j is the subscript in text where this occurrence of pattern starts. But the question is: where in pattern should the next match resume? Since multiple occurrences of pattern in text may overlap, we cannot simply increment i and keep comparing; j must first fall back some distance. Here next[j] represents the longest equal prefix and suffix of the entire pattern. Falling back to this position keeps j as large as possible, that is, keeps the matched part as long as possible, which guarantees that no occurrence is missed and also saves many pointless comparisons in the next match.
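The overlapping case is easiest to see on a small input. The sketch below collects every start position instead of printing it; the function name kmpFindAll and the std::string / std::vector interface are assumptions made for the example, while the loop bodies follow the steps described in this article.

```cpp
#include <string>
#include <vector>

// Returns every start subscript of pattern in text, overlaps included
// (kmpFindAll is a hypothetical helper name).
std::vector<int> kmpFindAll(const std::string& text, const std::string& pattern)
{
    std::vector<int> starts;
    const int m = static_cast<int>(text.size());    // length of the text string
    const int n = static_cast<int>(pattern.size()); // length of the pattern string
    if (n == 0) return starts;

    // Build the next array of the pattern
    std::vector<int> next(n, -1);
    for (int i = 1, j = -1; i < n; i++)
    {
        while (j != -1 && pattern[i] != pattern[j + 1]) j = next[j];
        if (pattern[i] == pattern[j + 1]) j++;
        next[i] = j;
    }

    // Match the pattern against the text
    for (int i = 0, j = -1; i < m; i++)
    {
        while (j != -1 && text[i] != pattern[j + 1]) j = next[j];
        if (text[i] == pattern[j + 1]) j++;
        if (j == n - 1)
        {
            starts.push_back(i - j); // start subscript of this occurrence
            j = next[j];             // fall back so overlapping occurrences are not missed
        }
    }
    return starts;
}
```

With text = "aaaa" and pattern = "aa", the result is 0, 1, 2: after each full match j falls back to next[1] = 0, so the last matched "a" is reused as the start of the next occurrence. Resetting j to -1 instead would find only 0 and 2.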

Why O(n + m)?

We see that the for loop contains a while loop for every i, so the number of times j rolls back seems unpredictable. Why is the time complexity of KMP O(n + m)?
First, i is incremented by 1 throughout the for loop of KMP, so i changes O(m) times in total (m is the length of text). Next, consider how j changes. j increases in only one line of code, and only by 1 each time, so j increases at most m times during the whole process. Everywhere else j decreases, and since j never drops below -1, j can decrease at most m times during the whole process. That is, the while loop body executes at most m times in total, so j changes O(m) times. Since both i and j change O(m) times, the for loop costs O(m) overall. Computing the next array costs O(n) (n is the length of pattern; the analysis is the same as above), so the KMP algorithm takes O(n + m) time in total.
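The bound on the while loop can also be observed by counting. The sketch below is an instrumented variant of the matching loop written only for this illustration; the name countRollbacks is hypothetical, and it counts how many times j = next[j] runs inside the while loop during the matching phase.

```cpp
#include <string>
#include <vector>

// Counts executions of the while-loop body during matching
// (countRollbacks is a hypothetical helper for this illustration).
long long countRollbacks(const std::string& text, const std::string& pattern)
{
    const int m = static_cast<int>(text.size());
    const int n = static_cast<int>(pattern.size());
    if (n == 0) return 0;

    std::vector<int> next(n, -1);
    for (int i = 1, j = -1; i < n; i++)
    {
        while (j != -1 && pattern[i] != pattern[j + 1]) j = next[j];
        if (pattern[i] == pattern[j + 1]) j++;
        next[i] = j;
    }

    long long rollbacks = 0;
    for (int i = 0, j = -1; i < m; i++)
    {
        while (j != -1 && text[i] != pattern[j + 1])
        {
            j = next[j]; // each rollback lowers j by at least 1
            rollbacks++;
        }
        if (text[i] == pattern[j + 1]) j++; // the only place where j grows, once per i at most
        if (j == n - 1) j = next[j];        // full match: fall back and keep searching
    }
    return rollbacks;
}
```

For text = "aaabaaab" and pattern = "aaaa", the count is 6: each 'b' forces three rollbacks, which is paid for by the three increments on the preceding 'a' characters. The total never exceeds the text length 8, matching the argument above.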

Code (C++)

#include <iostream>

using namespace std;

const int N = 1000010;
char p[N], s[N]; // p is the pattern string, s is the text string; search for p in s
// "next" array: if position i stores the value k, then k is the subscript of
// the last character of the prefix in the longest equal prefix and suffix of p[0...i],
// i.e. p[0...k] == p[i-k...i]
int ne[N];
int n, m; // n is the pattern string length, m is the text string length

int main()
{
    cin >> n >> p >> m >> s;

    // There can be no equal prefix and suffix within p[0...0]
    ne[0] = -1;

    // Build the next array of the pattern string
    for (int i = 1, j = -1; i < n; i++)
    {
        while (j != -1 && p[i] != p[j + 1])
        {
            // The prefix and suffix do not match:
            // keep rolling j back until it reaches -1 or p[i] == p[j + 1]
            j = ne[j];
        }
        if (p[i] == p[j + 1])
        {
            j++; // match succeeds: the longest equal prefix and suffix grows by one character
        }
        ne[i] = j; // store ne[i] = j; j is reused when computing ne[i + 1]
    }

    // kmp start!
    for (int i = 0, j = -1; i < m; i++)
    {
        while (j != -1 && s[i] != p[j + 1])
        {
            j = ne[j];
        }
        if (s[i] == p[j + 1])
        {
            j++; // match succeeds: move to the next character of the pattern string
        }
        if (j == n - 1) // the whole pattern is matched; subscripts start at 0, hence n - 1
        {
            // i is where the match ends in the text and j = n - 1,
            // so i - j is the start position of this occurrence
            cout << i - j << ' ';

            // Occurrences of the pattern string in the text string may overlap,
            // so j must fall back before i moves on to the next character.
            // Falling back to ne[j] keeps j as large as possible, that is, keeps the matched part the longest
            j = ne[j];
        }
    }

    return 0;
}

