String matching algorithm KMP

What is the KMP algorithm:

KMP was discovered by three big cows: D.E.Knuth, J.H.Morris and V.R.Pratt at the same time. The first of them is the author of “The Art of Computer Programming”! !

The problem to be solved by the KMP algorithm is the pattern (pattern) positioning problem in the string (also called the main string). To put it simply, it is the keyword search we usually say. The pattern string is the keyword (called P in the following), and if it appears in a main string (called T in the following), its specific position is returned, otherwise -1 is returned (common means).

First of all, there is a very simple idea for this problem: match one by one from left to right, if a character does not match during the process, jump back and move the pattern string one bit to the right. What’s so difficult about it?

We can initialize like this:

After that, we only need to compare whether the character pointed to by the i pointer is consistent with the character pointed to by the j pointer. If they are consistent, they all move backwards, if they are not consistent, as shown in the figure below:

A and E are not equal, then move the i pointer back to the 1st position (assuming that the subscript starts from 0), j moves to the 0th position of the pattern string, and then start this step again:

Based on this idea we can get the following program:

/**
 * Brute force method
 * @param ts main string
 * @param ps pattern string
 * @return If found, return the subscript of the first character in the main string, otherwise -1
 */
public static int bf(String ts, String ps)
{
    char[] t = ts.toCharArray();
    char[] p = ps.toCharArray();

    int i = 0; // the position of the main string
    int j = 0; // the position of the pattern string

    while (i < t.length & amp; & amp; j < p.length)
    {
       if (t[i] == p[j]) // When two characters are the same, compare the next one
        {
           i + + ;
           j + + ;
       }
       else
       {
           i = i - j + 1; // Once there is no match, i goes back
           j = 0; // return j to 0
       }
    }
    if (j == p. length)
    {
       return i - j;
    }
    else
    {
       return -1;
    }
}

The above program is fine, but not good enough!

If it is searched artificially, it will definitely not move i back to the first place, because there is no A in front of the position where the main string fails to match, except for the first A. Why do we know that there is only one A in front of the main string? ? Because we already know that the first three characters all match! (this is very important). Moving past is definitely not a match either! There is an idea that i can not move, we only need to move j, as shown in the figure below:

The above situation is still relatively ideal, and we will compare it again at most. But if you search for “SSSSB” in the main string “SSSSSSSSSSSSSA”, you don’t know the mismatch until the last one is compared, and then i backtrack, the efficiency of this is obviously the lowest.

The big cows can’t bear the inefficient means of “brute force cracking”, so the three of them developed the KMP algorithm. The idea is the same as what we saw above: “use the effective information that has been partially matched, keep the i pointer from backtracking, and modify the j pointer to move the pattern string to a valid position as much as possible.”

Therefore, the whole point of KMP is that when a certain character does not match the main string, we should know where the j pointer should move?

Next, let’s discover the movement law of j by ourselves:

As shown in the figure: C and D do not match, where should we move j? Obviously number 1. Why? Because there is an A in front of it that is the same:

The same is true for the following figure:

You can move the j pointer to the second position, because the two letters in front are the same:

So far we can roughly see a clue, when the matching fails, j will move to the next position k. There is such a property: the first k characters are the same as the last k characters before j.

If expressed in a mathematical formula, this is

P[0 ~ k-1] == P[j-k ~ j-1]

This is very important. If you find it difficult to remember, you can understand it through the following picture:

After figuring this out, it should be possible to understand why j can be moved directly to the k position.

because:

When T[i] != P[j]
There is T[i-j ~ i-1] == P[0 ~ j-1]
By P[0 ~ k-1] == P[j-k ~ j-1]
Necessary: T[i-k ~ i-1] == P[0 ~ k-1]

The formula is very boring, just understand it, don’t need to memorize it.

This paragraph is just to prove why we can directly move j to k without comparing the previous k characters.

Well, the next point is the key point, how to ask for this (these) k? Because a mismatch may occur at each position of P, that is to say, we need to calculate k corresponding to each position j, so we use an array next to save, next[j] = k, which means that when T[i] != When P[j], the next position of the j pointer.

Many textbooks or blog posts are vague or simply mentioned here, or even posted a piece of code. Why do you ask for it? How can I ask for this? It’s not clear at all. And here is precisely the most critical part of the whole algorithm.

public static int[] getNext(String ps)
{
    char[] p = ps.toCharArray();
    int[] next = new int[p. length];

    next[0] = -1;
    int j = 0;
    int k = -1;

    while (j < p.length - 1)
    {
       if (k == -1 || p[j] == p[k])
       {
           next[ + + j] = + + k;
       }
       else
       {
           k = next[k];
       }
    }
    return next;
}

This version of the algorithm for finding the next array should be the most widely circulated, and the code is very concise. But it is really confusing. What is the basis for its calculation?

Well, let’s put this aside first, and let’s deduce the idea by ourselves. Now we must always remember that the value of next[j] (that is, k) means that when P[j] != T[i], the value of the j pointer Next move location.

Let’s look at the first one first: when j is 0, what should we do if there is no match at this time?

In the case of the picture above, j is already on the far left, and it is impossible to move any more. At this time, the i pointer should move backward. So there will be next[0] = -1; this initialization in the code.

What if it is when j is 1?

Obviously, the j pointer must be moved back to the 0 position. Because there is only one position in front of it~~~

The following is the most important, please see the following picture:

Please compare the two images carefully.

We found a rule:

When P[k] == P[j],
There is next[j + 1] == next[j] + 1

In fact, this is provable:

Because P[0 ~ k-1] == p[j-k ~ j-1] before P[j]. (next[j] == k)
At this time, P[k] == P[j], can we get P[0 ~ k-1] + P[k] == p[j-k ~ j-1] + P[j].
That is: P[0 ~ k] == P[j-k ~ j], that is, next[j + 1] == k + 1 == next[j] + 1.

The formula here is not very easy to understand, but it will be easier to understand by looking at the picture.

What if P[k] != P[j]? For example, as shown in the figure below:

In this case, if you look at the code, it should be this sentence: k = next[k]; why is it like this? You should understand the following.

Now you should know why k = next[k]! Like the example above, it is impossible for us to find the longest suffix string [A, B, A, B], but we may still find a prefix string like [A, B], [B]. So this process seems to be positioning the string [A, B, A, C], when C is different from the main string (that is, the position of k is different), then of course the pointer is moved to next[k] .

With the next array, everything is easy to handle, we can write the KMP algorithm by hand:

public static int KMP(String ts, String ps)
{
    char[] t = ts.toCharArray();
    char[] p = ps.toCharArray();

    int i = 0; // the position of the main string
    int j = 0; // the position of the pattern string
    int[] next = getNext(ps);

    while (i < t.length & amp; & amp; j < p.length)
    {
       if (j == -1 || t[i] == p[j]) // When j is -1, it is i to move, of course j should also return to 0
       {
           i + + ;
           j + + ;
       }
       else
       {
           // i don't need to backtrack
           // i = i - j + 1;
           j = next[j]; // j returns to the specified position
       }
    }
    if (j == p. length)
    {
       return i - j;
    }
    else
    {
       return -1;
    }
}

Compared with brute force cracking, 4 places have been changed. The most important point is that i does not need to backtrack.

Finally, let’s look at the flaws in the above algorithm. Let’s look at the first example:

Obviously, the next array obtained by our above algorithm should be [-1, 0, 0, 1]

So the next step we should be to move j to the first element:

It is not difficult to find that this step is completely meaningless. Because the latter B is no longer matched, then the previous B must also be mismatched, and the same situation actually occurs on the second element A.

Obviously, the problem occurs because P[j] == P[next[j]].

So we only need to add a judgment condition:

public static int[] getNext(String ps)
{
    char[] p = ps.toCharArray();
    int[] next = new int[p. length];

    next[0] = -1;
    int j = 0;
    int k = -1;

    while (j < p.length - 1)
    {
       if (k == -1 || p[j] == p[k])
       {
           if (p[ + + j] == p[ + + k]) // skip when two characters are equal
            {
              next[j] = next[k];
           }
           else
           {
              next[j] = k;
           }
       }
       else
       {
           k = next[k];
       }
    }
    return next;
}