A brief discussion on the edit distance algorithm

Interview questions

  • In a long string, search for a short string. The short string may have some noise, that is, some fields do not match, and the short string is about one percent of the length of the long string. How to quickly find this short string? string.
  • At that time, the interviewer reminded me of the edit distance algorithm ({% del, I’m so timid, I’ve never heard of it %}). Later, I looked at it and found that it was actually a classic DP. Alas
  • The edit distance algorithm measures the minimum number of edit operations required to convert one string to another. Because short strings are noisy and will not exactly match substrings in long strings, we can filter certain edits. substring within the number of times.
public class EditDistanceSearch {
    /**
     * @param target target substring
     * @param longString long string
     * @param threshold fault tolerance threshold, i.e. distance number, to solve the noise problem
     * @return
     */
    public static List<String> findSimilarSubstrings(String target, String longString, int threshold) {
        // result set
        List<String> similarSubstrings = new ArrayList<>();
        // Traverse long string
        for (int i = 0; i < longString.length() - target.length(); i + + ) {
            // Take substrings of equal length and calculate the edit distance
            String substring = longString.substring(i, i + target.length());
            int distance = calculateEditDistance(substring, target);
            // Filter based on fault tolerance threshold
            if (distance <= threshold)
                similarSubstrings.add(substring);
        }

        return similarSubstrings;
    }


    // Edit distance algorithm
    public static int calculateEditDistance(String str1, String str2) {
        int m = str1.length();
        int n = str2.length();

        //Initialize matrix
        int[][] dp = new int[m + 1][n + 1];
        for (int i = 0; i <= m; i + + ) {
            dp[i][0] = i;
        }
        for (int j = 0; j <= n; j + + ) {
            dp[0][j] = j;
        }

        // Calculate edit distance
        for (int i = 1; i <= m; i + + ) {
            for (int j = 1; j <= n; j + + ) {
                if (str1.charAt(i - 1) == str2.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1];
                } else {
                    dp[i][j] = Math.min(dp[i - 1][j - 1], Math.min(dp[i][j - 1], dp[i - 1][j])) + 1;
                }
            }
        }

        return dp[m][n];
    }

    public static void main(String[] args) {
        String target = "kitten";
        String longString = "skittlesittingbingocat";
        int threshold = 2;

        List<String> similarSubstrings = findSimilarSubstrings(target, longString, threshold);

        System.out.println("Find similar substrings:");
        for (String substring : similarSubstrings) {
            System.out.println(substring);
        }
    }
}

Overview

  • Edit Distance, also known as Levenshtein distance, is a commonly used algorithm to measure the similarity between two strings. It measures the minimum number of edit operations required to convert one string into another, where edit operations can be inserting, deleting, or replacing characters.
  • Some common applications include:
    1. Spelling correction: By calculating the edit distance between the target word and the candidate word, the closest correct spelling of the target word can be found.
    2. Text similarity: Edit distance can be used to compare the similarity between two pieces of text, such as matching similar keywords in text search or finding similar text in document clustering.
    3. Genome alignment: In bioinformatics, edit distance can be used to compare the similarity between DNA sequences or protein sequences for genome alignment and analysis.
    4. Speech recognition: By calculating the edit distance between speech signals, the best matching text transcription can be found in speech recognition tasks.
    5. Data cleaning: In data processing, edit distance can be used to clean and normalize data, such as correcting user input errors or matching similar entity names.
  • Leetcode question link: https://leetcode.cn/problems/edit-distance/
    {% note blud no-icon %}
  • The meaning of dp[i][j] is the minimum number of steps it takes to change word1[i] to word2[j]
    • When word1[i] == word2[j], dp[i][j] = dp[i-1][j-1];
    • When word1[i] != word2[j], dp[i][j] = min(dp[i-1][j-1], dp[i-1][j], dp[i][j -1]) + 1
  • The above state transition equation can be viewed in conjunction with the following table
“” h o r s e
“” 0 1 2 3 4 5
r 1 1 2 2 3 4
o 2 2 1 2 3 4
s 3 3 2 2 2 3
  • first row
    1. Changing from "" to "" requires 0 steps
    2. It takes 1 step to change from "" to h, inserting a h
    3. It takes 2 steps to change from "" to ho. On the basis of 2, insert another o
    4. It takes 3 steps to change from "" to hor. On the basis of 3, insert another r
    5. and so on
  • first row
    1. Changing from "" to "" requires 0 steps
    2. Changing from r to "" requires one step, deleting a r
    3. It takes 2 steps to change from ro to "". On the basis of 2, delete another o
    4. It takes 3 steps to change from r to "". On the basis of 3 steps, delete another s
  • diagonal
    1. It takes 0 steps to change from "" to ""
    2. It takes 1 step to change from r to h, replace r with h
  • Summary: dp[i-1][j-1] represents the replacement operation, dp[i-1][j] represents the deletion operation, dp[i ][j-1] represents the insertion operation.

{% endnote %}

class Solution {
    public int minDistance(String word1, String word2) {
        int m = word1.length();
        int n = word2.length();
        int[][] dp = new int[m + 1][n + 1];

        // initialization
        for (int i = 0; i <= n; i + + ) {
            dp[0][i] = i;
        }
        for (int i = 0; i <= m; i + + ) {
            dp[i][0] = i;
        }

        // Calculate edit distance
        for (int i = 1; i <= m; i + + ) {
            for (int j = 1; j <= n; j + + ) {
                if (word1.charAt(i - 1) == word2.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1];
                } else
                    dp[i][j] = Math.min(Math.min(dp[i - 1][j], dp[i][j - 1]), dp[i - 1][j - 1]) + 1;
            }
        }
        return dp[m][n];
    }
}

TODO

  • A new category is added to record the algorithms encountered in the work.