Interview questions
- In a long string, search for a short string. The short string may have some noise, that is, some fields do not match, and the short string is about one percent of the length of the long string. How to quickly find this short string? string.
- At that time, the interviewer reminded me of the edit distance algorithm ({% del, I’m so timid, I’ve never heard of it %}). Later, I looked at it and found that it was actually a classic DP. Alas
- The edit distance algorithm measures the minimum number of edit operations required to convert one string to another. Because short strings are noisy and will not exactly match substrings in long strings, we can filter certain edits. substring within the number of times.
public class EditDistanceSearch { /** * @param target target substring * @param longString long string * @param threshold fault tolerance threshold, i.e. distance number, to solve the noise problem * @return */ public static List<String> findSimilarSubstrings(String target, String longString, int threshold) { // result set List<String> similarSubstrings = new ArrayList<>(); // Traverse long string for (int i = 0; i < longString.length() - target.length(); i + + ) { // Take substrings of equal length and calculate the edit distance String substring = longString.substring(i, i + target.length()); int distance = calculateEditDistance(substring, target); // Filter based on fault tolerance threshold if (distance <= threshold) similarSubstrings.add(substring); } return similarSubstrings; } // Edit distance algorithm public static int calculateEditDistance(String str1, String str2) { int m = str1.length(); int n = str2.length(); //Initialize matrix int[][] dp = new int[m + 1][n + 1]; for (int i = 0; i <= m; i + + ) { dp[i][0] = i; } for (int j = 0; j <= n; j + + ) { dp[0][j] = j; } // Calculate edit distance for (int i = 1; i <= m; i + + ) { for (int j = 1; j <= n; j + + ) { if (str1.charAt(i - 1) == str2.charAt(j - 1)) { dp[i][j] = dp[i - 1][j - 1]; } else { dp[i][j] = Math.min(dp[i - 1][j - 1], Math.min(dp[i][j - 1], dp[i - 1][j])) + 1; } } } return dp[m][n]; } public static void main(String[] args) { String target = "kitten"; String longString = "skittlesittingbingocat"; int threshold = 2; List<String> similarSubstrings = findSimilarSubstrings(target, longString, threshold); System.out.println("Find similar substrings:"); for (String substring : similarSubstrings) { System.out.println(substring); } } }
Overview
- Edit Distance, also known as Levenshtein distance, is a commonly used algorithm to measure the similarity between two strings. It measures the minimum number of edit operations required to convert one string into another, where edit operations can be inserting, deleting, or replacing characters.
- Some common applications include:
- Spelling correction: By calculating the edit distance between the target word and the candidate word, the closest correct spelling of the target word can be found.
- Text similarity: Edit distance can be used to compare the similarity between two pieces of text, such as matching similar keywords in text search or finding similar text in document clustering.
- Genome alignment: In bioinformatics, edit distance can be used to compare the similarity between DNA sequences or protein sequences for genome alignment and analysis.
- Speech recognition: By calculating the edit distance between speech signals, the best matching text transcription can be found in speech recognition tasks.
- Data cleaning: In data processing, edit distance can be used to clean and normalize data, such as correcting user input errors or matching similar entity names.
- Leetcode question link: https://leetcode.cn/problems/edit-distance/
{% note blud no-icon %} - The meaning of dp[i][j] is the minimum number of steps it takes to change word1[i] to word2[j]
- When word1[i] == word2[j], dp[i][j] = dp[i-1][j-1];
- When word1[i] != word2[j], dp[i][j] = min(dp[i-1][j-1], dp[i-1][j], dp[i][j -1]) + 1
- The above state transition equation can be viewed in conjunction with the following table
“” | h | o | r | s | e | |
---|---|---|---|---|---|---|
“” | 0 | 1 | 2 | 3 | 4 | 5 |
r | 1 | 1 | 2 | 2 | 3 | 4 |
o | 2 | 2 | 1 | 2 | 3 | 4 |
s | 3 | 3 | 2 | 2 | 2 | 3 |
- first row
- Changing from
""
to""
requires 0 steps - It takes 1 step to change from
""
toh
, inserting ah
- It takes 2 steps to change from
""
toho
. On the basis of 2, insert anothero
- It takes 3 steps to change from
""
tohor
. On the basis of 3, insert anotherr
- and so on
- Changing from
- first row
- Changing from
""
to""
requires 0 steps - Changing from
r
to""
requires one step, deleting ar
- It takes 2 steps to change from
ro
to""
. On the basis of 2, delete anothero
- It takes 3 steps to change from
r
to""
. On the basis of 3 steps, delete anothers
- Changing from
- diagonal
- It takes 0 steps to change from
""
to""
- It takes 1 step to change from
r
toh
, replacer
withh
- It takes 0 steps to change from
- Summary:
dp[i-1][j-1]
represents the replacement operation,dp[i-1][j]
represents the deletion operation,dp[i ][j-1]
represents the insertion operation.
{% endnote %}
class Solution { public int minDistance(String word1, String word2) { int m = word1.length(); int n = word2.length(); int[][] dp = new int[m + 1][n + 1]; // initialization for (int i = 0; i <= n; i + + ) { dp[0][i] = i; } for (int i = 0; i <= m; i + + ) { dp[i][0] = i; } // Calculate edit distance for (int i = 1; i <= m; i + + ) { for (int j = 1; j <= n; j + + ) { if (word1.charAt(i - 1) == word2.charAt(j - 1)) { dp[i][j] = dp[i - 1][j - 1]; } else dp[i][j] = Math.min(Math.min(dp[i - 1][j], dp[i][j - 1]), dp[i - 1][j - 1]) + 1; } } return dp[m][n]; } }
TODO
- A new category is added to record the algorithms encountered in the work.