Sensitive word filtering based on deterministic finite automata (DFA algorithm)

1. Introduction to DFA algorithm

A DFA (Deterministic Finite Automaton) is a non-recursive automaton: it derives the next state from the current state and an input event, that is, event + state = nextstate.

Deterministic: both the states and the events that cause state transitions are determined unambiguously.

Finite: the number of states, and of the events that cause state transitions, is finite.

For the following state transition diagram:

(S,a) -> U
(S,b) -> V
(U,a) -> Q
(U,b) -> V
(V,a) -> U
(V,b) -> Q
(Q,a) -> Q
(Q,b) -> Q

[Figure: state transition diagram for states S, U, V, Q]

We can treat each text fragment as a state. For example, the keyword “match” can be split into five fragments, “m”, “ma”, “mat”, “matc”, and “match”, each of which is a state.

[Figure: the prefix states of the keyword “match”]

Process:

  • The initial state is empty; when the event “m” is triggered, it transitions to the state “m”;
  • The event “a” then transitions it to the state “ma”;
  • And so on, until it transitions to the final state “match”.
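These prefix states can be enumerated directly; a minimal sketch (the class name PrefixStates and the keyword “match” are illustrative):

```java
public class PrefixStates {
    public static void main(String[] args) {
        String word = "match";
        // Each prefix of the keyword is one state of the automaton:
        // m, ma, mat, matc, match
        for (int i = 1; i <= word.length(); i++) {
            System.out.println(word.substring(0, i));
        }
    }
}
```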

Let us consider the case of multiple keywords, such as “matching algorithm”, “matching keyword” and “information extraction”.

[Figure: state diagram for multiple keywords]

The state diagram above resembles a tree (a trie), and it is precisely this structure that makes the DFA algorithm faster at keyword matching than iterating over the keyword list with a for loop: each input character narrows the search to one subtree instead of rescanning every keyword.

2. Java implementation of the DFA algorithm

[Figure: state transition diagram for states S, U, V, Q]

The key to sensitive word filtering in Java is the implementation of the DFA algorithm.

Reading the diagram as lookups: from S we can reach U and V; from U, Q and V; and from V, U and Q. With this transformation, each state transition becomes a lookup in a Java collection.
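As a sketch of that idea, the diagram's transitions can be stored in nested maps, so that following an event is just a collection lookup (state and event names taken from the diagram above):

```java
import java.util.Map;

public class StateLookup {
    public static void main(String[] args) {
        // Each state maps an event to its next state
        Map<String, Map<Character, String>> dfa = Map.of(
                "S", Map.of('a', "U", 'b', "V"),
                "U", Map.of('a', "Q", 'b', "V"),
                "V", Map.of('a', "U", 'b', "Q"),
                "Q", Map.of('a', "Q", 'b', "Q"));
        String state = "S";
        for (char event : "ab".toCharArray()) {
            state = dfa.get(state).get(event); // collection lookup replaces the transition
        }
        System.out.println(state); // "ab": S -> U -> V
    }
}
```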

Suppose the following are sensitive words: “Japanese” and “Jap”.

[Figure: the sensitive words organized as a tree keyed by character]

In this way we build the sensitive lexicon into a tree, which greatly narrows the search range when judging whether a word is sensitive. For example, to check “Japanese”, we first select the subtree by its first character and then search only within that subtree.

But how do we judge that a sensitive word has ended? By using a flag bit.

So the key is how to build such a sensitive word tree. Below I take Java’s HashMap as an example to implement the DFA algorithm. Taking “Japanese” and “Jap” as the sensitive words, the process is as follows:

  1. First, get the root-node HashMap and judge whether the first character “J” exists in it. If it does not exist, no sensitive word starting with “J” exists yet, so use “J” as the key and create a new HashMap as its value under the root node.
  2. If it exists, there are already sensitive words starting with “J”. Set hashMap = hashMap.get(“J”), then match the following characters “a”, “p”, … in turn.
  3. Judge whether the character is the last character of the word. If it marks the end of the sensitive word, set the flag bit isEnd = 1; otherwise set isEnd = 0.

[Figure: the resulting HashMap tree with isEnd flag bits]

3. Java implementation case of the DFA algorithm

If “Japanese” and “Jap” are used as sensitive words, the following is the implementation process:

[Figure: build process for the sensitive word tree]


Code:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class DFAHashMap {
    public void createDFAHashMap(List<String> strings) {
        // 1. Create dfaHashMap, which represents the tree structure.
        // Specify the capacity as the size of the strings collection to avoid resizing while adding data
        HashMap<String, Object> dfaHashMap = new HashMap<>(strings.size());
        // 2. Create a temporaryHashMap that holds a temporary reference into the tree
        HashMap<String, Object> temporaryHashMap = new HashMap<>(strings.size());
        // 3. Traverse the strings
        for (String string : strings) {
            // 3.1 For each string, temporaryHashMap starts at dfaHashMap, the root node of the tree,
            // i.e. temporaryHashMap first points to the memory address of the root node dfaHashMap
            temporaryHashMap = dfaHashMap;
            // 4. Traverse each character of the string
            for (int i = 0; i < string.length(); i++) {
                // 5. Check whether the current character exists in the current temporaryHashMap
                String word = String.valueOf(string.charAt(i));
                HashMap<String, Object> resultHashMap = (HashMap<String, Object>) temporaryHashMap.get(word);
                if (resultHashMap == null) {
                    // 6. If the current character does not exist, use it as the key and create a new HashMap as the value
                    resultHashMap = new HashMap<String, Object>();
                    // 6.1 Since temporaryHashMap points into dfaHashMap, the value is stored in dfaHashMap
                    temporaryHashMap.put(word, resultHashMap);
                }
                // 7. Move temporaryHashMap to the next HashMap
                temporaryHashMap = resultHashMap;
                // 8. Decide whether to skip setting the flag in this iteration.
                // If temporaryHashMap already contains isEnd = 1, a shorter sensitive word already ends at
                // this node, and isEnd must not be overwritten.
                // For example, with "Jap" and "Japanese": "Jap" sets isEnd = 1 at its last node;
                // if "Japanese" later overwrote it with isEnd = 0, the keyword "Jap" would stop matching.
                if (temporaryHashMap.containsKey("isEnd") && temporaryHashMap.get("isEnd").equals(1)) {
                    continue;
                }
                // 8.1 Set the flag: 1 if the current character is the last character of the string, otherwise 0
                if (i == string.length() - 1) {
                    temporaryHashMap.put("isEnd", 1);
                } else {
                    temporaryHashMap.put("isEnd", 0);
                }
            }
        }
        System.out.println(dfaHashMap);
    }

    public static void main(String[] args) {
        List<String> strings = new ArrayList<>();
        strings.add("Japanese");
        strings.add("Jap");
        new DFAHashMap().createDFAHashMap(strings);
    }
}

The result is as follows:


{J=
  {a=
    {p=
      {a=
        {n=
          {e=
            {s=
              {e={isEnd=1},
               isEnd=0},
             isEnd=0},
           isEnd=0},
         isEnd=0},
       isEnd=1},
     isEnd=0},
   isEnd=0}}

(Indented for readability; the actual HashMap prints on one line, and its key order is not guaranteed.)
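To check a word against such a nested-map structure, one walks the map character by character and reads the isEnd flag at the final node. A minimal sketch (the fragment below is hand-built for the word “Jap”; the class name TreeWalk is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class TreeWalk {
    public static void main(String[] args) {
        // Hand-built fragment of the nested-map tree: J -> a -> p, with isEnd = 1 at "p"
        HashMap<String, Object> p = new HashMap<>(Map.of("isEnd", 1));
        HashMap<String, Object> a = new HashMap<>(Map.of("isEnd", 0, "p", p));
        HashMap<String, Object> j = new HashMap<>(Map.of("isEnd", 0, "a", a));
        HashMap<String, Object> root = new HashMap<>(Map.of("J", j));

        // Walk the tree one character at a time
        HashMap<String, Object> node = root;
        for (char c : "Jap".toCharArray()) {
            node = (HashMap<String, Object>) node.get(String.valueOf(c));
        }
        System.out.println(node.get("isEnd")); // 1 -> "Jap" is a complete sensitive word
    }
}
```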

4. Putting it into practice

4.1 Sensitive lexicon initialization class

This class mainly reads the sensitive word file and initializes the sensitive lexicon using the DFA algorithm.

  • public HashMap getSensitiveWordHashMap():

    Calls the readSensitiveWordFile() method to obtain the sensitive word set, then calls the initSensitiveHashMap() method to initialize the sensitive lexicon using the DFA algorithm.

  • private Set readSensitiveWordFile():

    Reads the sensitive word file and returns the set of sensitive words.

  • private Map initSensitiveHashMap(Set strings):

    Encapsulates the set of sensitive words into a HashMap, initializing the sensitive lexicon.

import lombok.extern.slf4j.Slf4j;
import org.springframework.core.io.ClassPathResource;

import java.io.*;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Sensitive lexicon initialization
 *
 * @date 2023/06/02
 */
@Slf4j
public class SensitiveWordInitialize {

    // Character encoding of the sensitive word file
    private String ENCODING = "UTF-8";

    /**
     * Sensitive lexicon variable
     */
    private HashMap<String, Object> sensitiveWordHashMap = null;

    /**
     * Access to the sensitive lexicon
     *
     * @return {@link HashMap}<{@link String}, {@link Object}>
     * @throws IOException ioexception
     */
    public HashMap<String, Object> getSensitiveWordHashMap() throws IOException {
        log.info("Sensitive lexicon initialization");
        Set<String> strings = readSensitiveWordFile();
        System.out.println(strings.size());
        sensitiveWordHashMap = (HashMap<String, Object>) initSensitiveHashMap(strings);
        return sensitiveWordHashMap;
    }

    /**
     * Read the sensitive word file
     *
     * @return {@link Set}<{@link String}>
     */
    private Set<String> readSensitiveWordFile() {
        Set<String> wordSet = null;
        ClassPathResource classPathResource = new ClassPathResource("sensitive_words.txt");
        // 1. Get the file byte stream
        InputStream inputStream = null;
        InputStreamReader inputStreamReader = null;
        try {
            inputStream = classPathResource.getInputStream();
            // 2. Convert the byte stream to a character stream
            inputStreamReader = new InputStreamReader(inputStream, ENCODING);
            wordSet = new HashSet<>();
            // 3. Read line by line through a buffered stream
            BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
            String txt = null;
            while ((txt = bufferedReader.readLine()) != null) {
                wordSet.add(txt);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        } finally {
            if (inputStreamReader != null) {
                try {
                    inputStreamReader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (inputStream != null) {
                try {
                    inputStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return wordSet;
    }


    /**
     * Initialize the sensitive lexicon based on the DFA algorithm
     *
     * @param strings set of sensitive words
     */
    private Map initSensitiveHashMap(Set<String> strings) {
        sensitiveWordHashMap = new HashMap<>(strings.size());
        // 1. Create a temporaryHashMap that holds a temporary reference into the tree
        HashMap<String, Object> temporaryHashMap = new HashMap<>();
        // 2. Traverse the list of sensitive words
        for (String string : strings) {
            // 3. For each word, temporaryHashMap starts at the root node sensitiveWordHashMap
            temporaryHashMap = sensitiveWordHashMap;
            // 4. Traverse each character
            for (int i = 0; i < string.length(); i++) {
                String word = String.valueOf(string.charAt(i));
                // 4.1 Check whether the character already exists at this node
                HashMap<String, Object> resultHashMap = (HashMap<String, Object>) temporaryHashMap.get(word);
                // 4.2 If it is null, create a node:
                // 4.3 the current character as the key, a new HashMap as the value
                if (resultHashMap == null) {
                    resultHashMap = new HashMap<String, Object>();
                    temporaryHashMap.put(word, resultHashMap);
                }
                // 5. Move temporaryHashMap to the next HashMap
                temporaryHashMap = resultHashMap;
                // 6. Decide whether to skip setting the flag in this iteration.
                // If temporaryHashMap already contains isEnd = 1, a shorter sensitive word already ends
                // here and must not be overwritten: e.g. "Jap" sets isEnd = 1 at its last node, and if
                // "Japanese" later overwrote it with isEnd = 0, the keyword "Jap" would stop matching.
                if (temporaryHashMap.containsKey("isEnd") && temporaryHashMap.get("isEnd").equals(1)) {
                    continue;
                }
                // 7. Set the flag: 1 if the current character is the last character of the word, otherwise 0
                if (i == string.length() - 1) {
                    temporaryHashMap.put("isEnd", 1);
                } else {
                    temporaryHashMap.put("isEnd", 0);
                }
            }
        }
        return sensitiveWordHashMap;
    }
}

4.2 Sensitive word tools

This class uses the singleton pattern to obtain the sensitive lexicon, and filters and replaces sensitive words in a string.

  • private SensitiveReplaceUtil():

    The constructor calls the sensitive lexicon initialization class to obtain the sensitive lexicon.

  • public static SensitiveReplaceUtil getInstance():

    Uses the singleton pattern to ensure there is only one instance of this class.

  • public String replaceSensitiveWord(String string, int matchType):

    Replaces sensitive words in a string. It first calls the getSensitiveWord() method to obtain the set of sensitive words in the string, then iterates over that set, calling the getReplaceString() method to generate a replacement of matching length and replacing each sensitive word.

  • private String getReplaceString(int length):

    Generates the replacement string according to the length of the sensitive word.

  • public Set getSensitiveWord(String string, int matchType):

    Gets the set of sensitive words. For each position in the input string it calls the getStringLength() method to find the length of a sensitive word starting there, then extracts the sensitive words and adds them to the returned set.

  • public int getStringLength(String string, int beginIndex, int matchType):

    Returns the sensitive word length. The method walks the sensitive lexicon to check whether the text contains a sensitive word starting at the given index; if one exists it returns its length, otherwise 0.

import org.springframework.stereotype.Component;

import java.io.IOException;
import java.util.*;

/**
 * Sensitive word filtering tools
 *
 * @author Xu huaiang
 * @date 2023/06/02
 */
@Component
public class SensitiveReplaceUtil {
    /**
     * Sensitive word filter: uses the DFA algorithm to filter sensitive words
     */
    private static HashMap<String, Object> sensitiveWordHashMap = null;

    /**
     * Singleton instance
     */
    private static SensitiveReplaceUtil instance = null;

    /**
     * Minimum matching rule. E.g. with the lexicon ["Jap", "Japanese"] and the sentence
     * "I am Japanese", the shortest word matches: I am [Jap]anese
     */
    public static int minMatchType = 1;
    /**
     * Maximum matching rule. E.g. with the lexicon ["Jap", "Japanese"] and the sentence
     * "I am Japanese", the longest word matches: I am [Japanese]
     */
    public static int maxMatchType = 2;
    /**
     * Replacement character for sensitive words
     */
    public static String replaceChar = "*";

    /**
     * Initialize the sensitive lexicon in the constructor
     *
     * @throws IOException
     */
    private SensitiveReplaceUtil() throws IOException {
        sensitiveWordHashMap = new SensitiveWordInitialize().getSensitiveWordHashMap();
    }

    /**
     * Get the singleton object
     *
     * @return {@link SensitiveReplaceUtil}
     * @throws IOException ioexception
     */
    public static SensitiveReplaceUtil getInstance() throws IOException {
        if (instance == null) {
            instance = new SensitiveReplaceUtil();
        }
        return instance;
    }


    /**
     * Replace sensitive words in the string
     *
     * @param string string
     * @param matchType match type
     * @return {@link String}
     */
    public String replaceSensitiveWord(String string, int matchType) {
        String resultString = string;
        // 1. Get the set of sensitive words in the current string
        Set<String> sensitiveWord = getSensitiveWord(string, matchType);
        // 2. Iterate over the sensitive word set
        Iterator<String> iterator = sensitiveWord.iterator();
        while (iterator.hasNext()) {
            // 2.1 Get the sensitive word
            String word = iterator.next();
            // 2.2 Create a replacement string of the same length as the sensitive word
            String replaceString = getReplaceString(word.length());
            // 2.3 Replace the sensitive word (replace, not replaceAll, so that regex
            // metacharacters in the word are treated literally)
            resultString = resultString.replace(word, replaceString);
        }
        return resultString;
    }

    /**
     * Create a replacement string based on the length of the sensitive word
     *
     * @param length length
     * @return {@link String}
     */
    private String getReplaceString(int length) {
        StringBuilder replaceString = new StringBuilder();
        // Repeat the replacement character once per character of the sensitive word
        for (int i = 0; i < length; i++) {
            replaceString.append(replaceChar);
        }
        return replaceString.toString();
    }

    /**
     * Get the set of sensitive words in the string
     *
     * @param string string
     * @param matchType match type
     * @return {@link Set}<{@link String}>
     */
    public Set<String> getSensitiveWord(String string, int matchType) {
        Set<String> set = new HashSet<>();
        // 1. Try every position in the string as a possible start of a sensitive word
        for (int i = 0; i < string.length(); i++) {
            int length = getStringLength(string, i, matchType);
            // 2. If the length is greater than 0, a sensitive word starts here; add it to the set
            if (length > 0) {
                set.add(string.substring(i, i + length));
            }
        }
        return set;
    }


    /**
     * Check whether the text contains a sensitive word starting at beginIndex.
     * If one exists, return the length of the sensitive word; otherwise return 0.
     *
     * @param string string
     * @param beginIndex start index
     * @param matchType match type
     * @return int
     */
    public int getStringLength(String string, int beginIndex, int matchType) {
        // 1. Length matched so far, used for accumulation
        int nowLength = 0;
        // 2. Length of the confirmed sensitive word
        int resultLength = 0;
        // 3. Start from the root of the sensitive lexicon
        HashMap<String, Object> temporaryHashMap = sensitiveWordHashMap;
        // 4. Traverse the string from beginIndex
        for (int i = beginIndex; i < string.length(); i++) {
            // 5. Look up the current character in the current node
            String word = String.valueOf(string.charAt(i));
            temporaryHashMap = (HashMap<String, Object>) temporaryHashMap.get(word);
            // 5.1 If it is null, no sensitive word continues here
            if (temporaryHashMap == null) {
                break;
            } else {
                nowLength++;
                // 5.2 Check whether this character ends a sensitive word
                if (temporaryHashMap.get("isEnd").equals(1)) {
                    resultLength = nowLength;
                    // 5.3 Under the minimum matching rule (1), stop at the first complete word;
                    // under the maximum matching rule (2), keep going to find a longer one
                    if (matchType == minMatchType) {
                        break;
                    }
                }
            }
        }
        return resultLength;
    }
}
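Putting build and match together outside of Spring, the flow above can be condensed into a self-contained sketch (class and method names are illustrative, and the words “Jap” and “Japanese” stand in for a real lexicon):

```java
import java.util.HashMap;
import java.util.List;

public class MatchDemo {
    static final int MIN = 1, MAX = 2;

    // Build the nested-map tree, as in the DFAHashMap class above
    static HashMap<String, Object> build(List<String> words) {
        HashMap<String, Object> root = new HashMap<>();
        for (String w : words) {
            HashMap<String, Object> node = root;
            for (int i = 0; i < w.length(); i++) {
                String c = String.valueOf(w.charAt(i));
                HashMap<String, Object> next = (HashMap<String, Object>) node.get(c);
                if (next == null) {
                    next = new HashMap<>();
                    node.put(c, next);
                }
                node = next;
                // Never overwrite isEnd = 1 set by a shorter word
                if (Integer.valueOf(1).equals(node.get("isEnd"))) {
                    continue;
                }
                node.put("isEnd", i == w.length() - 1 ? 1 : 0);
            }
        }
        return root;
    }

    // Length of the sensitive word starting at beginIndex, or 0 if none
    static int match(HashMap<String, Object> tree, String text, int beginIndex, int matchType) {
        int now = 0, result = 0;
        HashMap<String, Object> node = tree;
        for (int i = beginIndex; i < text.length(); i++) {
            node = (HashMap<String, Object>) node.get(String.valueOf(text.charAt(i)));
            if (node == null) {
                break; // no sensitive word continues here
            }
            now++;
            if (Integer.valueOf(1).equals(node.get("isEnd"))) {
                result = now;
                if (matchType == MIN) {
                    break; // minimum match: stop at the first complete word
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        HashMap<String, Object> tree = build(List.of("Jap", "Japanese"));
        String text = "I am Japanese";
        System.out.println(match(tree, text, 5, MIN)); // 3 -> "Jap"
        System.out.println(match(tree, text, 5, MAX)); // 8 -> "Japanese"
    }
}
```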

4.3 Testing

Create a request mapping at the Controller layer to simulate user query requests.

import com.vector.score.utils.SensitiveReplaceUtil;
import lombok.extern.slf4j.Slf4j;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import java.io.IOException;

@Slf4j
@RestController
@RequestMapping("/sensitive")
public class SensitiveTestController {

    @GetMapping("/test")
    public void test(@RequestParam("string") String string) throws IOException {
        String word = SensitiveReplaceUtil.getInstance().replaceSensitiveWord(string, 2);
        log.info("The string before replacement is: " + string);
        log.info("The string after replacement is: " + word);
    }
}


View the console print:

[Console output: the log shows the string before and after replacement]