Self-managed sensitive word filtering: a utility class based on a finite automaton (the DFA algorithm)


1. Principle and implementation of the self-managed sensitive word lexicon

Media companies, for example, usually maintain their own sensitive word database on each platform in addition to using a third-party sensitive word review service.

1. Implementation process:

2. Usage steps

1. Algorithm implementation: introducing the utility class

The code is as follows (example). It can be copied and used as is; the implementation depends only on the JDK.

package com.ciels.utils.common;


import java.util.*;

public class SensitiveWordUtil {

    public static Map<String, Object> dictionaryMap = new HashMap<>();


    /**
     * Build the keyword dictionary.
     * @param words the sensitive words to load
     */
    public static void initMap(Collection<String> words) {
        if (words == null) {
            System.out.println("The sensitive word list cannot be null");
            return;
        }

        // Initial capacity is words.size(). The number of top-level entries is at most words.size(), because different words may share the same first character.
        Map<String, Object> map = new HashMap<>(words.size());
        // Node for the current level during traversal
        Map<String, Object> curMap = null;
        Iterator<String> iterator = words.iterator();

        while (iterator.hasNext()) {
            String word = iterator.next();
            curMap = map;
            int len = word.length();
            for (int i = 0; i < len; i++) {
                // Loop through the characters of each word
                String key = String.valueOf(word.charAt(i));
                // Look up the current character in the current level; create a new node if it is missing, then descend into that node for the next character.
                Map<String, Object> wordMap = (Map<String, Object>) curMap.get(key);
                if (wordMap == null) {
                    // Each node holds two kinds of entries: child nodes and isEnd (the end-of-word flag)
                    wordMap = new HashMap<>(2);
                    wordMap.put("isEnd", "0");
                    curMap.put(key, wordMap);
                }
                curMap = wordMap;
                // If the current character is the last character of the word, set the isEnd flag to 1
                if (i == len - 1) {
                    curMap.put("isEnd", "1");
                }
            }
        }

        dictionaryMap = map;
    }

    /**
     * Check whether a sensitive word starts at the given position in the text.
     * @param text the text to scan
     * @param beginIndex the index at which matching starts
     * @return the length of the longest matching word, or 0 if there is none
     */
    @SuppressWarnings("unchecked")
    private static int checkWord(String text, int beginIndex) {
        if (dictionaryMap == null) {
            throw new RuntimeException("Dictionary cannot be null");
        }
        // Length of the longest complete word matched so far
        int matchedLength = 0;
        // Number of characters consumed along the current path
        int wordLength = 0;
        Map<String, Object> curMap = dictionaryMap;
        int len = text.length();
        // Match starting from beginIndex of the text
        for (int i = beginIndex; i < len; i++) {
            String key = String.valueOf(text.charAt(i));
            // Move to the child node for the current character
            curMap = (Map<String, Object>) curMap.get(key);
            if (curMap == null) {
                break;
            }
            wordLength++;
            // Record the length only at an end node, so that a partial match
            // past the last complete word is not counted as part of the word
            if ("1".equals(curMap.get("isEnd"))) {
                matchedLength = wordLength;
            }
        }
        return matchedLength;
    }

    /**
     * Get the matched keywords and the number of hits for each.
     * @param text the text to scan
     * @return a map from each matched keyword to its hit count
     */
    public static Map<String, Integer> matchWords(String text) {
        Map<String, Integer> wordMap = new HashMap<>();
        int len = text.length();
        for (int i = 0; i < len; i++) {
            int wordLength = checkWord(text, i);
            if (wordLength > 0) {
                String word = text.substring(i, i + wordLength);
                // Increment the hit count for this keyword
                if (wordMap.containsKey(word)) {
                    wordMap.put(word, wordMap.get(word) + 1);
                } else {
                    wordMap.put(word, 1);
                }

                // Skip past the matched word; the loop's i++ adds the final step
                i += wordLength - 1;
            }
        }
        return wordMap;
    }

    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        list.add("Falun");
        list.add("Falun Gong");
        list.add("methamphetamine");
        initMap(list);
        String content = "I am a good person, I do not sell methamphetamine, nor do I practice Falun Gong. I really do not sell methamphetamine";
        Map<String, Integer> map = matchWords(content);
        System.out.println(map);
    }
}
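
The same dictionary can also be written with a typed node class instead of nested Map<String, Object> values, which avoids the unchecked casts and the "isEnd" string flag. The following is only a minimal sketch of that idea, not the utility above: the names TrieSketch, Node, and longestMatch are hypothetical, and the sample words are placeholders.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical typed variant of the dictionary; not part of the original utility.
class TrieSketch {
    /** One state of the automaton: outgoing transitions plus an end-of-word flag. */
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean end;
    }

    private final Node root = new Node();

    TrieSketch(Iterable<String> words) {
        for (String word : words) {
            Node cur = root;
            for (int i = 0; i < word.length(); i++) {
                cur = cur.next.computeIfAbsent(word.charAt(i), c -> new Node());
            }
            if (cur != root) {
                cur.end = true; // an empty string never marks the root as a word
            }
        }
    }

    /** Length of the longest dictionary word starting at begin, or 0 if none. */
    int longestMatch(String text, int begin) {
        Node cur = root;
        int matched = 0;
        for (int i = begin; i < text.length(); i++) {
            cur = cur.next.get(text.charAt(i));
            if (cur == null) {
                break;
            }
            if (cur.end) {
                matched = i - begin + 1;
            }
        }
        return matched;
    }

    /** Counts each matched word, scanning left to right without overlaps. */
    Map<String, Integer> matchWords(String text) {
        Map<String, Integer> hits = new LinkedHashMap<>();
        int i = 0;
        while (i < text.length()) {
            int len = longestMatch(text, i);
            if (len > 0) {
                hits.merge(text.substring(i, i + len), 1, Integer::sum);
                i += len;
            } else {
                i++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch(Arrays.asList("ab", "abcd", "xyz"));
        // "abc" only matches "ab": the path continues to 'c' but no word ends there
        System.out.println(trie.matchWords("abc abcd xyz xyz")); // {ab=1, abcd=1, xyz=2}
    }
}
```

Each character of the text is looked up in a single map per step, so the cost of a check depends on the text length and the longest word, not on how many words the dictionary holds.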

2. How to use the utility class

Because sensitive words are checked very frequently, we cache the self-managed sensitive words in Redis to improve efficiency. When the self-managed sensitive words are modified in the database, the data in the Redis cache needs to be updated at the same time. (That synchronization is not implemented here.)
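
One way to keep the cache in step with the database is to mirror every write into both. The following is a minimal sketch under that assumption. The class SensitiveWordStoreSketch and its two in-memory sets are hypothetical stand-ins for the real mapper and the Redis set, so it shows only the order of operations, not real persistence code.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins: in a real project the "database" would be a mapper
// and the "cache" a Redis set; both are plain in-memory sets here.
class SensitiveWordStoreSketch {
    private final Set<String> database = ConcurrentHashMap.newKeySet();
    private final Set<String> cache = ConcurrentHashMap.newKeySet();

    /** Adds a word to the database, then mirrors the change into the cache. */
    void addWord(String word) {
        if (database.add(word)) {
            cache.add(word);
        }
    }

    /** Removes a word from the database, then from the cache. */
    void removeWord(String word) {
        if (database.remove(word)) {
            cache.remove(word);
        }
    }

    /** Reads go to the cache first and fall back to the database when it is empty. */
    Set<String> loadWords() {
        if (cache.isEmpty()) {
            cache.addAll(database);
        }
        return new TreeSet<>(cache); // sorted copy, so callers cannot mutate the cache
    }

    public static void main(String[] args) {
        SensitiveWordStoreSketch store = new SensitiveWordStoreSketch();
        store.addWord("foo");
        store.addWord("bar");
        System.out.println(store.loadWords()); // [bar, foo]
        store.removeWord("foo");
        System.out.println(store.loadWords()); // [bar]
    }
}
```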

The code is as follows (example). Parameters:

String text: the text to be reviewed.
WmNews wmNews: the article object to be reviewed. If the review fails, its status (the article publishing status) and reason (why the review failed) are updated.

private boolean scanSensitive(String text, WmNews wmNews) {
    // Get the full list of self-managed sensitive words. Querying the database every time is slow, so a Redis set is used.
    // members() returns every element of the set
    Set<String> sensitives = redisTemplate.opsForSet().members("wmnews:sensitive");
    if (CollectionUtils.isEmpty(sensitives)) {
        // Cache miss: query the database
        List<WmSensitive> wmSensitivesList = wmSensitiveMapper.selectList(null);
        if (CollectionUtils.isEmpty(wmSensitivesList)) {
            // No sensitive words are configured, so the review passes
            return true;
        }
        // Convert wmSensitivesList into a set of sensitive words
        sensitives = wmSensitivesList.stream().map(WmSensitive::getSensitives).collect(Collectors.toSet());
        // Add to the Redis cache. add() takes varargs, so convert the set of sensitive words to an array first.
        String[] array = sensitives.toArray(new String[sensitives.size()]);
        redisTemplate.opsForSet().add("wmnews:sensitive", array);
    }
    // Review with the self-managed sensitive words: check whether the text contains any of them.
    // The DFA (finite automaton) in the utility class avoids calling str.contains() once per word, which improves efficiency.
    if (CollectionUtils.isEmpty(SensitiveWordUtil.dictionaryMap)) {
        // Optimization: the dictionary is not rebuilt on every call. It must be reloaded after the sensitive words are modified.
        SensitiveWordUtil.initMap(sensitives);
    }
    Map<String, Integer> resultMap = SensitiveWordUtil.matchWords(text);
    if (resultMap.size() > 0) {
        // Sensitive words were found, so the review fails. Update the status and reason of the WmNews object.
        wmNews.setStatus(WmNews.Status.FAIL.getCode());
        wmNews.setReason("There are sensitive words in the text: " + resultMap.keySet());
        wmNewsMapper.updateById(wmNews);
        return false;
    }
    return true;
}
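
The code above builds the dictionary once and notes that it must be reloaded after the sensitive words change. A minimal sketch of such a holder follows, assuming the word list can be re-read on demand. The class ReloadableDictionarySketch and its loader parameter are hypothetical, and a plain Set<String> stands in for the DFA dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical holder; the Set<String> stands in for the DFA dictionary.
class ReloadableDictionarySketch {
    private final Supplier<Collection<String>> loader;
    private volatile Set<String> dictionary; // null until first use

    ReloadableDictionarySketch(Supplier<Collection<String>> loader) {
        this.loader = loader;
    }

    /** Returns the current dictionary, building it on first access only. */
    Set<String> get() {
        Set<String> local = dictionary;
        if (local == null) {
            synchronized (this) {
                if (dictionary == null) {
                    dictionary = new HashSet<>(loader.get());
                }
                local = dictionary;
            }
        }
        return local;
    }

    /** Call after the word list changes so that later checks see the new words. */
    synchronized void reload() {
        dictionary = new HashSet<>(loader.get());
    }

    public static void main(String[] args) {
        List<String> source = new ArrayList<>(Arrays.asList("foo"));
        ReloadableDictionarySketch holder = new ReloadableDictionarySketch(() -> source);
        System.out.println(holder.get().contains("foo")); // true
        source.add("bar");
        System.out.println(holder.get().contains("bar")); // false: still the old snapshot
        holder.reload();
        System.out.println(holder.get().contains("bar")); // true
    }
}
```

In this sketch the code path that modifies the sensitive words would call reload(); without that call, checks keep using the snapshot built on first access.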

3. Examples of tables used:

4. Summary of the APIs used

1. CollectionUtils.isEmpty(collection); // Checks whether the collection is null or empty
2. SetOperations<String, String> set = redisTemplate.opsForSet(); // Gets the object used to operate on Redis sets
   Set<String> setMembers = set.members("wmnews:sensitive"); // Gets all members of the set stored under the key
   String[] strings = sensitives.toArray(new String[sensitives.size()]);
   set.add("wmnews:sensitive", strings); // An array can be passed when adding to a set:
   // add(String key, String... values) takes varargs, so an array is accepted directly
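
The varargs point can be checked in plain Java without Redis. In the sketch below, the add method is a hypothetical stand-in with the same shape as the set's add(key, values...); it is not the Redis API itself.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class VarargsSketch {
    // Hypothetical method with a varargs parameter, like add(key, values...).
    // Returns how many values were newly added to the target set.
    static int add(Set<String> target, String... values) {
        int added = 0;
        for (String value : values) {
            if (target.add(value)) {
                added++;
            }
        }
        return added;
    }

    public static void main(String[] args) {
        Set<String> words = new LinkedHashSet<>(Arrays.asList("a", "b"));
        String[] array = words.toArray(new String[0]);

        Set<String> target = new LinkedHashSet<>();
        System.out.println(add(target, array));    // 2: the array fills the varargs parameter
        System.out.println(add(target, "b", "c")); // 1: "b" is already present
        System.out.println(target);                // [a, b, c]
    }
}
```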