Implementing text duplication checking (similarity) in Java without third-party tools

Functional background:

As business records grow, duplicate project names and duplicate content gradually appear, which lowers the quality of project records. To avoid this, we decided to run duplication checks on key data. We originally planned to use a third-party duplication-checking interface, but the process is cumbersome and requires time for business integration, so for now we implement the check inside our own system.

Of course, implementing a full plagiarism-checking system like CNKI's paper checker is far too involved (it could justify an independent system of its own), so we implement a simple version that checks names and large text fields for duplicates.

Duplicate checking logic:

To implement a duplication-check function, it helps to first look at how other, standard projects implement the logic.

The usual approach is:
  1. Split text: Split the text to be checked according to certain rules (such as spaces, punctuation, etc.) to obtain words, phrases or sentences in the text.
  2. Remove stop words: Remove some common but meaningless words (such as “the”, “a”, “an”, “and”, etc.) from the segmented words to reduce misjudgments and improve efficiency.
  3. Calculate similarity: determine whether there is duplicate content by comparing the similarity between the text to be checked and the existing text. Similarity calculation can be based on words, phrases or sentences. Commonly used algorithms include cosine similarity, Jaccard similarity, edit distance, etc.
  4. Set threshold: Set a threshold based on actual needs to determine whether the similarity between the text to be checked and the existing text exceeds the threshold. If it is exceeded, it is considered duplicate content.
  5. Return results: Return the judgment results to the user, usually including whether there is duplicate content, duplication rate and other information.
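The five steps above can be sketched in a few lines of Java. This is only an illustration, not production code: the stop-word list and the 0.8 threshold are arbitrary choices for the example, and cosine similarity here runs on plain term-frequency counts.

```java
import java.util.*;

public class PipelineSketch {
    // Illustrative stop-word list; a real system would use a much fuller one
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "of", "to");

    // Steps 1-2: split on non-alphanumerics, drop stop words, count term frequencies
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tf.merge(t, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Step 3: cosine similarity between two term-frequency vectors
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Steps 4-5: compare against a threshold (0.8 is arbitrary here)
    public static boolean isDuplicate(String s1, String s2, double threshold) {
        return cosine(termFrequencies(s1), termFrequencies(s2)) >= threshold;
    }
}
```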
The other common approach is:

After reading the text data, a SimHash value is computed for each text. The Hamming distance between the text being checked and each original text is then calculated from the SimHash values, from which the similarity and duplication rate are derived.

Let’s briefly talk about the principle:
  1. SimHash is a text feature extraction method that maps text to a fixed-length binary sequence, called the SimHash value. The principle is to compute a hash for each keyword in the text, then weight and sum these hashes to obtain an n-bit binary sequence.
  2. Suppose we have two pieces of text A and B. We can calculate their SimHash values respectively, and then determine their similarity by calculating their Hamming distance. Hamming distance refers to the number of different bits at corresponding positions in two binary sequences. We can calculate the Hamming distance using the following formula:
  3. hamming_distance = sum(1 for x, y in zip(a, b) if x != y)  (Python-style pseudocode)
  4. Where a and b are the two SimHash values. If SimHash produces a 128-bit binary sequence, the Hamming distance between two values lies between 0 and 128. The smaller the Hamming distance, the more similar the two texts are in content, and vice versa. We can set a threshold on the Hamming distance; for example, if it is less than 10, we consider the two texts similar. In practice the threshold must be chosen for the specific application scenario.
  5. For text duplication checking tasks, we can first calculate the SimHash values of all texts, and then compare the Hamming distance of each pair of SimHash values to obtain the similarity between each pair of texts. After completing this process, we can choose a similarity threshold, such as 90%, to determine which texts are duplicates and which are not.
  6. Specifically, for a collection of n texts, we compute n SimHash values and then n*(n-1)/2 pairwise Hamming distances. Finally, we output all text pairs whose Hamming distance is below the chosen threshold; those pairs are considered duplicates.
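The formula above translates compactly to Java: for fixed-width 64-bit SimHash values, XOR the two values and count the 1 bits. The threshold of 10 mirrors the example in the text; the hash values themselves are made up for illustration.

```java
public class HammingDemo {
    // Hamming distance between two 64-bit SimHash values:
    // XOR leaves a 1 bit wherever the inputs differ, then count the 1 bits
    public static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Threshold check as described above (e.g. distance < 10 means "similar")
    public static boolean isSimilar(long a, long b, int threshold) {
        return hammingDistance(a, b) < threshold;
    }
}
```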

Interface logic:

The process of string duplication checking can generally be divided into four steps:

  1. Read the target string and underlying record string.
  2. Preprocess them, such as converting all letters to lowercase, removing spaces and special symbols, etc.
  3. Apply string duplication checking algorithm to compare and obtain their similarity.
  4. Select the maximum similarity as their final similarity result.

When preprocessing the target string and the base record string, you need to pay attention to choosing an appropriate way to normalize these strings, such as letter case, spaces, and special symbols. This helps reduce the differences between strings and improve the accuracy and efficiency of string duplication checking.

The choice of algorithm can be decided according to the specific application scenario. Common algorithms include SimHash algorithm, Jaccard similarity algorithm, Levenshtein distance algorithm, etc. These algorithms have their own advantages and disadvantages and need to be selected and tuned according to the actual situation.

Interface code:

/**
 * @Author Jiangfy
 * @Description String duplication rate calculation
 * @Param
 * @return
 **/
@Override
public ResultBean checkDuplicateRate(ServiceContext serviceContext, String targetStr) {
    StringUtils.isBlankAssert(targetStr, "The target string for duplicate checking cannot be blank");
    ResultBean resultBean = new ResultBean();

    // 1. Query all current project name data
    List<String> projectNameList = new ArrayList<>();
    ResultBean resultBeanProjName = projectMapper.queryProjectByCond();
    JSONArray dataArray = resultBeanProjName.getJSONArray("resultset");
    for (Object obj : dataArray) {
        JSONObject jsonObj = (JSONObject) obj;
        String projectName = jsonObj.getString("name");
        projectNameList.add(projectName);
    }

    // 2. Preprocess the target name and the base data: strip everything except
    // letters, digits, and CJK characters, then lowercase
    String finalTargetStr = StringUtils.replaceAll(targetStr, "[^a-zA-Z0-9\\u4e00-\\u9fa5]", "").toLowerCase();
    List<String> processedList = projectNameList.stream()
            .map(str -> StringUtils.replaceAll(str, "[^a-zA-Z0-9\\u4e00-\\u9fa5]", "").toLowerCase())
            .collect(Collectors.toList());

    // 3. Use algorithms to circularly compare basic data
    AtomicReference<Double> maxSimilarity = new AtomicReference<>(0.0);
    processedList.parallelStream().forEach(name -> {
        double sim = Util.similarity(finalTargetStr, name, false);
        maxSimilarity.updateAndGet(current -> Double.compare(sim, current) > 0 ? sim : current);
    });

    // Format as a percentage with two decimal places (multiply before rounding
    // so the percent value keeps two decimals)
    DecimalFormat df = new DecimalFormat("0.00");
    double roundedValue = Double.parseDouble(df.format(maxSimilarity.get() * 100));

    // 4. Get the final result with the highest similarity value
    resultBean.setSuccess();
    resultBean.addParam("maxSimilarity", roundedValue);
    return resultBean;
}

Similarity calculation utility method:

/**
 * @Author Jiangfy
 * @Description Levenshtein distance algorithm to calculate the similarity between two strings
 * @Param [strA, strB, isRegex]
 * @return double
 **/
public static double similarity(String strA, String strB, boolean isRegex) {
    // Remove everything except letters, digits, and CJK characters
    if (isRegex) {
        strA = strA.replaceAll("[^a-zA-Z0-9\\u4e00-\\u9fa5]", "");
        strB = strB.replaceAll("[^a-zA-Z0-9\\u4e00-\\u9fa5]", "");
    }

    // Calculate the Levenshtein distance with dynamic programming
    int[][] distance = new int[strA.length() + 1][strB.length() + 1];
    for (int i = 0; i <= strA.length(); i++) {
        distance[i][0] = i;
    }
    for (int j = 0; j <= strB.length(); j++) {
        distance[0][j] = j;
    }
    for (int i = 1; i <= strA.length(); i++) {
        for (int j = 1; j <= strB.length(); j++) {
            if (strA.charAt(i - 1) == strB.charAt(j - 1)) {
                distance[i][j] = distance[i - 1][j - 1];
            } else {
                distance[i][j] = Math.min(distance[i - 1][j] + 1,
                        Math.min(distance[i][j - 1] + 1, distance[i - 1][j - 1] + 1));
            }
        }
    }

    // Calculate similarity
    int maxLen = Math.max(strA.length(), strB.length());
    return (maxLen - distance[strA.length()][strB.length()]) / (double) maxLen;
}

Text plagiarism checking usually includes the following steps:

  1. Obtain target text and basic duplication check data.
  2. Preprocess the text, including word segmentation, sentence segmentation, lowercasing, and space removal.
  3. Use algorithms (such as Jaccard) to calculate the similarity between texts.
  4. The maximum similarity is selected as their final similarity result.

When preprocessing, appropriate methods should be selected to standardize text according to the actual situation in order to improve the accuracy of duplication checking. In terms of algorithm selection, it also needs to be evaluated and tuned according to specific application scenarios to achieve the best results.

Interface code:

 /**
 * @Author Jiangfy
 * @Description Check text duplication rate
 * @Param targetStr
 * @return
 **/
@Override
public ResultBean calculateDuplicationRate(ServiceContext serviceContext, String targetStr){
    StringUtils.isBlankAssert(targetStr, "The target string for duplicate checking cannot be blank");
    ResultBean resultBean = new ResultBean();
    //1. Query the abstract text of all current projects
    List<String> abstractStrList = new ArrayList<>();
    ResultBean resultBeanAbstract = projectServiceMapper.queryProjectByCond();
    JSONArray dataArray = resultBeanAbstract.getJSONArray("resultset");
    for (Object obj : dataArray) {
        JSONObject jsonObj = (JSONObject) obj;
        String abstractStr = jsonObj.getString("abstract");
        abstractStrList.add(abstractStr);
    }
    // Use removeIf and StringUtils to remove blank strings
    abstractStrList.removeIf(StringUtils::isBlank);
    //2. Segment sentences based on punctuation marks. The original plan was to use a third-party word-segmentation tool; for now we simply split on punctuation.
    Set<String> finalTargetStrSet = TechUtil.splitSentences(targetStr);
    Set<Set<String>> processedSet = TechUtil.splitSentences(new HashSet<>(abstractStrList));
    //todo Consider putting the processedSet results into the cache and setting the expiration time to avoid repeated database queries.
    //3. Use the algorithm to circularly compare the two sets of text basic data.
    AtomicReference<Double> maxSimilarity = new AtomicReference<>(0.0);
    processedSet.parallelStream().forEach(set -> {
        double sim = Util.calculateJaccardSimilarity(finalTargetStrSet, set);
        maxSimilarity.updateAndGet(current -> Double.compare(sim, current) > 0 ? sim : current);
    });
    // Format as a percentage with two decimal places (multiply before rounding
    // so the percent value keeps two decimals)
    DecimalFormat df = new DecimalFormat("0.00");
    double roundedValue = Double.parseDouble(df.format(maxSimilarity.get() * 100));
    //4. Get the final result with the highest similarity value
    resultBean.setSuccess();
    resultBean.addParam("maxSimilarity",roundedValue);
    return resultBean;
}

Text segmentation tool methods:

/**
 * @Author Jiangfy
 * @Description Text string is divided into sentences according to punctuation marks
 * @Param [text]
 * @return java.util.Set<java.lang.String>
 **/
public static Set<String> splitSentences(String text) {
    // Define regular expressions to match punctuation marks in sentences
    String regex = "[.!?,;:。，；：？！、""''《》（）()\\[\\]{}]";

    // Use regular expressions to segment the string
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(text);
    int start = 0;
    Set<String> sentences = new HashSet<>();

    while (matcher.find()) {
        String sentence = text.substring(start, matcher.start()).trim();
        if (!sentence.isEmpty()) {
            sentences.add(sentence.toLowerCase().replaceAll("\\s+", ""));
        }
        start = matcher.end();
    }

    // Process the last sentence
    String lastSentence = text.substring(start).trim();
    if (!lastSentence.isEmpty()) {
        sentences.add(lastSentence.toLowerCase().replaceAll("\\s+", ""));
    }

    return sentences;
}
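A condensed, standalone sketch of the splitter above; the punctuation class here is trimmed to a few common ASCII and fullwidth marks for readability.

```java
import java.util.*;

public class SplitDemo {
    // Condensed sentence splitter: split on common ASCII and fullwidth
    // punctuation, lowercase each piece, and strip internal whitespace
    public static Set<String> splitSentences(String text) {
        Set<String> sentences = new HashSet<>();
        for (String part : text.split("[.!?,;:。，；：？！]")) {
            String s = part.toLowerCase().replaceAll("\\s+", "");
            if (!s.isEmpty()) sentences.add(s);
        }
        return sentences;
    }
}
```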


/**
 * @Author Jiangfy
 * @Description Text collection is segmented according to punctuation marks
 * @Param [inputSet]
 * @return java.util.Set<java.util.Set<java.lang.String>>
 **/
public static Set<Set<String>> splitSentences(Set<String> inputSet) {
    return inputSet.stream()
            .map(str -> Arrays.stream(str.split("[.!?,;:。，；：？！、""''《》（）()\\[\\]{}]"))
                    .map(String::toLowerCase)
                    .map(sentence -> sentence.replaceAll("\\s+", "")) // Remove spaces
                    .filter(sentence -> !sentence.isEmpty()) // Drop empty fragments so they don't inflate the Jaccard score
                    .collect(Collectors.toSet()))
            .collect(Collectors.toSet());
}

Jaccard’s similarity calculation method:

/**
 * @Author Jiangfy
 * @Description Jaccard calculates similarity double
 * @Param [set1, set2]
 * @return double
 **/
public static double calculateJaccardSimilarity(Set<String> set1, Set<String> set2) {
    if (set1.isEmpty() && set2.isEmpty()) {
        return 1.0; // Both empty: similarity is 1
    }
    if (set1.isEmpty() || set2.isEmpty()) {
        return 0.0; // Exactly one empty: no overlap
    }
    Set<String> intersection = new HashSet<>(set1);
    intersection.retainAll(set2); // Calculate intersection

    Set<String> union = new HashSet<>(set1);
    union.addAll(set2); // Calculate union

    return (double) intersection.size() / union.size();
}
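A quick worked example of the method above, condensed into a standalone class: with set1 = {a, b, c} and set2 = {b, c, d}, the intersection has 2 elements and the union has 4, giving a similarity of 2 / 4 = 0.5.

```java
import java.util.*;

public class JaccardDemo {
    // Same logic as the method above, condensed for a standalone check
    public static double jaccard(Set<String> set1, Set<String> set2) {
        if (set1.isEmpty() && set2.isEmpty()) return 1.0; // both empty: identical
        Set<String> intersection = new HashSet<>(set1);
        intersection.retainAll(set2); // elements common to both sets
        Set<String> union = new HashSet<>(set1);
        union.addAll(set2); // all distinct elements across both sets
        return (double) intersection.size() / union.size();
    }
}
```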