Calculate text similarity and output the n highest similarities

Contents

  • Configuration
    • Create a virtual environment
    • Download
  • TF-IDF
    • Concept
    • Code
    • Results
  • word2vec
    • Concept
    • Model
    • Code
    • Results
  • SpaCy
    • Concept
    • Model
    • Code
    • Results
  • Bert
    • Concept
    • Model
    • Code
    • Results
  • Comparison

Configuration

Create a virtual environment

python3.9

conda create -n py39 python=3.9
conda activate py39

Download

pip install -r D:\myfile\jpy\py\000rec\install\requirements.txt
cx-Oracle==8.3.0
pandas==2.1.1
jieba==0.42.1
joblib==1.2.0
gensim==4.3.0
scikit-learn==1.3.0
tqdm==4.65.0
sqlalchemy==2.0.21
spacy==3.5.3
zeep==4.2.1
transformers==4.32.1
torch==2.0.1
widgetsnbextension==4.0.9
ipywidgets==8.1.1
ipykernel==6.25.0
python -m spacy download zh_core_web_sm
or
pip install D:\myfile\jpy\py\000rec\install\zh_core_web_sm-3.5.0.tar.gz

https://github.com/explosion/spacy-models/releases?q=zh_core_web_sm&expanded=true

python -m ipykernel install --user --name py39

TF-IDF

Concept

TF-IDF is a common technique for text data analysis and information retrieval. Its full name is “Term Frequency-Inverse Document Frequency”. It uses statistical methods to evaluate the importance of each word in a text and is mainly used in the following two areas:

  1. Text retrieval: TF-IDF is used to determine the relevance of documents under given query conditions, thereby assisting search engines in ranking search results. It analyzes the importance of the words in the query within the document to improve the ranking of relevant documents.

  2. Text Mining: TF-IDF can also be used for text mining tasks, such as document classification, clustering and keyword extraction. It helps identify the most important words in a document and helps in understanding the document content and topic.

The calculation of TF-IDF involves two core concepts:

  • Term Frequency (TF): measures how often a word appears in a document. Words that appear more often have higher TF values.

  • Inverse Document Frequency (IDF): considers the distribution of a word across the entire document collection, reducing the weight of common words and increasing the weight of rare words.

TF-IDF is calculated by multiplying the TF value of a word by its IDF value to obtain the importance score of the word in the document collection. Generally, the higher the TF-IDF score, the greater the importance of the word in the document. The formula of TF-IDF is as follows:

TF-IDF = TF × IDF

TF-IDF is often used for text preprocessing and feature engineering so that machine learning algorithms can better process text data. It is one of the very useful tools in text analysis and information retrieval.
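As a minimal numeric illustration of the formula above, the sketch below computes raw TF and IDF by hand for a tiny made-up corpus (scikit-learn's TfidfVectorizer, used later, applies a smoothed IDF, so its numbers differ slightly):

import math

# Toy corpus: three already-tokenized "documents" (invented for illustration)
docs = [["keyword", "matching", "text"],
        ["text", "processing", "task"],
        ["word", "vector", "text"]]

def tf(term, doc):
    # Term frequency: occurrences of the term divided by the document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log of (total documents / documents containing the term)
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

# "text" appears in every document, so its IDF (and hence TF-IDF) is 0;
# "keyword" appears in only one document, so it gets a higher weight.
for term in ["text", "keyword"]:
    print(term, tf(term, docs[0]) * idf(term, docs))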

Why do we need to calculate IDF (Inverse Document Frequency) after calculating TF (Term Frequency)?

  1. Emphasis on importance: TF measures the frequency of a word in a single document, but it is not enough to accurately assess the importance of a word in an entire collection of documents. Common words may appear frequently in a single document, but they do not have special importance in the entire collection of documents. By considering the distribution of words in the document collection, IDF increases the weight of rare words and reduces the weight of common words, thereby better reflecting the importance of words.

  2. Reducing noise: Text data often contains a lot of noise, such as stop words (such as “and”, “the”, etc.) and common words, which appear in most documents. TF-IDF can reduce the impact of these noises because they usually have low IDF values and therefore have lower weight in the TF-IDF calculation.

  3. Adapt to different document collections: The flexibility of TF-IDF enables it to adapt to document collections in different fields or topics. Different fields may have different common words and important words. IDF can adjust the weight of words according to the characteristics of the document collection to make it more suitable for specific fields or topics.

  4. Feature selection: In text analysis tasks, TF-IDF can be used to select the most representative feature words. Words with high TF-IDF values are generally more suitable for use as features because they are more likely to be related to the topic or category of the document, helping to improve the performance of text classification, clustering, and information retrieval.

Overall, the combination of TF and IDF enables TF-IDF to more accurately measure the importance of words, emphasize key words, reduce noise interference, adapt to different document collections, and be used for feature selection to improve the quality and effect of text data analysis.
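To see the noise-reduction point in practice, the learned IDF weights can be inspected through TfidfVectorizer's idf_ attribute after fitting; a word that occurs in every document (here the invented example word "the") receives the lowest IDF and therefore contributes the least to the final score:

from sklearn.feature_extraction.text import TfidfVectorizer

# Small invented corpus; "the" appears in every document, the other words in only one
corpus = ["the weather is nice",
          "the keyword matching task",
          "the word vector model"]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# Map each vocabulary term to its learned IDF weight and sort ascending
idf_weights = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
print(sorted(idf_weights.items(), key=lambda kv: kv[1]))  # "the" gets the lowest IDF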

Code

import jieba
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# sentence
df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'Sentences':["The weather is really nice today, the sun is shining brightly.",
                                "Keyword matching is a common text processing task.",
                                "The computer does not understand human language and needs to be converted into word vectors.",
                                "Prosperity, democracy, civilization, harmony, freedom, equality, justice, rule of law, patriotism, dedication, integrity, and friendliness.",
                                "Chinese word segmentation tools are very helpful for text processing.",]})

# Input query sentence
query_sentence = "Keyword matching and text processing tasks"

# Segment words and create TF-IDF feature vectors
def preprocess(text):
    words = jieba.lcut(text)
    return " ".join(words)

df["Preprocessed_Sentence"] = df["Sentences"].apply(preprocess)
query_sentence = preprocess(query_sentence)

# Calculate similarity
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(list(df["Preprocessed_Sentence"]) + [query_sentence])
similarities = cosine_similarity(tfidf_matrix)
query_similarity = similarities[-1, :-1]  # the last row corresponds to the query sentence

# Get the n sentences with the highest similarity
n = 10  # the number of similar sentences you want to get

result_df = df.copy()
result_df.set_index('id', inplace=True)
result_df['Similarity_Score'] = query_similarity
result_df = result_df.sort_values(by='Similarity_Score', ascending=False)#.reset_index(drop=True)
print('Similarity sorting:', result_df.index.to_list())
result_df[['Sentences', 'Similarity_Score']][:n]

Results

word2vec

Concept

Word2Vec is a word embedding model. Its core principle is to map words into a continuous real-valued vector space so that semantic information about words can be captured in that space. There are two main variants of Word2Vec: Continuous Bag of Words (CBOW) and Skip-gram. Their basic principles are introduced below:

  1. Continuous Bag of Words (CBOW):

    • In the CBOW model, given a target word, the model attempts to predict the target word from its surrounding context words.
    • The input of CBOW is the average of the word embedding vectors of the context words within a context window.
    • The output of the model is the word embedding vector of the target word.
    • During training, the model learns word embeddings by maximizing the conditional probability of the target word given its context words.
  2. Skip-gram:

    • In the Skip-gram model, as opposed to CBOW, given a target word, the model tries to predict the context words around it.
    • The input of Skip-gram is the word embedding vector of the target word.
    • The output of the model is the word embedding vector of the context words surrounding the target word.
    • During training, the model learns word embeddings for words by maximizing the conditional probability of predicting context words around the target word.

The training process of Word2Vec usually uses large amounts of text data. During training, the model adjusts the word embedding vector of a word so that it can better predict the context word or the target word. After training, these word embedding vectors can be used for a variety of natural language processing tasks, such as text classification, sentiment analysis, document similarity calculation, etc.

The main idea of Word2Vec is to capture the semantic information of words through the distribution of words in vector space, that is, words with similar semantics are closer in distance in vector space. This continuous vector representation allows the model to better understand the semantic relationships between words, rather than just simple word frequency statistics. Word2Vec has become a very useful tool in the field of natural language processing for improving the performance of various NLP tasks.
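For reference, gensim (already listed in the requirements above) can train both variants; the sg parameter switches between CBOW (sg=0, the default) and Skip-gram (sg=1). The tiny tokenized corpus below is only a placeholder to show the API:

from gensim.models import Word2Vec

# Placeholder corpus: a real model needs far more (pre-segmented) sentences
sentences = [["keyword", "matching", "text", "processing"],
             ["word", "vector", "text", "similarity"],
             ["chinese", "word", "segmentation", "tool"]]

# sg=0 -> CBOW (predict the target word from its context),
# sg=1 -> Skip-gram (predict the context from the target word)
cbow = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0, epochs=50)
skipgram = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1, epochs=50)

print(cbow.wv["text"].shape)                     # a 100-dimensional vector
print(skipgram.wv.most_similar("text", topn=3))  # nearest neighbours in the toy space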

Model

https://ai.tencent.com/ailab/nlp/en/download.html
tencent-ailab-embedding-zh-d100-v0.2.0-s is a Chinese word embedding model released by Tencent. The name of this model contains some key information:

  1. “tencent” means it is developed or provided by Tencent.

  2. “ailab” indicates that this model may be related to Tencent’s Artificial Intelligence Laboratory (AI Lab).

  3. “embedding” means that this is a word embedding model that maps words into real vector space.

  4. “zh” indicates that this is a Chinese language model.

  5. “d100” means that the dimension of the word embedding vector of each word is 100, that is, each word is represented by a vector containing 100 real values in the vector space.

  6. “v0.2.0-s” is the version identifier of the model.

The main purpose of this model is to map Chinese words into a 100-dimensional vector space for use in natural language processing tasks. Such word embedding vectors are commonly used in text analysis, text classification, text similarity calculation, named entity recognition, sentiment analysis and other natural language processing tasks.
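Assuming the embedding file has been downloaded and unpacked to the path used below (adjust it to your local layout), a quick sanity check is to look up a word vector and its nearest neighbours:

from gensim.models import KeyedVectors

# Path is an assumption; point it at your unpacked Tencent embedding file
model_path = "model/tencent-ailab-embedding-zh-d100-v0.2.0-s/tencent-ailab-embedding-zh-d100-v0.2.0-s.txt"
w2v_model = KeyedVectors.load_word2vec_format(model_path, binary=False)

print(w2v_model.vector_size)                   # 100, as the "d100" in the name indicates
print(w2v_model["文本"].shape)                  # (100,) vector for one word, if it is in the vocabulary
print(w2v_model.most_similar("文本", topn=5))   # nearest neighbours in the embedding space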

Code

import pandas as pd
import numpy as np
import gensim
from gensim.models import KeyedVectors
import jieba

# Load the pre-trained Chinese Word2Vec model
model_path = "model/tencent-ailab-embedding-zh-d100-v0.2.0-s/tencent-ailab-embedding-zh-d100-v0.2.0-s.txt"
w2v_model = KeyedVectors.load_word2vec_format(model_path, binary=False)

# target sentence
target_sentence = "Keyword matching and text processing tasks"

# Segment the target sentence and calculate the sentence vector
def target_sentence_to_vector(sentence, model):
    words = list(jieba.cut(sentence))
    vector = np.zeros(model.vector_size)
    word_count = 0
    for word in words:
        if word in model:
            vector += model[word]
            word_count += 1
    if word_count > 0:
        vector /= word_count
    return vector

target_vector = target_sentence_to_vector(target_sentence, w2v_model)

# sentence
df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'Sentences':["The weather is really nice today, the sun is shining brightly.",
                                "Keyword matching is a common text processing task.",
                                "The computer does not understand human language and needs to be converted into word vectors.",
                                "Prosperity, democracy, civilization, harmony, freedom, equality, justice, rule of law, patriotism, dedication, integrity, and friendliness.",
                                "Chinese word segmentation tools are very helpful for text processing.",]})

# Segment the sentence and calculate the sentence vector
def sentence_to_vector(sentence, model):
    words = list(jieba.cut(sentence))
    vector = np.zeros(model.vector_size)
    word_count = 0
    for word in words:
        if word in model:
            vector += model[word]
            word_count += 1
    if word_count > 0:
        vector /= word_count
    return vector

df['vector'] = df['Sentences'].apply(lambda x: sentence_to_vector(x, w2v_model))

# Calculate the similarity between "target_sentence" and the sentence in DataFrame
def calculate_similarity(target_vector, sentence_vector):
    similarity = np.dot(target_vector, sentence_vector) / (np.linalg.norm(target_vector) * np.linalg.norm(sentence_vector))
    return similarity

df['Similarity_Score'] = df['vector'].apply(lambda x: calculate_similarity(target_vector, x))

# Sort the DataFrame by "Similarity_Score" from high to low
df = df.sort_values(by='Similarity_Score', ascending=False)#.reset_index(drop=True)
df.set_index('id', inplace=True)
print('Similarity sorting:', df.index.to_list())
n = 10 # Select the top n most similar sentences
# Output a DataFrame containing "id", "sentence" and "Similarity_Score"
df[['Sentences', 'Similarity_Score']][:n]

Results

SpaCy

Concept

SpaCy (pronounced “spacey”) is an open source natural language processing (NLP) library and toolkit designed to process and analyze text data. It provides various functions and algorithms for natural language processing tasks such as text processing, lexical analysis, syntactic analysis, named entity recognition, part-of-speech tagging, dependency analysis, and text vectorization. SpaCy is designed with performance and efficiency in mind, making it very fast when processing large-scale text data.

Here are some of the main functions and features provided by SpaCy:

  1. Lexical analysis: SpaCy can segment text into sentences and words, tokenize them, and provide detailed information about each word, such as its lemma and part-of-speech tag.

  2. Syntax analysis: SpaCy can identify dependencies in sentences, that is, the grammatical relationships between words. This helps understand sentence structure and connections between words.

  3. Named Entity Recognition (NER): SpaCy can recognize named entities in text, such as person names, place names, organization names, etc., and classify them into different categories, such as person names, place names, dates, etc.

  4. Part-of-speech tagging: SpaCy can assign a part-of-speech tag to each word, helping users understand the grammatical roles of different words in the text.

  5. Text vectorization: SpaCy provides pre-trained word embedding models that convert text into vector representations, which facilitates tasks such as text classification, clustering, and similarity analysis.

  6. Support for multiple languages: SpaCy supports multiple natural languages, including English, German, French, Spanish, Dutch, Chinese, etc., so it can be used to process text data in different languages.

  7. High performance: SpaCy’s implementation is optimized to handle large-scale text data and therefore performs well in terms of performance.

SpaCy is widely used in natural language processing and text analysis for a variety of applications, including information retrieval, text classification, entity recognition, sentiment analysis, machine translation, and more. Due to its performance and functionality, SpaCy is a powerful tool for researchers and engineers working with text data.
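A short sketch of points 1 to 4 with the Chinese pipeline installed above (the example sentence is arbitrary):

import spacy

nlp = spacy.load("zh_core_web_sm")    # the small Chinese pipeline installed earlier
doc = nlp("腾讯于1998年在深圳成立。")   # arbitrary example sentence

# Tokenization, part-of-speech tags and dependency labels (points 1, 2 and 4)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named entities recognized in the sentence (point 3)
for ent in doc.ents:
    print(ent.text, ent.label_)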

Model

https://spacy.io/
SpaCy provides more than one Chinese language model to meet the needs of different application scenarios. The following are some common SpaCy Chinese language models:

  1. zh_core_web_sm: This is SpaCy’s lightweight Chinese language model that provides basic Chinese text processing functions, including word segmentation, part-of-speech tagging, syntactic analysis, and named entity recognition. It is suitable for general Chinese text processing tasks.

  2. zh_core_web_md: This model is larger than zh_core_web_sm and contains more lexical and grammatical information. It is suitable for tasks that require more complex analysis, such as text mining and text understanding.

  3. zh_core_web_lg: This is SpaCy’s large Chinese language model, which contains a large amount of vocabulary and grammatical information. It is suitable for processing large-scale text data, such as text classification, clustering and information retrieval.

The choice of these models depends on the specific application requirements and the size of the text data. Generally speaking, if you only need to do basic text processing and analysis, zh_core_web_sm may be sufficient. But if you need more grammatical and lexical information, or handle larger data sizes, you can consider using zh_core_web_md or zh_core_web_lg. SpaCy provides these different versions of Chinese language models so that users can flexibly choose the appropriate model based on their specific needs.
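One practical difference is that the sm pipeline ships without static word vectors, so the Doc.similarity call used in the code below relies on the pipeline's context tensors and may emit a warning, while md and lg bundle vectors. A quick way to check what a downloaded pipeline provides:

import spacy

# Assumes the pipeline has been downloaded with `python -m spacy download zh_core_web_sm` (or _md / _lg)
nlp = spacy.load("zh_core_web_sm")

# (rows, dims) of the bundled static word-vector table; empty for the sm pipeline
print(nlp.vocab.vectors.shape)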

Code

import warnings
# Suppress spaCy's warning that zh_core_web_sm has no static word vectors
# (Doc.similarity then falls back to the pipeline's context tensors)
warnings.filterwarnings("ignore")
import spacy
import pandas as pd

# Load Spacy Chinese language model
nlp = spacy.load("zh_core_web_sm")

# sentence
df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'Sentences':["The weather is really nice today, the sun is shining brightly.",
                                "Keyword matching is a common text processing task.",
                                "The computer does not understand human language and needs to be converted into word vectors.",
                                "Prosperity, democracy, civilization, harmony, freedom, equality, justice, rule of law, patriotism, dedication, integrity, and friendliness.",
                                "Chinese word segmentation tools are very helpful for text processing.",]})

# Select the target sentence to compare
target_sentence = "Keyword matching and text processing tasks"

# Compute the similarity of each sentence to the target sentence
target_doc = nlp(target_sentence)
similarity_scores = []
for sentence in df["Sentences"]:
    doc = nlp(sentence)
    similarity = target_doc.similarity(doc)
    similarity_scores.append(similarity)

# Add the similarity score to the data frame
df["Similarity Score"] = similarity_scores

# Sort in descending order according to the similarity score and select the n sentences with the highest similarity.
n = 10 # Select the top n most similar sentences
top_n_similar_sentences = df.sort_values(by="Similarity Score", ascending=False).head(n)
top_n_similar_sentences.set_index('id', inplace=True)
print('Similarity sorting:', top_n_similar_sentences.index.to_list())
# Print the DataFrame
top_n_similar_sentences

Results

Bert

Concept

BERT (Bidirectional Encoder Representations from Transformers) is a natural language processing (NLP) model released by Google in 2018. It is a deep learning model based on the Transformer architecture, pre-trained to learn contextual representations of words and phrases for use in a variety of natural language processing tasks.

The main features of BERT include:

  1. Bidirectionality: BERT emphasizes bidirectional context modeling, that is, both the left and right contexts of a word are considered during training. This allows BERT to better understand the meaning of words in different contexts.

  2. Pre-training: BERT obtains language understanding capabilities through pre-training on large-scale text corpora. It is “unsupervised” pre-trained on large-scale text data and learns the semantic representation of words and sentences.

  3. Fine-tuning: After pre-training, BERT can be adapted to specific natural language processing tasks, such as text classification, named entity recognition, sentiment analysis, etc., through fine-tuning. The fine-tuning process can be supervised based on specific tasks.

BERT has revolutionized the field of natural language processing as it has achieved leading performance on multiple tasks. Its pre-training capabilities enable it to capture complex semantic relationships between words and phrases, thereby reducing the need to manually design features. BERT’s model structure and pre-trained weights can also be used for transfer learning, so tasks on small-scale data sets can also benefit from BERT learning.

In addition to the original BERT model, there are various BERT-based variants, such as RoBERTa, ALBERT, DistilBERT, etc., which further improve performance through different training strategies and architectural improvements. BERT variants have become one of the main tools in natural language processing research and applications, and are widely used in text understanding and analysis tasks.

Model

https://huggingface.co/

“huggingface” is an open source platform in the field of natural language processing (NLP) that provides many pre-trained deep learning models and tools to help researchers and developers more easily use and deploy these models in text processing tasks. “bert-base-chinese” is a Chinese natural language processing model released by huggingface, based on the BERT (Bidirectional Encoder Representations from Transformers) model developed by the Google research team.

Here is some important information about “bert-base-chinese”:

  1. BERT model: BERT is a pre-trained neural network model trained on large-scale text data with a deep bidirectional encoder structure. This enables BERT to understand the surrounding context of words, allowing it to excel in a variety of natural language processing tasks. The “bert-base-chinese” version is a BERT model pre-trained specifically for Chinese text.

  2. Chinese support: “bert-base-chinese” benefits from its pre-training dataset and can understand and process Chinese text. This includes multi-level text information such as Chinese word segmentation, syntactic structure, and semantics.

  3. Applicability: “bert-base-chinese” can be used for a variety of natural language processing tasks, including text classification, named entity recognition, sentiment analysis, question answering systems, etc. You can use this model to complete different Chinese text processing tasks without training a model from scratch.

  4. huggingface Transformers library: huggingface provides a Transformers library, which includes pre-trained models such as “bert-base-chinese” and tools for loading, fine-tuning, and evaluating these models. This allows developers to easily integrate and use these models to accelerate their natural language processing projects.
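As a small sketch of point 4, the tokenizer can be loaded on its own to see how a Chinese sentence is split into tokens and ids (this assumes the model files are available locally or can be fetched from the Hugging Face Hub):

from transformers import AutoTokenizer

# Use a local copy (e.g. "model/bert-base-chinese" as in the code below) or the Hub name
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

text = "关键词匹配与文本处理"
print(tokenizer.tokenize(text))                            # character-level WordPiece tokens
print(tokenizer(text, return_tensors="pt")["input_ids"])   # ids including [CLS] and [SEP]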

Code

import warnings
warnings.filterwarnings("ignore")
from transformers import AutoTokenizer, AutoModel
import pandas as pd
import torch
from sklearn.metrics.pairwise import cosine_similarity

#Load the BERT model and tokenizer
model_name = "model/bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# sentence
df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'Sentences':["The weather is really nice today, the sun is shining brightly.",
                                "Keyword matching is a common text processing task.",
                                "The computer does not understand human language and needs to be converted into word vectors.",
                                "Prosperity, democracy, civilization, harmony, freedom, equality, justice, rule of law, patriotism, dedication, integrity, and friendliness.",
                                "Chinese word segmentation tools are very helpful for text processing.",]})

# Select the target sentence to compare
target_sentence = "Keyword matching and text processing tasks"

# Compute the embeddings of all sentences in the DataFrame
sentence_embeddings = []

for sentence in df["Sentences"]:
    # Use tokenizer to encode sentences
    sentence_tokens = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    # Calculate the sentence embedding
    sentence_embedding = model(**sentence_tokens).last_hidden_state.mean(dim=1)
    sentence_embeddings.append(sentence_embedding)

# Compute the embedding of the target sentence
target_sentence_tokens = tokenizer(target_sentence, return_tensors="pt", padding=True, truncation=True)
target_embedding = model(**target_sentence_tokens).last_hidden_state.mean(dim=1)

# Calculate the similarity of all sentences
similarity_scores = []

for sentence_embedding in sentence_embeddings:
    # Calculate cosine similarity
    similarity = cosine_similarity(target_embedding.detach().numpy(), sentence_embedding.detach().numpy())[0][0]
    similarity_scores.append(similarity)

# Add similarity score to DataFrame
df["Similarity Score"] = similarity_scores

# Sort in descending order according to the similarity score and select the n sentences with the highest similarity.
n=10
top_n_similar_sentences = df.sort_values(by="Similarity Score", ascending=False).head(n)

top_n_similar_sentences.set_index('id', inplace=True)
print('Similarity sorting:', top_n_similar_sentences.index.to_list())
# Display the DataFrame
top_n_similar_sentences

Results

Comparison