Exploring Topic Modeling: Using LDA to Analyze Text Topics

In the fields of data analysis and text mining, topic modeling is a powerful tool for automatically discovering hidden topics in text data. Latent Dirichlet Allocation (LDA) is a common technique for topic modeling. This article will introduce how to perform LDA topic modeling using Python and the Gensim library and explore various aspects of topic modeling.

What is topic modeling?

Topic modeling is a technique used to extract themes or topics from text data. Topics can be thought of as general descriptions of text data that cover key concepts in the text. Topic modeling can be applied to various fields, such as document classification, information retrieval, recommendation systems, etc.

LDA application scenarios

The LDA (Latent Dirichlet Allocation) model is widely used in the field of natural language processing (NLP). The following are some common application scenarios.

  1. Feature generation: LDA can generate features for use by other machine learning algorithms. For example, LDA infers a topic distribution for each article; the distribution over K topics yields K numerical features, which can be fed into algorithms such as logistic regression or decision trees for prediction tasks.
  2. News quality classification: The quality of the news that news apps pull from various sources varies widely. We can manually design some traditional features: the source site, the length of the content, the number of pictures, the popularity of the news, and so on. In addition to these hand-crafted features, a topic model can compute the topic distribution of each news article as extra features, forming a richer feature set together with the hand-crafted ones.
  3. Short text-short text semantic matching: This has a wide range of applications in industry. For example, in web search we need to measure the semantic relevance between a user query and a web page title; in query recommendation we need to measure the similarity between one query and other queries.
  4. Short text-long text semantic matching: This is also very common in industry. For example, in a search engine we need to compute the semantic relevance between a user query and the text of a web page (its content).
  5. Long text-long text semantic matching: Using topic models, we can obtain the topic distributions of two long texts and then measure their similarity by computing the distance between the two multinomial distributions (see the sketch after this list).
  6. Personalized news recommendation: Long text-long text semantic matching can also be used for personalized recommendation. For example, in personalized news recommendation, we can merge the news articles (or titles) a user has recently read into one long “document” and use that “document’s” topic distribution as a user profile expressing the user’s reading interests.
  7. Vertical news CTR estimation: News recommendation services cover multiple vertical directions, such as sports, automobiles, and entertainment, where more fine-grained personalized recommendation is needed; topic distributions can serve as additional input features for click-through-rate (CTR) estimation in these verticals.
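For items 5 and 6, here is a minimal sketch of comparing two texts by the distance between their topic distributions. It assumes a trained lda_model and dictionary (built later in this article) and two token lists, tokens_a and tokens_b, for the texts being compared:

from gensim.matutils import hellinger

# Infer the topic distribution of each text from its bag-of-words form
bow_a = dictionary.doc2bow(tokens_a)
bow_b = dictionary.doc2bow(tokens_b)
dist_a = lda_model.get_document_topics(bow_a, minimum_probability=0.0)
dist_b = lda_model.get_document_topics(bow_b, minimum_probability=0.0)

# Hellinger distance between the two distributions: smaller means more similar
print(hellinger(dist_a, dist_b))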

Using LDA for topic modeling

Latent Dirichlet Allocation (LDA) is a probabilistic graphical model for topic modeling. The basic idea is that each document is composed of a set of topics, and each topic is composed of a set of words. LDA attempts to find the best combination of topics and words to explain the given text data.

Readers interested in the underlying theory can refer to these articles (in Chinese):

https://zhuanlan.zhihu.com/p/309419680

https://zhuanlan.zhihu.com/p/31470216

Here are the steps on how to perform LDA topic modeling using Python and the Gensim library:

Step 1: Text preprocessing

Before topic modeling, the text needs to be preprocessed. This includes word segmentation and the removal of stop words, punctuation, and so on. For segmentation you can use tools such as jieba, and for stop word removal you can use the nltk library.

Example:

import string

import jieba
from nltk.corpus import stopwords

# Chinese text segmentation
def tokenize(text):
    return list(jieba.cut(text))

# Remove Chinese stop words from a list of tokens
def delete_stopwords(tokens):
    # tokens should already be segmented text (a word list); if you start
    # from a raw string, segment it first with a tool such as jieba

    # Load the Chinese stop word list (requires nltk.download('stopwords'))
    stop_words = set(stopwords.words('chinese'))

    # Remove stop words
    filtered_words = [word for word in tokens if word not in stop_words]

    # Rebuild the text
    filtered_text = ' '.join(filtered_words)

    return filtered_text

def remove_punctuation(input_string):
    # Build a mapping table in which all punctuation marks are mapped to None
    all_punctuation = string.punctuation + "!。\"#$%&'()*+,-/:; 『』【】〔〕〖〗〝〞——‘’?“”…?﹏.\t "
    translator = str.maketrans('', '', all_punctuation)
    # Use the mapping table to remove all punctuation
    no_punct = input_string.translate(translator)
    return no_punct

These functions can be used during text preprocessing to prepare text data for natural language processing tasks. Here is a description of each function:

  1. tokenize(text): Uses the jieba library to segment Chinese text into words. It accepts a text string as input and returns a list containing the segmentation result.
  2. delete_stopwords(tokens): Removes stop words from segmented Chinese text. It accepts a word list, loads the Chinese stop word list, removes the stop words, and returns a text string with the stop words removed.
  3. remove_punctuation(input_string): Removes punctuation marks from text. It uses a mapping table that maps every punctuation character to None, thereby deleting them, and returns a text string with punctuation removed.

This completes simple data preprocessing.
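As a quick check, the three helpers can be chained on a sample sentence (a minimal sketch; the sample text below is purely illustrative):

# Illustrative end-to-end preprocessing of one sentence,
# using the three functions defined above
raw_text = "我喜欢用Python做数据分析!"   # sample input (illustrative)

clean_text = remove_punctuation(raw_text)   # strip punctuation first
tokens = tokenize(clean_text)               # segment with jieba
filtered_text = delete_stopwords(tokens)    # drop stop words

print(filtered_text)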

Step 2: Create dictionary and document-word frequency matrix

LDA uses a bag-of-words model. In a bag-of-words model we only consider whether (and how often) a word appears in a document, not the order in which words appear. Under this model, “I like you” and “You like me” are equivalent. The opposite of the bag-of-words model is the n-gram model, which does take word order into account.
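This equivalence is easy to verify with Gensim (a minimal sketch; the two sample sentences are illustrative):

import jieba
from gensim import corpora

# Two sentences with the same words in a different order:
# "I like you" vs. "You like me"
tokens_a = list(jieba.cut("我喜欢你"))
tokens_b = list(jieba.cut("你喜欢我"))

dictionary = corpora.Dictionary([tokens_a, tokens_b])

# Word order is discarded, so both sentences yield the same bag of words
print(dictionary.doc2bow(tokens_a) == dictionary.doc2bow(tokens_b))  # True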

Using the Gensim library, you can create a dictionary and a document-word frequency matrix from the documents. The dictionary contains the words of all documents, and the document-word frequency matrix records the frequency of each word in each document.

# Create the dictionary and the document-word frequency matrix
dictionary = corpora.Dictionary([tokens])
corpus = [dictionary.doc2bow(tokens)]

  1. dictionary = corpora.Dictionary([tokens]): This line creates the dictionary for the documents. The dictionary maps each word in the text to a unique ID; tokens is the token list produced by preprocessing. Building the dictionary establishes the word-to-ID mapping needed by the subsequent steps.
  2. corpus = [dictionary.doc2bow(tokens)]: This line creates the document-word frequency matrix (corpus). corpus is a list of documents, each represented as a bag of words, i.e. the ID and frequency of every word appearing in that document. The doc2bow method converts the words of a document into this bag-of-words representation.

To help understand these two data structures, refer to the following code sample:

def test():
    from gensim import corpora

    # Create sample text data
    sample_texts = [
        "This is the first document This This ",
        "This document is the second document ",
        "And this is the third one ",
        "Is this the first document "
    ]

    # Segment words and create the dictionary
    tokenized_texts = [text.split() for text in sample_texts]
    dictionary = corpora.Dictionary(tokenized_texts)

    # Mapping from word to ID
    word_to_id = dictionary.token2id

    # Mapping from ID to word
    id_to_word = {v: k for k, v in word_to_id.items()}

    # Print the ID-to-word mapping
    print("Mapping from ID to word:")
    for word_id, word in id_to_word.items():
        print(f"ID: {word_id}, word: {word}")

    # Create the document-word frequency matrix
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_texts]

    # Print the document-word frequency matrix
    print("Document-word frequency matrix:")
    for doc in corpus:
        print(doc)
Running test() prints the mapping from ID to word, followed by the document-word frequency matrix: each document is shown as a list of (word ID, frequency) pairs, for example (0, 3) for a word with ID 0 that appears three times in a document.

Step 3: Run the LDA model

LDA models can be run using Gensim’s LdaModel class. The number of topics, the dictionary, and the document-word frequency matrix need to be supplied as input parameters. The model then automatically learns the topic and word distributions.

# Run the LDA model
lda_model = models.LdaModel(corpus, num_topics=15, id2word=dictionary, passes=50)

  • num_topics is the number of topics to generate. In LDA this is a hyperparameter that must be specified in advance; choose it to fit your data and analysis goals, typically based on domain knowledge or experimentation (see the coherence sketch after this list).
  • passes is the number of passes the model makes over the corpus during training; the LDA model refines the document-topic and word-topic distributions over these repeated passes. Increasing passes usually improves the model but also increases training time. A value between 10 and 50 is a common choice, depending on the size and complexity of the data set.
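Here is a minimal sketch of choosing num_topics by comparing topic coherence. It assumes the tokenized_texts, dictionary, and corpus variables from the earlier example; the candidate values are illustrative:

from gensim import models
from gensim.models import CoherenceModel

# Compare coherence scores for several candidate topic counts;
# higher c_v coherence usually indicates more interpretable topics
for k in (5, 10, 15, 20):
    lda = models.LdaModel(corpus, num_topics=k, id2word=dictionary, passes=10)
    cm = CoherenceModel(model=lda, texts=tokenized_texts,
                        dictionary=dictionary, coherence='c_v')
    print(f"num_topics={k}, coherence={cm.get_coherence():.4f}")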

Step 4: Extract the topic

Once model training is complete, topics can be extracted using the show_topics method. Each topic is represented by a set of highly weighted words.

# Extract topic
topics = lda_model.show_topics(num_words=8)

# Output topic
for topic in topics:
    print(topic)

The output is a list of tuples. The first element of each tuple is the topic ID; the string that follows lists the topic’s words, and the number in front of each word is that word’s weight within the topic.
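For reference, the printed topics have roughly the following shape (the words and weights below are made up purely for illustration):

(0, '0.042*"数据" + 0.031*"分析" + 0.025*"模型" + ...')
(1, '0.038*"新闻" + 0.027*"推荐" + 0.021*"用户" + ...')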

Step 5: Result Analysis

Finally, analyze and interpret the extracted topics. You can inspect the high-weight words to understand what each topic covers, and then use the topic model for document classification, information retrieval, and other applications.
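For example, the trained model can infer the topic distribution of an unseen document, which is the basis for classification and retrieval. A minimal sketch (new_text is illustrative, and the preprocessing helpers come from Step 1):

# Infer the topic distribution of an unseen document
new_text = "这是一篇待分析的新文章"   # illustrative input
new_tokens = tokenize(remove_punctuation(new_text))
new_bow = dictionary.doc2bow(new_tokens)

# Each entry is (topic_id, probability) for topics above a small threshold
for topic_id, prob in lda_model.get_document_topics(new_bow):
    print(f"topic {topic_id}: {prob:.3f}")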

How to save and load models

In practical applications, it is usually necessary to save a trained LDA model for later use. Models can be saved and loaded using Gensim’s save and load methods.

Save model:

from gensim import corpora, models
import os

# Assume you already have a corpus `corpus` and dictionary `dictionary`, as well as a trained LDA model `lda_model`

# Save dictionary
dictionary.save("my_dictionary.dict")

# Save corpus
corpora.MmCorpus.serialize("my_corpus.mm", corpus)

# Save LDA model
lda_model.save("my_lda_model.model")

Load model:

from gensim import corpora, models

# Load dictionary
dictionary = corpora.Dictionary.load("my_dictionary.dict")

# Load corpus
corpus = corpora.MmCorpus("my_corpus.mm")

# Load LDA model
lda_model = models.LdaModel.load("my_lda_model.model")

The role of weight values

In the LDA model, each word in a topic has a weight that indicates its importance within that topic. These weights can be used for topic identification, document classification, and information retrieval. High-weight words are usually the most representative of a topic, so they help in understanding the topic’s content or in building a topic word cloud.
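As an illustration, one topic’s weights can be turned into a word cloud. This minimal sketch assumes the third-party wordcloud package is installed and that font_path points to a font that can render Chinese (both are assumptions, not part of Gensim):

from wordcloud import WordCloud

# show_topic returns (word, weight) pairs for one topic
topic_words = dict(lda_model.show_topic(0, topn=30))

# font_path must point to a Chinese-capable font (illustrative path)
wc = WordCloud(font_path="simhei.ttf", background_color="white")
wc.generate_from_frequencies(topic_words)
wc.to_file("topic_0_wordcloud.png")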

Summary

Topic modeling is an important technique in text mining that automatically discovers the topics in text data. LDA is a commonly used topic modeling method and can be implemented with Python and the Gensim library. Through text preprocessing, model training, and result analysis, the hidden topics in text data can be extracted effectively and put to use in a wide range of applications.