n-gram language model – text generation source code

  • Basic principles of n-gram model
  • Steps of text generation
    • 1. Preparation and word segmentation
    • 2. Build n-gram model
    • 3. Application of smoothing technology
    • 4. Generate text
  • Source code

In the field of natural language processing, the n-gram language model is a basic yet powerful tool. By modeling sequences of words, it can be used effectively for text generation. This post discusses how to use an n-gram model to generate text, focusing on Chinese text: jieba is used for word segmentation and the nltk library for model construction.

I explained the principles of the n-gram model in a previous post; see: n-gram language model – calculation and smoothing of sentence probability distribution.

Basic principles of n-gram model

The n-gram model is based on a simple assumption: the occurrence of a word depends only on a limited number of preceding words. Depending on how many previous words are considered, such models come in different orders, for example bigrams and trigrams.

Taking the bigram as an example, the probability of a word appearing is approximated as depending only on the word immediately before it:

$$p(w_1 w_2 \cdots w_n) \approx \prod_{i=1}^{n} p(w_i \mid w_{i-1})$$

To make $p(w_i \mid w_{i-1})$ meaningful when $i$ is 1, we usually add a start mark (BOS) at the beginning of the sentence and an end mark (EOS) at the end, and include them in the probability calculation. For example, to calculate the probability of "Mark wrote a book", we compute:

$$p(\text{Mark wrote a book}) = p(\text{Mark} \mid \text{BOS}) \cdot p(\text{wrote} \mid \text{Mark}) \cdot p(\text{a} \mid \text{wrote}) \cdot p(\text{book} \mid \text{a}) \cdot p(\text{EOS} \mid \text{book})$$
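
As a quick numeric illustration (the bigram probabilities below are invented for this example, not estimated from any corpus), the sentence probability is simply the product of the conditional probabilities:

# Hypothetical bigram probabilities, chosen only to illustrate the chain of products
bigram_probs = {
    ("BOS", "Mark"): 0.2,
    ("Mark", "wrote"): 0.5,
    ("wrote", "a"): 0.4,
    ("a", "book"): 0.3,
    ("book", "EOS"): 0.6,
}

sentence = ["BOS", "Mark", "wrote", "a", "book", "EOS"]
prob = 1.0
for prev, cur in zip(sentence, sentence[1:]):
    prob *= bigram_probs[(prev, cur)]

print(prob)  # ≈ 0.0072 (0.2 * 0.5 * 0.4 * 0.3 * 0.6)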

To estimate $p(w_i \mid w_{i-1})$, we can simply count how often each word pair occurs in a given text and then normalize. If $c$ denotes the number of occurrences in that text, we can use the following formula:

$$p(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{\sum_{w} c(w_{i-1}, w)}$$

The above formula is the Maximum Likelihood Estimation (MLE). This formula also applies to higher-order n-gram models.
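
A minimal sketch of this count-and-normalize estimate on a toy corpus (the sentences below are made up purely for illustration):

from collections import Counter, defaultdict

# Toy corpus with explicit BOS/EOS markers
sentences = [
    ["BOS", "Mark", "wrote", "a", "book", "EOS"],
    ["BOS", "Mark", "read", "a", "book", "EOS"],
]

# c(w_{i-1}, w_i): counts of adjacent word pairs
pair_counts = Counter()
for sent in sentences:
    pair_counts.update(zip(sent, sent[1:]))

# Total count of each history word w_{i-1}, used as the normalizer
history_totals = defaultdict(int)
for (prev, _), n in pair_counts.items():
    history_totals[prev] += n

mle = {pair: n / history_totals[pair[0]] for pair, n in pair_counts.items()}

print(mle[("Mark", "wrote")])  # 0.5: "Mark" is followed by "wrote" in one of its two occurrences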

Steps of text generation

1. Preparation and word segmentation

Use jieba to segment the Chinese text; this is the first step when building an n-gram model for Chinese. The segmented tokens are then used to build the n-gram model, as sketched below.
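
A minimal sketch of this step (the sample sentence is made up; the exact segmentation depends on jieba's dictionary and version):

import jieba

# Segment a Chinese sentence into a list of tokens
text = "自然语言处理是人工智能的一个重要方向"
tokens = list(jieba.cut(text))
print(tokens)  # e.g. ['自然语言', '处理', '是', '人工智能', '的', '一个', '重要', '方向']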

2. Build n-gram model

Use nltk's ngrams function to create a sequence of bigrams from the segmentation results. These bigrams are then used to construct a conditional frequency distribution (ConditionalFreqDist) for subsequent text generation, as sketched below.
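
A small sketch of this step, reusing the hypothetical tokens from the segmentation example above:

import nltk
from nltk import ConditionalFreqDist

tokens = ['自然语言', '处理', '是', '人工智能', '的', '一个', '重要', '方向']

# Build bigrams and count how often each word follows each preceding word
bi_grams = list(nltk.ngrams(tokens, 2))
cfd = ConditionalFreqDist(bi_grams)

print(cfd['人工智能'].most_common())  # [('的', 1)]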

3. Application of smoothing technology

In the n-gram model, smoothing techniques are needed to handle word combinations that do not appear in the training data. Lidstone smoothing and Laplace smoothing are two common methods: by adding a small non-zero value to every count, they avoid the zero-probability problem and make the model more robust (see the sketch below).

For details on Lidstone and Laplace smoothing, refer to the previous post n-gram language model – sentence probability distribution calculation and smoothing.
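
A short sketch of wrapping each conditional frequency distribution with NLTK's smoothed probability distributions (the token list is the same hypothetical one as above):

import nltk
from nltk import ConditionalFreqDist
from nltk.probability import LidstoneProbDist, LaplaceProbDist

tokens = ['自然语言', '处理', '是', '人工智能', '的', '一个', '重要', '方向']
cfd = ConditionalFreqDist(nltk.ngrams(tokens, 2))

# Lidstone smoothing adds a small gamma (0 < gamma < 1) to every count;
# Laplace smoothing is the special case gamma = 1
lidstone_cfd = {cond: LidstoneProbDist(cfd[cond], gamma=0.1) for cond in cfd.conditions()}
laplace_cfd = {cond: LaplaceProbDist(cfd[cond]) for cond in cfd.conditions()}

print(laplace_cfd['人工智能'].prob('的'))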

4. Generate text

Text generation starts from an initial word and repeatedly samples the next word according to the conditional (smoothed) distribution. The process is repeated until the required number of words is reached or a stopping condition is met. A single sampling step is sketched below; the complete loop appears in the source code section.
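
One way to carry out a single generation step is weighted sampling with random.choices over the smoothed distribution of the current word; here is a self-contained sketch with invented toy counts:

import random
from nltk import FreqDist
from nltk.probability import LaplaceProbDist

# Smoothed distribution over words that can follow the current word (toy counts)
dist = LaplaceProbDist(FreqDist({'的': 3, '领域': 1}))

# Each candidate next word is drawn with its smoothed probability as the weight
candidates = list(dist.samples())
next_word = random.choices(candidates, weights=[dist.prob(w) for w in candidates])[0]
print(next_word)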

Source code

The following is a complete example that uses Laplace-smoothed (and Lidstone-smoothed) bigram models to generate text:

import nltk
from nltk import ConditionalFreqDist
from nltk.probability import LidstoneProbDist, LaplaceProbDist
import random
import jieba

# Read the sample corpus from ylk.txt
text = open("ylk.txt", encoding="utf-8").read()
# Segment the text with jieba
tokens = jieba.cut(text)
# Generate bigrams from the token stream
bi_grams = list(nltk.ngrams(tokens, 2))
# Create a conditional frequency distribution object
cfd = ConditionalFreqDist(bi_grams)

# Use Lidstone smoothing with a gamma value less than 1
lidstone_cfd = {condition: LidstoneProbDist(cfd[condition], gamma=0.1) for condition in cfd.conditions()}

# Use Laplace smoothing, the special case of gamma = 1
laplace_cfd = {condition: LaplaceProbDist(cfd[condition]) for condition in cfd.conditions()}

def generate_text(initial_word, cfd, num_words=50):
    current_word = initial_word
    generated_text = [current_word]

    for _ in range(num_words - 1):
        # Stop if the current word never appeared as a bigram history in the corpus
        if current_word not in cfd:
            break
        # Sample the next word according to the smoothed conditional probabilities
        next_word = random.choices(
            population=list(cfd[current_word].samples()),
            weights=[cfd[current_word].prob(w) for w in cfd[current_word].samples()]
        )[0]
        generated_text.append(next_word)
        current_word = next_word

    return ''.join(generated_text)

# Example: generate 100 words starting from the protagonist's name "Fang Rui"
print(generate_text("Fang Rui", laplace_cfd, 100))
print(generate_text("Fang Rui", lidstone_cfd, 100))

For the corpus, I downloaded an online novel of more than 4 MB and then generated text starting from the protagonist's name.

Subjectively, I feel the Lidstone-smoothed model produces better text.

By combining the n-gram model with smoothing techniques, we can generate text that roughly follows the patterns of the language. Although the method is simple, it remains useful in many scenarios, especially when resources are limited.

With the development of deep learning, far more powerful language models have emerged, and the n-gram model no longer feels competitive for text generation~
