Naive Bayesian Algorithm for English Text Classification

Table of Contents

  • 1. Introduction of the authors
  • 2. Introduction and Case of Naive Bayesian Algorithm
    • 2.1 Introduction to Naive Bayes Algorithm
    • 2.2 Text Classifier
    • 2.3 Text classification of news texts
  • 3. Python code implementation
    • 3.1 Text Classifier
    • 3.2 News Text Classification
  • References

1. Introduction of the authors

Liang Youcheng, male, School of Electronic Information, Xi’an Polytechnic University, 2022 graduate student
Research direction: 3D point cloud processing
Email: [email protected]

Lu Zhidong, male, School of Electronic Information, Xi’an Polytechnic University, 2022 graduate student in Zhang Hongwei’s artificial intelligence research group
Research direction: machine vision and artificial intelligence
Email: [email protected]

2. Introduction and Case of Naive Bayesian Algorithm

2.1 Introduction to Naive Bayesian Algorithm

The Naive Bayes algorithm is a probabilistic classifier based on probability theory and Bayes’ theorem that predicts the category of a text (such as a news article or a customer review). It works probabilistically: the algorithm computes the probability that a given text belongs to each class and then outputs the class with the highest probability.

More abstractly, a Naive Bayes classifier considers the class-conditional probability of each feature dimension separately and then combines these probabilities to make a prediction for the feature vector as a whole. The basic mathematical assumption of the model is therefore that the features in different dimensions are conditionally independent of each other given the class. The Naive Bayes model has a wide range of practical applications, especially in text classification tasks such as Internet news categorization and spam filtering.

The basic idea of Naive Bayes: for an item to be classified, compute the probability of each category given that this item appears, and assign the item to whichever category has the largest probability. In layman’s terms, it is like seeing a black man on the street: if I ask you where he is from, you will probably say Africa. Why? Because Africans make up the highest proportion of black people; of course he may also be American or Latin American, but in the absence of any other information we choose the category with the highest conditional probability. That is the basis of Naive Bayesian thinking.

Definition of Bayesian classification:
1. Let x = {a1, a2, …, am} be an item to be classified, where each ai is a feature attribute of x;
2. There is a category set C = {y1,y2, …, yn}
3. Calculate P(y1|x), P(y2|x),…, P(yn|x)
4. If P(yk|x)= max{ P(y1|x), P(y2|x), …, P(yn|x) }, then x∈yk.

The key is how to calculate the conditional probabilities in step 3. The procedure is as follows:
1. Find a set of items to be classified with a known classification, which is called a training sample set.
2. From the training samples, statistically estimate the conditional probability of each feature attribute under each category, that is:

   P(a1|y1), P(a2|y1), …, P(am|y1);  P(a1|y2), P(a2|y2), …, P(am|y2);  …;  P(a1|yn), P(a2|yn), …, P(am|yn)

3. If each feature attribute is conditionally independent given the class, then according to Bayes’ theorem we have the following derivation:

   P(yi|x) = P(x|yi) P(yi) / P(x)

Because the denominator is constant for all categories, we only need to maximize the numerator. And because the feature attributes are conditionally independent given the class, we have:

   P(x|yi) P(yi) = P(a1|yi) P(a2|yi) ⋯ P(am|yi) P(yi) = P(yi) ∏j P(aj|yi)
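
As a minimal illustration of this decision rule (a sketch only, with illustrative probability tables rather than the full implementation given in Section 3), the classifier multiplies each class prior by the per-feature conditional probabilities and picks the class with the largest score:

# Sketch of the Naive Bayes decision rule: score(y) = P(y) * product over j of P(aj|y);
# the probability tables below are illustrative numbers, not learned from data.
priors = {"Sports": 0.6, "Not Sports": 0.4}
cond_prob = {
    "Sports":     {"game": 0.12, "close": 0.04},
    "Not Sports": {"game": 0.04, "close": 0.09},
}

def classify(features):
    scores = {}
    for y, prior in priors.items():
        score = prior
        for a in features:
            score *= cond_prob[y].get(a, 1e-6)  # tiny default for features never seen with this class
        scores[y] = score
    return max(scores, key=scores.get)          # the class that maximizes P(y) * prod P(aj|y)

print(classify(["close", "game"]))              # -> Sports with these numbers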
Naive Bayes workflow:

The entire Naive Bayes classification process can be divided into three stages:

The first stage is the preparation stage. Its task is to make the necessary preparations for Naive Bayes classification: determine the feature attributes according to the specific problem, divide each feature attribute appropriately, and then manually label a portion of the items to form a training sample set. The input of this stage is all of the data to be classified, and the output is the feature attributes and the training samples. This is the only stage of the whole process that must be completed manually, and its quality has a major impact on the final result: the quality of the classifier is largely determined by the feature attributes, how they are divided, and the quality of the training samples.

The second stage is the classifier training stage. Its task is to generate the classifier: compute the frequency of each category in the training samples and the conditional probability estimate of each feature attribute division under each category, and record the results. Its input is the feature attributes and training samples, and its output is a classifier. This stage is mechanical and can be computed automatically by a program according to the formulas discussed above.

The third stage is the application stage. Its task is to use the classifier to classify new items: the input is the classifier and the items to be classified, and the output is the mapping between items and categories. This stage is also mechanical and is completed by the program.

2.2 Text Classifier

Let’s use this example to briefly introduce the Naive Bayes algorithm:
Suppose we are building a classifier that determines whether a text is about sports or not. Our training data consists of the following 5 sentences:

  text                              tag
  A great game                      Sports
  The election was over             Not sports
  Very clean match                  Sports
  A clean but forgettable game      Sports
  It was a close election           Not sports

Here text is the sentence itself and tag is its category, which is simply either Sports or Not sports.
Based on the classification of these five sentences, we will determine which category the sentence “A very close game” should be classified into.

As a probabilistic classifier, we want to calculate the probability that the sentence “A very close game” belongs to the Sports category and the probability that it belongs to Not sports. Once both are computed, we classify the sentence into whichever class has the higher probability.

The main steps of the calculation
(1) Create your features – digitize your features
To use a formula for mathematical calculation, the first thing we need is numbers, but a raw text does not give us numbers directly. So we first need to create features that can be expressed numerically and that reflect the information in the text.

For English, one of the easiest ways to digitize text is to compute the frequency of each word in the text. In other words, we simply ignore word order and sentence structure and treat each piece of text as a collection (a bag) of words. The features we extract are then the numbers of occurrences of the individual words. This method looks simple and crude, but it works surprisingly well.
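
As a minimal sketch of this bag-of-words idea, Python’s collections.Counter is already enough to turn a piece of text into word counts (the sentences here are taken from the training data of the example above):

from collections import Counter

# Ignore word order and sentence structure: just count how often each word occurs.
sports_text = "A great game" + " " + "A clean but forgettable game"
word_counts = Counter(sports_text.lower().split())
print(word_counts)  # a: 2, game: 2, and great/clean/but/forgettable: 1 each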

(2) Obtain the probability formula from Bayes theorem
In fact, there is no way to obtain P(Sports | ”A very close game”) directly from our data, but Bayes’ theorem provides a way to express this conditional probability:

P(A|B) = P(B|A) × P(A) / P(B)

So according to this formula, we can rewrite our P(Sports | ”A very close game”) as:

P(Sports | ”A very close game”) = P(”A very close game” | Sports) × P(Sports) / P(”A very close game”)

Since the denominator P(”A very close game”) is the same in the formulas for Sports and Not sports, we can discard it and compare the numerators directly, namely:

P(”A very close game” | Sports) × P(Sports)   versus   P(”A very close game” | Not sports) × P(Not sports)

The probabilities in these two expressions can be obtained directly from our data. For example, P(”A very close game” | Sports) is just the number of times the sentence “A very close game” appears among the Sports texts divided by the total number of Sports texts, and P(Sports) is just the number of Sports texts divided by the total number of texts. At this point we seem to have all the data we need.

However, there is still a problem: “A very close game” does not appear anywhere in our training data, so the probability computed this way is zero. The model would only work if every sentence we ever want to classify appeared in the training data, so we need a more general model.

(3) Simplify the model
Thus, a naive model emerges: we assume that each word in a sentence is independent of other words. This means that instead of looking at entire sentences, we look at individual words.

Based on this idea, we can write the following formula:

P(”A very close game”) = P(a) × P(very) × P(close) × P(game)

In fact, this assumption works quite well in practice, and it allows Naive Bayes models to perform well with small amounts of data and even with some mislabeled data. The next step is to apply it to what we derived before:

P(”A very close game” | Sports) = P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports)
P(”A very close game” | Not sports) = P(a | Not sports) × P(very | Not sports) × P(close | Not sports) × P(game | Not sports)

Now each of these individual words does occur in our training data (some of them several times), so we can actually calculate these probabilities.

(4) Calculate the final probability
At this point, we just need to calculate the probability of each category (Sports and not sports) and see which has a higher probability.

All of the numbers used below come directly from the training data.

First, we compute the prior probability of each category: for a sentence in our training data, the probability that it is Sports is P(Sports) = 3/5, and P(Not sports) = 2/5.

Then we calculate P(game | Sports), which is the number of occurrences of the word “game” in the Sports texts (2) divided by the total number of words in the Sports texts (11). Therefore we get:

P(game | Sports) = 2/11

Continuing like this we run into another problem: “close” does not appear in any of the Sports texts, so computed this way P(close | Sports) would be 0. That is a problem, because our final formula multiplies several probabilities together:

P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports)

In this way, our previous efforts will be in vain because of a word that never appeared.

Laplace smoothing solves this problem nicely: we add 1 to every count, so that no count is ever zero. To balance this, we add the number of possible words to the denominator (the number of distinct words occurring in the training samples; a word that occurs several times still counts only once here), so the resulting value can never exceed 1.
In our data, the possible words are:
['a', 'great', 'very', 'over', 'it', 'but', 'game', 'election', 'clean', 'close', 'the', 'was', 'forgettable', 'match'].

Since the number of all possible words is 14, applying smoothing gives, for example:

P(game | Sports) = (2 + 1) / (11 + 14) = 3/25

The final smoothed probabilities are (the Not sports texts contain 9 words in total, so their denominator is 9 + 14 = 23):

  P(a | Sports)     = (2+1)/25 = 3/25      P(a | Not sports)     = (1+1)/23 = 2/23
  P(very | Sports)  = (1+1)/25 = 2/25      P(very | Not sports)  = (0+1)/23 = 1/23
  P(close | Sports) = (0+1)/25 = 1/25      P(close | Not sports) = (1+1)/23 = 2/23
  P(game | Sports)  = (2+1)/25 = 3/25      P(game | Not sports)  = (0+1)/23 = 1/23

In this way, only the last step remains. We now compute both overall scores and see which is larger:

P(Sports) × P(a|Sports) × P(very|Sports) × P(close|Sports) × P(game|Sports) = 3/5 × 3/25 × 2/25 × 1/25 × 3/25 ≈ 0.0000276
P(Not sports) × P(a|Not sports) × P(very|Not sports) × P(close|Not sports) × P(game|Not sports) = 2/5 × 2/23 × 1/23 × 2/23 × 1/23 ≈ 0.00000572

It can be seen that 0.0000276>0.00000572, so our classifier classifies the text “A very close game” as text related to Sports.
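
The arithmetic above can be double-checked with a few lines of Python, using the smoothed probabilities from the previous step:

# Sports score: P(Sports) * P(a|Sports) * P(very|Sports) * P(close|Sports) * P(game|Sports)
p_sports = 3/5 * (3/25) * (2/25) * (1/25) * (3/25)
# Not sports score: P(Not sports) * P(a|Not sports) * P(very|Not sports) * P(close|Not sports) * P(game|Not sports)
p_not_sports = 2/5 * (2/23) * (1/23) * (2/23) * (1/23)
print(p_sports)      # about 2.76e-05
print(p_not_sports)  # about 5.72e-06
print("Sports" if p_sports > p_not_sports else "Not sports")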

2.3 Text classification of news texts

Text feature vectorization

  • To use the Naive Bayes model on text data, the text features must first be vectorized.
  • This article uses CountVectorizer for text feature vectorization.
  • CountVectorizer counts the number of occurrences of each word in a document (i.e. word frequencies).
  • The counting is done with CountVectorizer’s fit_transform() function; a minimal example follows below.
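
A minimal sketch of what CountVectorizer does, using two toy sentences (note: get_feature_names_out() requires scikit-learn 1.0 or newer; older versions use get_feature_names()):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["A great game", "A clean but forgettable game"]
vec = CountVectorizer()              # note: the default tokenizer drops single-character tokens such as "a"
X = vec.fit_transform(corpus)        # learn the vocabulary and count word occurrences
print(vec.get_feature_names_out())   # the learned vocabulary: ['but' 'clean' 'forgettable' 'game' 'great']
print(X.toarray())                   # one row of word counts per sentence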

Loading the news data and classifying the texts

  • In this case, the sklearn.datasets.fetch_20newsgroups function is used to download the news data (this is time-consuming).
  • The text classification itself is done with sklearn.naive_bayes.MultinomialNB.

Problem setup
This article uses the classic 20-category news text collection integrated in Scikit-learn as the experimental data.
Note that the data is not stored locally in advance; it is downloaded from the Internet at run time, and the download can be slow from within mainland China, so using a proxy may be necessary.
Let’s look at the data first:
from sklearn import datasets
news = datasets.fetch_20newsgroups(subset='all')
print(len(news.data))  # news.data is a list of raw text documents
print(news.data[0])

The output shows that the raw data is a list of 18846 texts covering a total of 20 categories. Since the raw data has been neither featurized nor vectorized, it needs further processing before it can be handed to the Naive Bayes classifier for learning.

3. Python code implementation

3.1 Text Classifier

1. Source code

'''Count how many times a word appears in the sentences of one class'''
def countWord(word, sentences):
    count = 0
    for sentence in sentences:
        for w in sentence.split():
            if word == w:
                count += 1
    print(word, ": ", count)
    return count

'''Calculate the (unnormalized) Naive Bayes score that the text belongs to the given class,
i.e. P(class) multiplied by the Laplace-smoothed P(word|class) of every word in the text'''
def calculateNBP(text, sentences, Pclass, numWordsInClass, vocabSize):
    text = text.lower()
    words = text.split(" ")
    pList = []
    for word in words:
        # Laplace smoothing: (count + 1) / (words in this class + vocabulary size)
        P = (1 + countWord(word, sentences)) / (numWordsInClass + vocabSize)
        pList.append(P)
    print(pList)
    NBP = 1
    for P in pList:
        NBP = NBP * P
    NBP = NBP * Pclass
    return NBP

'''Training data'''
texts = ["A great game", "The election was over", "Very clean match", "A clean but forgettable game", "It was a close election"]
Sports = ["A great game", "Very clean match", "A clean but forgettable game"]
NotSports = ["The election was over", "It was a close election"]

'''To lowercase'''
texts2 = [s.lower() for s in texts if isinstance(s, str)]
Sports2 = [s.lower() for s in Sports if isinstance(s, str)]
NotSports2 = [s.lower() for s in NotSports if isinstance(s, str)]

'''Calculate the priors P(Sports) and P(Not Sports)'''
Psports = len(Sports) / len(texts)
PnotSports = len(NotSports) / len(texts)

'''Build the vocabulary of all possible (distinct) words'''
allWords = []
for sentence in texts2:
    allWords = allWords + sentence.split(" ")
allWords = list(set(allWords))
lenAllWords = len(allWords)  # the number of all possible words

'''Count the total number of words in each class'''
numofSports = 0  # number of words in the Sports texts
for sentence in Sports2:
    numofSports += len(sentence.split())

numofNotSports = 0  # number of words in the Not Sports texts
for sentence in NotSports2:
    numofNotSports += len(sentence.split())

'''Classify a new sentence'''
text = "A very close game"
print("Calculating the Sports score: ---------------------------------------------")
pSports = calculateNBP(text, Sports2, Psports, numofSports, lenAllWords)
print("P(Sports) * P(text|Sports) = ", pSports)
print("Calculating the Not Sports score: -----------------------------------------")
pNotSports = calculateNBP(text, NotSports2, PnotSports, numofNotSports, lenAllWords)
print("P(Not Sports) * P(text|Not Sports) = ", pNotSports)
print("Classification result: ")
if pSports > pNotSports:
    print("Text '", text, "' is classified as Sports")
else:
    print("Text '", text, "' is classified as Not Sports")

2. Running results
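
Running the script prints the per-word counts and smoothed probabilities for each class, followed by a Sports score of roughly 0.0000276 and a Not Sports score of roughly 0.00000572, so “A very close game” is classified as Sports, matching the hand calculation in Section 2.2.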

3.2 News Text Classification

1. Source code

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

#retrieve data
news = datasets.fetch_20newsgroups(subset='all')

#Split the data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(news.data, news.target,
                                                    test_size=0.25, random_state=33)

#Import the text feature vector conversion module and vectorize the text features
vec = CountVectorizer()
X_train = vec.fit_transform(X_train)
X_test = vec.transform(X_test)  # transform only: reuse the vocabulary fitted on the training set

# Initialize the Naive Bayesian model with the default configuration
mnb = MultinomialNB()
#Use the training data to estimate the model parameters and make predictions
mnb.fit(X_train, y_train)
y_predict = mnb.predict(X_test)

# output prediction result
print('The Accuracy of Naive Bayes:', mnb.score(X_test, y_test))
print(classification_report(y_test, y_predict, target_names=news.target_names))

2. Running results

The results show that the classification accuracy of Naive Bayes on the 4712 test samples is about 83.977%, which is quite a good result.

Summary:
1. Naive Bayes is widely used in large-scale Internet text classification tasks.
2. Thanks to its strong assumption of conditional independence between features, the number of parameters that has to be estimated drops from exponential in the number of feature dimensions to linear, which greatly reduces memory consumption and computation time.
3. Constrained by the same strong assumption, the model cannot take relationships between features into account during training, so it performs poorly on classification tasks where the features are strongly correlated.

References

1. Simple Naive Bayes algorithm for English text classification (Python implementation), Akira_oono’s blog, CSDN: https://blog.csdn.net/weixin_44405843/article/details/109365790
2. sk-learn example – classifying text with the Naive Bayes algorithm, Zhang Daqian 09’s blog, CSDN
3. MonkeyLearn blog: https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/