Naive Bayes – spam classification


Article Directory

  • Foreword
  • 1. Bayes' formula
  • 2. Naive Bayes formula
  • 3. Code
  • 4. Classifying our own email
  • Summary

Foreword

Has spam been bothering you? Bayes' formula gives you the probability that an incoming email is junk, sparing you the annoyance of opening it only to find out that it is spam!

1. Bayes' formula

Bayes’ formula is used to describe the relationship between two conditional probabilities, such as P(A|B) and P(B|A).

(1) Conditional probability formula

Let A and B be two events with P(B) > 0. Given that event B has occurred, the conditional probability that event A occurs is:

P(A|B)=P(AB)/P(B)

(2) The multiplication formula is obtained from the conditional probability formula: P(AB)=P(A|B)P(B)=P(B|A)P(A)

P(A|B) = P(AB)/P(B) can be transformed into:

P(A|B)=\frac{P(B|A)\cdot P(A)}{P(B)}

This is Bayes' formula.

Prior probability, P(A): the probability of A estimated from past experience, before event B is observed. For example, the probability that a newly received email is spam, before reading it.

Posterior probability, P(A|B): the re-evaluated probability of A after B has occurred. For example, the probability that the email is spam given that it contains a certain word.

(3) Total probability formula:

P(A)=P(A|B_{1})\cdot P(B_{1}) + P(A|B_{2})\cdot P(B_{2}) + \cdots + P(A|B_{n})\cdot P(B_{n})

where B_1, B_2, ..., B_n form a partition of the sample space.
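For example, here is a minimal numeric sketch (all probabilities are made up for illustration): A is the event "the email is spam" and B is the event "the email contains a certain word".

p_spam = 0.3                 # P(A): prior probability that an email is spam
p_word_given_spam = 0.6      # P(B|A): the word appears in a spam email
p_word_given_ham = 0.05      # P(B|not A): the word appears in a normal email

# Total probability formula: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' formula: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)     # about 0.837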

2. Naive Bayes formula

The difference between naive Bayes and Bayes is that naive Bayes adopts the attribute conditional independence assumption, i.e. each attribute is assumed to influence the result independently of the others.

Formula:
Let x = (x_1, x_2, \dots, x_d) be a sample described by d attributes, and let c be a class label. Under the attribute conditional independence assumption, Bayes' formula becomes

P(c|x)=\frac{P(c)\,P(x|c)}{P(x)}=\frac{P(c)}{P(x)}\prod_{i=1}^{d}P(x_{i}|c)

where x_i is the value of x on the i-th attribute.

Because the training data may be incomplete, some of the counts can be 0, and a single zero factor makes the whole product 0 and directly flips the result. The Laplace correction (smoothing) avoids this; the corrected estimate is:

P(x_{i}|c)=\frac{|D_{c,x_{i}}|+1}{|D_{c}|+N_{i}}

where N_i is the number of possible values of the i-th attribute, D_c is the set of training samples of class c, and D_{c,x_i} is the subset of D_c whose i-th attribute takes the value x_i.

Because all of these probabilities are small decimals, multiplying many of them together can underflow to zero. Taking logarithms turns the product into a sum and prevents underflow:

\ln(ab)=\ln a + \ln b
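A minimal sketch (the counts are made up) of how the Laplace correction and the logarithm trick combine:

import numpy as np

# Made-up counts: counts[i] = occurrences of word i in the spam class; one count is 0
counts = np.array([40, 0, 7, 12])
n_spam_words = 200        # |D_c|: total word count in the spam class
n_values = 2              # N_i: each word attribute has two values (present / absent)

# Laplace correction: (count + 1) / (|D_c| + N_i) keeps every probability above 0
p_word_given_spam = (counts + 1) / (n_spam_words + n_values)

# Logarithms: a sum of logs replaces a product of tiny numbers
log_likelihood = np.sum(np.log(p_word_given_spam))
print(np.exp(log_likelihood))   # same value as the direct product, without the underflow risk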

3. Code

The data set used in this experiment comes from:

https://github.com/Jack-kui/machine-learning/tree/2272a56dcf13c98b8a3946eede2fb4fa02cb7c4b/Naive Bayes/email

The data set is divided into spam (junk) and ham (normal) emails.

Sample spam email and sample ham email (screenshots omitted).

Code implementation:

1. Create a vocabulary list

import numpy as np
import re
import random

#Create a deduplicated vocabulary list
def createwordList(dataSet):
    #dataSet: a data set containing multiple documents
    vocabSet = set([])  #deduplicated vocabulary set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  #take the union of the two sets
    return list(vocabSet)
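A quick usage sketch with two tiny hand-made documents:

docs = [['buy', 'cheap', 'pills'], ['meeting', 'at', 'noon', 'buy']]
vocab = createwordList(docs)
print(vocab)   # the six distinct words, in some order, e.g. ['buy', 'cheap', 'pills', 'meeting', 'at', 'noon']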

2. Build the word set and bag-of-words models

Word set model: a set of words. A set holds each element only once, so each word is recorded at most once; the model only records whether a word appears.

Bag-of-words model: built on top of the word set; if a word appears more than once in the document, its number of occurrences (frequency) is counted.

Difference: the bag-of-words model additionally keeps track of how many times each word appears, not just whether it appears.

#Word set model
def setOfWords(vocabList,inputSet):
    #vocabList: deduplicated vocabulary list
    #inputSet: input document
    returnVec = [0]*len(vocabList)  #create an all-0 vector with the same length as the vocabulary list
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  #if the word is in the vocabulary, set the corresponding entry to 1
        else:
            print("the word %s is not in my vocabulary!" % word)
    return returnVec

#Bag-of-words model
def bagOfWords(vocabList,inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1  #count every occurrence instead of just marking presence
    return returnVec
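A quick sketch of the difference between the two models (vocabulary and document are made up):

vocab = ['buy', 'cheap', 'pills', 'meeting']
doc = ['buy', 'cheap', 'buy']
print(setOfWords(vocab, doc))   # [1, 1, 0, 0] - only records presence
print(bagOfWords(vocab, doc))   # [2, 1, 0, 0] - counts every occurrence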

3. Train the naive Bayes classifier

#Train the naive Bayes classifier
def trainBYS(trainMatrix,trainClasses):
    #trainMatrix: matrix made up of the returnVec of each document
    #trainClasses: the class of each document, 1 = insult (spam) class, 0 = normal class

    numTrainDocs=len(trainMatrix)  #total number of documents
    numWords=len(trainMatrix[0])   #length of each document vector (size of the vocabulary)
    p_1=sum(trainClasses)/float(numTrainDocs)  #probability that a document belongs to the spam class
    #Laplace correction: initialize counts to 1 so that no conditional probability is ever 0
    p0Num=np.ones(numWords)  #numerator: occurrence count of each word, initialized to 1
    p1Num=np.ones(numWords)
    p0Denom=2.0  #denominator: total word count, initialized to the number of attribute values (2)
    p1Denom=2.0

    for i in range(numTrainDocs):  #traverse the training samples
        if trainClasses[i]==1:
            p1Num += trainMatrix[i]         #word counts in the spam class
            p1Denom += sum(trainMatrix[i])  #total number of words in the spam class
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])

    p1Vect=np.log(p1Num/p1Denom)  #take the logarithm to prevent underflow later
    p0Vect=np.log(p0Num/p0Denom)
    #print(p0Vect)
    return p0Vect,p1Vect,p_1

4. Classify documents

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    '''
    Classify a test document

    Parameters:
    vec2Classify: test document vector
    p0Vec: log conditional probabilities of each word in the normal class
    p1Vec: log conditional probabilities of each word in the spam class
    pClass1: probability of the spam class in the whole sample
    '''
    #log P(c) plus the sum of log P(x_i|c); the class with the larger value wins
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
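A toy end-to-end sketch (two hand-made training documents) that ties createwordList, bagOfWords, trainBYS and classifyNB together:

docs = [['cheap', 'pills', 'buy'], ['meeting', 'at', 'noon']]
classes = [1, 0]                       # 1 = spam, 0 = normal
vocab = createwordList(docs)
trainMat = [bagOfWords(vocab, d) for d in docs]
p0V, p1V, pSpam = trainBYS(np.array(trainMat), np.array(classes))
print(classifyNB(np.array(bagOfWords(vocab, ['buy', 'cheap'])), p0V, p1V, pSpam))  # expected 1 (spam)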

5. Parse the text into lowercase word lists

#Convert a string into a list of lowercase tokens
def textParse(bigString):
    '''
    Parameter:
    bigString: input string

    Return:
    a list of lowercase tokens longer than two characters
    '''

    #split the string on special symbols, i.e. anything that is not a letter or a digit
    listOfTokens = re.split(r'\W+', bigString)
    #drop very short tokens (such as single letters) and convert the rest to lowercase
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
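For example, on a hand-made string:

print(textParse("Buy CHEAP pills now!! Visit http://example.com"))
# ['buy', 'cheap', 'pills', 'now', 'visit', 'http', 'example', 'com']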

6. Test the classifier

#Test the naive Bayes classifier
def spamTest(method='bag'):
    #method selects the model: 'bag' = bag-of-words, 'set' = word set
    
    if method == 'bag': #Determine whether to use the bag of words model or the word set model
        words2Vec = bagOfWords
    elif method == 'set':
        words2Vec = setOfWords
    
    docList=[]
    classList=[]

    #traverse the two email folders
    for i in range(1,26):
        #read a spam email and convert it to a word list
        wordList=textParse(open('C:/Users/kid/Desktop/ML/machine-learning-master/Naive Bayes/email/spam/%d.txt' % i,'r').read())
        #append the word list to the document list and label it 1 (spam)
        docList.append(wordList)
        classList.append(1)
        #read a normal (ham) email and label it 0
        wordList=textParse(open('C:/Users/kid/Desktop/ML/machine-learning-master/Naive Bayes/email/ham/%d.txt' % i,'r').read())
        docList.append(wordList)
        classList.append(0)


    #Create a deduplication vocabulary
    vocabList=createwordList(docList)
    #print(vocabList)

    #Create index
    trainSet=list(range(50))
    testSet=[]

    #split off the test set
    for i in range(10):
        #randomly pick an index for the test set and remove it from the training indices
        randIndex=int(random.uniform(0,len(trainSet)))
        testSet.append(trainSet[randIndex])
        del(trainSet[randIndex])

    #Create training set matrix and category vector
    trainMat=[]
    trainClasses=[]

    for docIndex in trainSet:
        #add the generated word vector to the training matrix and record its class
        trainMat.append(words2Vec(vocabList,docList[docIndex]))
        trainClasses.append(classList[docIndex])

    #train the naive Bayes classifier
    p0V,p1V,pSpam=trainBYS(np.array(trainMat),np.array(trainClasses))

    #error counter
    error=0

    for docIndex in testSet:
        wordVec=words2Vec(vocabList,docList[docIndex])
        #classify the test document and compare with its true class
        if classifyNB(np.array(wordVec),p0V,p1V,pSpam)!=classList[docIndex]:
            error += 1
            #print("Wrongly classified:",docList[docIndex])
    #print("Error rate: %.2f%%" % (float(error)/len(testSet)*100))
    errorRate=float(error)/len(testSet)  #classification error rate
    return errorRate

7. Run the experiment

if __name__ == "__main__":
    total=100
    print('Training with the bag-of-words model:')
    sum_bag_error = 0
    for i in range(total):
        sum_bag_error += spamTest(method='bag')
    print('Average error rate over ' + str(total) + ' runs with the bag-of-words model: ' + str(sum_bag_error / total))

The resulting average error rate is shown in the screenshot (omitted).

4. Classifying our own email

After training naive Bayes on the existing emails, we can use the classifier to judge whether an email we collected ourselves is spam.

The content of the email is shown in the screenshot (omitted). Read it in as input:

#Read in our own test email
testList=textParse(open('C:/Users/kid/Desktop/ML/machine-learning-master/Naive Bayes/email/testy/1.txt','r').read())

Convert it into a word vector using the deduplicated vocabulary from training:

wordVec=words2Vec(vocabList,testList)

Run the classifier to decide whether it is spam (class 1) or normal (class 0); the three fragments are combined into a complete sketch below:

if classifyNB(np.array(wordVec),p0V,p1V,pSpam)!=1:
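Putting the three fragments together, a minimal sketch (it assumes vocabList, p0V, p1V and pSpam are still available from the training step, and uses bagOfWords as the vectorizer):

#Classify our own test email (assumes vocabList, p0V, p1V, pSpam come from the training step above)
testList = textParse(open('C:/Users/kid/Desktop/ML/machine-learning-master/Naive Bayes/email/testy/1.txt','r').read())
wordVec = bagOfWords(vocabList, testList)
if classifyNB(np.array(wordVec), p0V, p1V, pSpam) != 1:
    print("This email is normal")
else:
    print("This email is spam")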

Judgment result:

The reported error rate is 0, and the email is judged to be spam (class 1).

Summary

English documents can now be classified; next we will work on implementing classification for Chinese documents.