Article Directory
- Preface
- 1. Bayes' Formula
- 2. Naive Bayes
- 3. Code
- 4. Testing Your Own Email
- Summary
Preface
Has spam been bothering you? Bayes' formula lets you estimate the probability that an incoming email is junk, sparing you the annoyance of opening a message only to find that it is garbage.
1. Bayes' Formula
Bayes' formula describes the relationship between two conditional probabilities, such as P(A|B) and P(B|A).
(1) Conditional probability formula
Suppose A and B are two events with P(B) > 0. The conditional probability that event A occurs given that event B has occurred is:
$$P(A\mid B)=\frac{P(AB)}{P(B)}$$
(2) Multiplication formula, obtained from the conditional probability formula:
$$P(AB)=P(A\mid B)\,P(B)=P(B\mid A)\,P(A)$$
Substituting this into P(A|B) = P(AB)/P(B) gives
$$P(A\mid B)=\frac{P(B\mid A)\,P(A)}{P(B)}$$
which is Bayes' formula.
Prior probability: P(A), the probability of A estimated from past experience, before event B is observed. For example, the probability that a newly received email is spam, before its contents are examined.
Posterior probability: P(A|B), the re-evaluated probability of A after B has occurred. For example, the probability that a received email is spam given that it contains a certain word.
(3) Total probability formula: if the events A1, A2, ..., An partition the sample space, then
$$P(B)=\sum_{i=1}^{n}P(B\mid A_i)\,P(A_i)$$
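A small worked example (all numbers invented for illustration): suppose 30% of incoming mail is spam, the word "offer" appears in 40% of spam, and it appears in 5% of ham. By the total probability formula,
$$P(\text{offer}) = 0.4\times 0.3 + 0.05\times 0.7 = 0.155$$
and Bayes' formula then gives
$$P(\text{spam}\mid \text{offer}) = \frac{0.4\times 0.3}{0.155} \approx 0.774$$
so an email containing "offer" is spam with probability of roughly 77%.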
2. Naive Bayes
Naive Bayes differs from plain Bayes in that it adopts the attribute conditional independence assumption: each attribute is assumed to influence the result independently of the others.
Formula: let d denote the number of attributes and x_i the value of sample x on the i-th attribute. Under the independence assumption, Bayes' formula can be rewritten as
$$P(c\mid x)=\frac{P(c)\,P(x\mid c)}{P(x)}=\frac{P(c)}{P(x)}\prod_{i=1}^{d}P(x_i\mid c)$$
Because the data collection may be incomplete, some counts can be 0, and a single zero factor wipes out the whole product and flips the result. Laplace correction (smoothing) fixes this. Let N be the number of possible classes in the training set and N_i the number of possible values of the i-th attribute; then
$$\hat P(c)=\frac{|D_c|+1}{|D|+N},\qquad \hat P(x_i\mid c)=\frac{|D_{c,x_i}|+1}{|D_c|+N_i}$$
where |D_c| is the number of training samples of class c and |D_{c,x_i}| is the number of class-c samples whose i-th attribute takes the value x_i.
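To see why this matters (counts invented for illustration): if a word never occurs in the 25 spam training emails, the uncorrected estimate is 0/25 = 0, which forces the whole product to 0 no matter what the other words say. With Laplace correction and N_i = 2 (the word is either present or absent), the estimate becomes (0+1)/(25+2) = 1/27 instead: small, but nonzero.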
Because all the estimated probabilities are decimals less than 1, multiplying many of them together can underflow to zero during the final multiplication. Taking logarithms turns the product into a sum and prevents underflow:
$$\log\Big(\prod_i P(x_i\mid c)\Big)=\sum_i \log P(x_i\mid c)$$
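A quick sketch of the underflow problem in Python (the values are chosen only to demonstrate the effect):

```python
import numpy as np

# 200 small conditional probabilities: the raw product underflows to 0.0,
# while the sum of logs stays perfectly representable.
probs = np.full(200, 1e-3)
print(np.prod(probs))         # 0.0  (1e-600 underflows a double)
print(np.sum(np.log(probs)))  # about -1381.55, usable for comparing classes
```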
3. Code
The data set used in this experiment comes from:
https://github.com/Jack-kui/machine-learning/tree/2272a56dcf13c98b8a3946eede2fb4fa02cb7c4b/Naive Bayes/email
The data set is divided into spam (junk mail) and ham (normal mail). An example spam email and an example ham email were shown as screenshots in the original post.
Code implementation:
1. Create a vocabulary list

```python
# Imports used by the code in this section
import re
import random
import numpy as np

# Build a deduplicated vocabulary list
def createwordList(dataSet):
    # dataSet: a data set containing multiple documents (each a list of words)
    vocabSet = set([])  # deduplicated vocabulary
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # take the union
    return list(vocabSet)
```
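A quick usage sketch (the two toy documents are invented for illustration):

```python
docs = [['hello', 'free', 'offer'], ['hello', 'meeting']]
vocab = createwordList(docs)
print(vocab)  # e.g. ['offer', 'meeting', 'free', 'hello'] - set order is arbitrary
```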
2. Build the word-set and bag-of-words models
Word-set model: a set of words. A set contains each element at most once, so every word is recorded only as present or absent.
Bag-of-words model: builds on the word set, but if a word appears more than once in a document, its number of occurrences (frequency) is counted.
Difference: the bag-of-words model also cares how many times each word occurs, not just whether it occurs.
```python
# Word-set model
def setOfWords(vocabList, inputSet):
    # vocabList: deduplicated vocabulary list
    # inputSet: input document (list of words)
    returnVec = [0] * len(vocabList)  # all-zero vector as long as the vocabulary
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  # mark the word as present
        else:
            print("the word %s is not in the vocabulary" % word)
    return returnVec

# Bag-of-words model
def bagOfWords(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1  # count occurrences
    return returnVec
```
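A small comparison of the two models (toy vocabulary and document invented for illustration):

```python
vocab = ['free', 'hello', 'offer']
doc = ['free', 'free', 'offer']
print(setOfWords(vocab, doc))  # [1, 0, 1] - presence only
print(bagOfWords(vocab, doc))  # [2, 0, 1] - occurrence counts
```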
3. Train the naive Bayes classifier

```python
# Train naive Bayes
def trainBYS(trainMatrix, trainClasses):
    # trainMatrix: matrix whose rows are the returnVec vectors
    # trainClasses: class of each row, 1 = spam, 0 = ham
    numTrainDocs = len(trainMatrix)   # total number of documents
    numWords = len(trainMatrix[0])    # number of words per document vector
    p_1 = sum(trainClasses) / float(numTrainDocs)  # proportion of spam documents
    # Laplace correction; the logarithms below prevent underflow
    p0Num = np.ones(numWords)  # per-word numerator counts, initialized to 1
    p1Num = np.ones(numWords)
    p0Denom = 2.0  # denominator totals, initialized to the number of classes, 2
    p1Denom = 2.0
    for i in range(numTrainDocs):  # iterate over the training samples
        if trainClasses[i] == 1:
            p1Num += trainMatrix[i]         # word counts in the spam class
            p1Denom += sum(trainMatrix[i])  # total number of spam words
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = np.log(p1Num / p1Denom)  # take the logarithm
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, p_1
```
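A toy training run (the matrix and labels are invented for illustration):

```python
# Two documents over a three-word vocabulary; first labeled spam (1), second ham (0)
trainMat = np.array([[2, 0, 1],
                     [0, 1, 0]])
p0V, p1V, pSpam = trainBYS(trainMat, np.array([1, 0]))
print(pSpam)  # 0.5 - half of the toy documents are spam
print(p1V)    # log conditional probability of each word in the spam class
```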
4. Classify documents

```python
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    '''
    Classify a test document.
    Parameters:
        vec2Classify: test document vector
        p0Vec: log conditional probabilities of each word in the ham class
        p1Vec: log conditional probabilities of each word in the spam class
        pClass1: proportion of spam documents in the training sample
    '''
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
```
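Why these two sums work: P(x) is identical for both classes, so it can be dropped from the comparison, and taking the logarithm of the naive Bayes product turns it into a sum:
$$\log\Big(P(c)\prod_i P(x_i\mid c)\Big)=\log P(c)+\sum_i\log P(x_i\mid c)$$
Multiplying by the document vector keeps only (and, in the bag-of-words case, weights by frequency) the log probabilities of the words that actually occur in the document.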
5. Parse text into lowercase word lists

```python
# Split a string into a list of lowercase tokens
def textParse(bigString):
    '''
    Parameter:
        bigString: input string
    Return:
        a list of lowercase tokens
    '''
    # Split the string on runs of non-alphanumeric characters
    listOfTokens = re.split(r'\W+', bigString)
    # Keep only tokens longer than two characters, lowercased
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
```
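For example (the input string is invented for illustration):

```python
print(textParse('Hi! Claim your FREE offer now...'))
# ['claim', 'your', 'free', 'offer', 'now'] - 'Hi' is dropped for being too short
```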
6. Test the classifier
```python
# Test the naive Bayes classifier
def spamTest(method='bag'):
    # method: 'bag' for the bag-of-words model, 'set' for the word-set model
    if method == 'bag':
        words2Vec = bagOfWords
    elif method == 'set':
        words2Vec = setOfWords
    else:
        raise ValueError("method must be 'bag' or 'set'")
    docList = []
    classList = []
    # Traverse the data folders
    for i in range(1, 26):
        # Read a spam email and parse it into a word list
        wordList = textParse(open('C:/Users/kid/Desktop/ML/machine-learning-master/Naive Bayes/email/spam/%d.txt' % i, 'r').read())
        # Record the document and label it 1 (spam)
        docList.append(wordList)
        classList.append(1)
        # Read a normal (ham) email and label it 0
        wordList = textParse(open('C:/Users/kid/Desktop/ML/machine-learning-master/Naive Bayes/email/ham/%d.txt' % i, 'r').read())
        docList.append(wordList)
        classList.append(0)
    # Build the deduplicated vocabulary
    vocabList = createwordList(docList)
    # Build the index list and split off a test set
    trainSet = list(range(50))
    testSet = []
    for i in range(10):
        # Randomly move ten indices from the training set to the test set
        randIndex = int(random.uniform(0, len(trainSet)))
        testSet.append(trainSet[randIndex])
        del(trainSet[randIndex])
    # Build the training matrix and class vector
    trainMat = []
    trainClasses = []
    for docIndex in trainSet:
        trainMat.append(words2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    # Train naive Bayes
    p0V, p1V, pSpam = trainBYS(np.array(trainMat), np.array(trainClasses))
    # Count classification errors on the test set
    error = 0
    for docIndex in testSet:
        wordVec = words2Vec(vocabList, docList[docIndex])
        if classifyNB(np.array(wordVec), p0V, p1V, pSpam) != classList[docIndex]:
            error += 1
    errorRate = float(error) / len(testSet)  # classification error rate
    return errorRate
```
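Because the ten test emails are drawn at random, the error rate of a single run varies from run to run; for example (the output will differ between runs):

```python
print(spamTest(method='set'))  # e.g. 0.0 or 0.1, depending on the random split
```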
7. Run the experiment

```python
if __name__ == "__main__":
    total = 100
    print('Training with the bag-of-words model:')
    sum_bag_error = 0
    for i in range(total):
        sum_bag_error += spamTest(method='bag')
    print('Average error rate over ' + str(total) + ' runs with the bag-of-words model: ' + str(sum_bag_error / total))
```
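The same loop can be repeated with method='set' to compare against the word-set model (this comparison is an addition, not part of the original run):

```python
    # Append inside the __main__ block above to also average the word-set model
    sum_set_error = 0
    for i in range(total):
        sum_set_error += spamTest(method='set')
    print('Average error rate over ' + str(total) + ' runs with the word-set model: ' + str(sum_set_error / total))
```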
The result (the printed average error rate) was shown as a screenshot in the original post.
4. Testing Your Own Email
After training naive Bayes on the existing emails, we can judge whether an email we collected ourselves is spam.
The content of the email was shown as a screenshot in the original post. Read it in as input:
```python
# Read in your own test email
testList = textParse(open('C:/Users/kid/Desktop/ML/machine-learning-master/Naive Bayes/email/testy/1.txt', 'r').read())
```
Convert it to a vector using the deduplicated training vocabulary:

```python
wordVec = words2Vec(vocabList, testList)
```
Run the classifier to decide whether the email is spam (class 1) or normal (class 0):

```python
if classifyNB(np.array(wordVec), p0V, p1V, pSpam) != 1:
    print('normal email')  # illustrative output; the original body was not shown
```
Judgment result: the classifier returns class 1, so the email is judged to be spam.
Summary
This classifier works on English documents; implementing Chinese text classification is the next goal.