Python Implementation of Spam SMS Identification and Classification Based on Data Mining

Table of contents
Summary
1. Overview
2. Related work
3. Data Analysis
4. Research Method
4.1 Logistic Regression
4.2 Support Vector Machine (SVM)
4.3 Decision Tree
4.4 Gradient boosting decision tree (GBDT)
5. Experimental design
5.1 Logistic Regression Model
5.1.1 Logistic regression classifier training
5.1.2 Experimental results and analysis
5.2 Support Vector Machine model
5.2.1 Data preprocessing
5.2.2 Training model
5.2.3 Model evaluation
5.3 Decision Tree model
5.3.1 Decision tree classifier training
5.3.2 Experimental results and analysis
5.4 Gradient boosting decision tree (GBDT) model
5.4.1 GBDT classifier training
5.4.2 Experimental results and analysis
5.5 Spam SMS identification system
6. Summary
References
Text classification can be traced back to the 1960s; before then, documents were mainly classified by hand. In the 1960s Maron published the landmark paper “Automatic Indexing: An Experimental Inquiry” [2], which applied the Bayesian formula to text classification and greatly advanced the field. In that paper Maron also assumed that features are independent of one another, the “Bayesian assumption” that was later widely adopted.
In the following two decades, text classification relied mainly on knowledge engineering (KE), in which a classifier is built by manually writing a set of classification rules based on expert knowledge. This approach requires the participation of many domain experts and engineers and therefore consumes considerable manpower and material resources; as electronic documents grow rapidly in number, it cannot keep up with demand. The most typical application of this method is the CONSTRUE system [3] developed by the Carnegie Group, which was used to automatically classify Reuters press releases.
By the 1990s, the rapid growth of the Internet created a need to process large volumes of electronic documents, and advances in artificial intelligence, machine learning, pattern recognition, and statistical theory meant that knowledge-engineering approaches to text classification gradually withdrew from the stage of history, ushering in the era of automatic classification. An automatic text classification system based on machine learning can achieve accuracy close to that of human experts without requiring the intervention of knowledge engineers or domain experts, saving a great deal of manpower while classifying far faster than human experts can.
Commonly used text classification algorithms fall into three main categories. The first category comprises algorithms based on probability and information theory, such as Naive Bayes and Maximum Entropy. The second category comprises algorithms based on TF-IDF weight calculation, including the Rocchio algorithm, the TF-IDF algorithm, and the k-Nearest Neighbors algorithm. The third category comprises algorithms based on knowledge learning, such as Decision Trees, Artificial Neural Networks, Support Vector Machines (SVM), and Logistic Regression.
3. Data analysis
Before classifying the data, we first analyzed it. The labeled spam SMS dataset contains a total of 800,000 messages, of which 80,000 are spam and the remaining 720,000 are non-spam. The imbalance between positive and negative samples is therefore severe.
Upsampling tends to cause overfitting, while downsampling would discard too much of the experimental data. Our approach is instead to change the cost of misclassification: different classes are assigned different weights, so that the misclassification costs of the two categories differ. We give spam messages a higher weight, making them more expensive to misclassify during training. Based on experience, we set the weight ratio of spam to non-spam at 9:1, as sketched below.
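As an illustration, the following minimal sketch shows how such a 9:1 class weight could be passed to a scikit-learn classifier; logistic regression is used here only as an example, and the variable names and hyperparameters are assumptions rather than the exact training configuration used in the experiments.

# Minimal sketch (not the exact training code): penalize misclassified spam
# messages nine times more heavily than misclassified normal messages.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(class_weight={1: 9, 0: 1})   # 1 = spam, 0 = non-spam
# clf.fit(X_train, y_train)   # X_train: TF-IDF features, y_train: 0/1 labels (assumed names)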

# -*- coding: utf-8 -*-
# @Date : 2018/11/07
# @Author: xiaoliang8006

import jieba
import jieba.posseg as pseg
import sklearn.feature_extraction.text
import joblib          # sklearn.externals.joblib has been removed from recent scikit-learn releases
import pickle          # cPickle exists only in Python 2; pickle is the Python 3 equivalent
from scipy import sparse, io
import sys, os
from time import time
import warnings
# generate word vectors using TF-IDF weighting
class TfidfVectorizer(sklearn.feature_extraction.text.TfidfVectorizer):
    def build_analyzer(self):
        def analyzer(doc):
            # POS-tag with jieba and drop punctuation/non-word tokens (flag 'x')
            words = pseg.cut(doc)
            new_doc = ''.join(w.word for w in words if w.flag != 'x')
            # re-segment the cleaned text; the resulting tokens feed the TF-IDF weighting
            words = jieba.cut(new_doc)
            return words
        return analyzer
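
# The vectorizer and model files loaded below are assumed to have been produced
# by a separate training script, roughly along these lines (a sketch, not the
# exact training code used in the experiments):
#     vectorizer = TfidfVectorizer()                    # the subclass defined above
#     X_train = vectorizer.fit_transform(train_texts)   # train_texts: list of SMS strings (assumed name)
#     joblib.dump(vectorizer, "SPAM_CLASSIFY_online/Data/Myvectorizer.m")
#     joblib.dump(trained_model, "SPAM_CLASSIFY_online/model/LR_model.m")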

# *************** getting information ***************
gpus = sys.argv[1]          # the SMS text to classify, passed as a command-line argument
text = [gpus]
# Module 2: prediction information
# Read the short message to be predicted into X1
X1 = []
X2 = []
#f = open('test.txt')
X1.append(gpus)
# Segment each message with jieba and store the space-joined tokens in X2
for i in range(len(X1)):
    words = pseg.cut(X1[i])
    str1 = ""
    for key in words:
        str1 += key.word
        str1 += ' '
    X2.append(str1)  # segmented SMS content

# ****************************** LR ******************************
start2 = time()
warnings.filterwarnings("ignore")  # ignore warnings such as version incompatibility
model = joblib.load("SPAM_CLASSIFY_online/model/LR_model.m")            # pre-trained logistic regression model
vectorizer = joblib.load("SPAM_CLASSIFY_online/Data/Myvectorizer.m")    # fitted TF-IDF vectorizer
x_demand_prediction = vectorizer.transform(X2)
y_predict = model.predict(x_demand_prediction)
end2 = time()
# output (HTML-formatted result)
for i in range(len(X1)):
    if int(y_predict[i]) == 0:
        print("&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<font size=5 weight=700> LR: </font> <font color=green size=5 weight=700>Non-spam message</font> Time used: %0.3fs</br>" % (end2 - start2))
    else:
        print("&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<font size=5 weight=700> LR: </font> <font color=red size=5 weight=700>Spam message!</font> Time used: %0.3fs</br>" % (end2 - start2))

# ****************************** DT ******************************
start4 = time()
model = joblib.load("SPAM_CLASSIFY_online/model/dtree_py2_final.m")      # pre-trained decision tree model
vectorizer = joblib.load("SPAM_CLASSIFY_online/Data/tfidf_py2_final.m")  # fitted TF-IDF vectorizer
x_demand_prediction = vectorizer.transform(X2)
y_predict = model.predict(x_demand_prediction)
end4 = time()
# output (HTML-formatted result)
for i in range(len(X1)):
    if int(y_predict[i]) == 0:
        print("&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<font size=5 weight=700> DT: </font> <font color=green size=5 weight=700>Non-spam message</font> Time used: %0.3fs</br>" % (end4 - start4))
    else:
        print("&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<font size=5 weight=700> DT: </font> <font color=red size=5 weight=700>Spam message!</font> Time used: %0.3fs</br>" % (end4 - start4))


# ****************************** SVM ******************************
start = time()
vec_tfidf = joblib.load("SPAM_CLASSIFY_online/Data/vec_tfidf")      # fitted TF-IDF vectorizer (note the path)
data_tfidf = vec_tfidf.transform(text)
#data_tfidf = vec_tfidf.fit_transform(text)
#model = pickle.load(open('model/SVM_sklearn.pkl', 'rb'))
modelb = joblib.load('SPAM_CLASSIFY_online/model/SVM_sklearn.pkl')  # pre-trained SVM model

predict = modelb.predict(data_tfidf)
end = time()
if predict[0] == "0":
    print("&nbsp;<font size=5 weight=700> SVM: </font> <font color=green size=5 weight=700>Non-spam message</font> Time used: %0.3fs</br>" % (end - start))
elif predict[0] == "1":
    print("&nbsp;<font size=5 weight=700> SVM: </font> <font color=red size=5 weight=700>Spam message!</font> Time used: %0.3fs</br>" % (end - start))

# ****************************** GBDT ******************************
start3 = time()
model = joblib.load('SPAM_CLASSIFY_online/model/gbdt_s.pkl')        # pre-trained GBDT model
vectorizer = joblib.load("SPAM_CLASSIFY_online/Data/vec_tfidf_s")   # fitted TF-IDF vectorizer
x_demand_prediction = vectorizer.transform(X2)
y_predict = model.predict(x_demand_prediction)
end3 = time()
# output (HTML-formatted result)
for i in range(len(X1)):
    if int(y_predict[i]) == 0:
        print("<font size=5 weight=700> GBDT: </font> <font color=green size=5 weight=700>Non-spam message</font> Time used: %0.3fs</br>" % (end3 - start3))
    else:
        print("<font size=5 weight=700> GBDT: </font> <font color=red size=5 weight=700>Spam message!</font> Time used: %0.3fs</br>" % (end3 - start3))