1. Requirements analysis
Using the previously trained model, test standard sample cards of different grades.
There are 48 test samples, each described by six indicators: number of pills, total pilling area, maximum pilling area, average pilling area, contrast, and optical volume. These six indicators are used to determine the final grade.
The general structure of the dataset fiber.csv is as follows:
(I collected this dataset through my own tests. It is personal data, so I will not share it publicly here. Thank you for your understanding.)
Notes on the CSV format:
N,S,Max_s,Aver_s,C,V,Grade
The header line has no space at the end.
27,111542.5,38299.5,4131.2,31.91,3559537.61,1(space)
In the data row there is a space after the final 1. Pay attention to this!
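Whether that trailing space matters depends on how the file is parsed. As a minimal sketch only (the rest of this article reads the file with pandas; using Python's standard csv module here is an assumption made for illustration), stripping each field removes the stray space before the values are compared or converted:

```python
import csv
import io

# Two lines in the format described above; the data row ends with "1 " (trailing space).
raw = "N,S,Max_s,Aver_s,C,V,Grade\n27,111542.5,38299.5,4131.2,31.91,3559537.61,1 \n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)
rows = []
for fields in reader:
    # csv.reader keeps the space, so a comparison such as fields[-1] == "1"
    # would fail until the field is stripped.
    cleaned = [field.strip() for field in fields]
    rows.append(cleaned)

grade = int(rows[0][-1])
print(header)
print(rows[0], grade)
```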
Variable | Meaning |
---|---|
N | Number of pills |
S | Total pilling area |
Max_s | Maximum pilling area |
Aver_s | Average pilling area |
C | Contrast |
V | Optical volume |
Grade | Final grade |
2. Try multiple methods to predict the grade
1. Import packages
Install scikit-learn first:
pip install scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
2. Read and display the dataset
fiber = pd.read_csv("./fiber.csv")
fiber.head(15)
print(fiber)
"""
     N         S    Max_s   Aver_s      C           V  Grade
0   27  111542.5  38299.5  4131.20  31.91  3559537.61      1
1   27  110579.5  31220.0  3186.63  31.28  2690869.73      1
…
47   9   33853.0   6329.0  3761.44  41.17  1393863.42      4
"""
3. Split the dataset
The last column is the target (the grade), and the remaining six columns are the independent variables.
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
Split the dataset into two parts, a training set and a test set.
random_state is the random seed; fixing it ensures that the training set and test set are the same on every run.
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
Check the shapes.
There are 36 training samples and 12 test samples, 48 in total.
print(X_train.shape)  # (36, 6)
print(y_train.shape)  # (36,)
print(X_test.shape)  # (12, 6)
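The 36/12 split and the effect of the seed can be checked without the real file. The sketch below uses arbitrary stand-in arrays (X_demo and y_demo are hypothetical, not the fiber data) with the same 48 x 6 shape. Because test_size is not passed, scikit-learn falls back to its default of 0.25, which is where the 12 test rows come from.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as the article's dataset: 48 rows, 6 features.
# The values are arbitrary; only the shapes and the seed behaviour matter here.
X_demo = np.arange(48 * 6).reshape(48, 6)
y_demo = np.repeat([1, 2, 3, 4], 12)

# test_size is not given, so the default of 0.25 applies (12 of 48 rows).
split_a = train_test_split(X_demo, y_demo, random_state=0)
split_b = train_test_split(X_demo, y_demo, random_state=0)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)

# The same seed selects the same test rows on every call.
same_split = np.array_equal(X_test_a, split_b[1])
print(same_split)
```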
4. Fitting with different algorithms
①K-nearest neighbors algorithm, KNeighborsClassifier()
n_neighbors: the number of nearest neighbors to consult.
Here each new sample is classified by its 4 nearest training samples.
knn = KNeighborsClassifier(n_neighbors=4)
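To make the role of n_neighbors concrete, here is a minimal pure-Python sketch of the voting idea. It is not scikit-learn's implementation; the function knn_predict and the toy points are hypothetical and exist only for illustration.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=4):
    """Return the majority label among the k training points closest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D data: one cluster near the origin, one cluster near (5.5, 5.5).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = [1, 1, 1, 2, 4, 4, 4, 4]

near_origin = knn_predict(points, labels, (0.5, 0.5), k=4)
near_far_cluster = knn_predict(points, labels, (5.5, 5.5), k=4)
print(near_origin, near_far_cluster)
```

Because the vote depends on raw distances, a feature with very large values (such as V in this dataset) can dominate the result; scaling the features first is a common remedy.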
Fit the model on the training set.
knn.fit(X_train, y_train)
Predict on the test set X_test to get the predictions y_pred.
y_pred = knn.predict(X_test)
Compare the predictions y_pred with the true labels y_test and take the mean of the matches to get the accuracy.
accuracy = np.mean(y_pred == y_test)
print(accuracy)
The model's score method returns the same accuracy.
score = knn.score(X_test, y_test)
print(score)
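The accuracy line works because comparing two arrays yields booleans, and the mean of booleans is the fraction that are True. A small sketch with made-up predictions (y_pred_demo and y_true_demo are hypothetical values, not results from the fiber data):

```python
import numpy as np

# Hypothetical predictions and true grades, only for illustration.
y_pred_demo = np.array([1, 2, 2, 4, 3, 1])
y_true_demo = np.array([1, 2, 3, 4, 3, 2])

matches = y_pred_demo == y_true_demo  # element-wise comparison, boolean array
accuracy_demo = np.mean(matches)      # True counts as 1, False as 0
print(matches)
print(accuracy_demo)
```

Here 4 of the 6 predictions match, so the accuracy is 4/6.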
Pick one sample at random to test the model:
16,18312.5,6614.5,2842.31,25.23,1147430.19,2
Its true grade is 2.
test = np.array([[16, 18312.5, 6614.5, 2842.31, 25.23, 1147430.19]])
prediction = knn.predict(test)
print(prediction)
"""
[2]
"""
This sample was taken from the training set, which must not be done in practice. It is used here only as a quick check.
Complete code for the K-nearest neighbors algorithm
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
fiber = pd.read_csv("./fiber.csv")
# Separate the independent and dependent variables
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)  # model predictions
accuracy = np.mean(y_pred == y_test)  # accuracy
score = knn.score(X_test, y_test)  # score
print(accuracy)
print(score)
# Test
test = np.array([[16, 18312.5, 6614.5, 2842.31, 25.23, 1147430.19]])  # a single sample
prediction = knn.predict(test)  # predict its grade
print(prediction)
②Logistic regression algorithm, LogisticRegression()
Instantiate a logistic regression object.
lr = LogisticRegression()
Fit the model on the training set.
lr.fit(X_train, y_train)  # fit the model
Predict on the test set X_test to get the predictions y_pred.
y_pred = lr.predict(X_test)  # model predictions
Compare the predictions y_pred with the true labels y_test and take the mean of the matches to get the accuracy.
accuracy = np.mean(y_pred == y_test)
print(accuracy)
The model's score method returns the same accuracy.
score = lr.score(X_test, y_test)
print(score)
Pick one sample at random to test the model:
20,44882.5,10563,5623.88,27.15,3053651.65,1
Its true grade is 1.
test = np.array([[20, 44882.5, 10563, 5623.88, 27.15, 3053651.65]])  # a single sample whose true grade is 1
prediction = lr.predict(test)  # predict its grade
print(prediction)
"""
[1]
"""
This sample was taken from the training set, which must not be done in practice. It is used here only as a quick check.
Complete code for logistic regression
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
fiber = pd.read_csv("./fiber.csv")
# Separate the independent and dependent variables
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
lr = LogisticRegression()
lr.fit(X_train, y_train)  # fit the model
y_pred = lr.predict(X_test)  # model predictions
accuracy = np.mean(y_pred == y_test)  # accuracy
score = lr.score(X_test, y_test)  # score
print(accuracy)
print(score)
test = np.array([[20, 44882.5, 10563, 5623.88, 27.15, 3053651.65]])  # a single sample
prediction = lr.predict(test)  # predict its grade
print(prediction)
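The six features live on very different scales (C is in the tens, V in the millions), and LogisticRegression may converge slowly or emit a convergence warning on such data. One common option, which is not something the code above does, is to standardize the features in a pipeline. The sketch below uses synthetic stand-in data (X_demo, y_demo and the scales array are assumptions chosen only to mimic the differing magnitudes).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 48 samples, 4 grades, 6 features on very different scales.
y_demo = np.repeat([1, 2, 3, 4], 12)
scales = np.array([10.0, 1e5, 1e4, 1e3, 10.0, 1e6])
X_demo = (y_demo[:, None] + rng.normal(0.0, 0.3, size=(48, 6))) * scales

# StandardScaler rescales every column to zero mean and unit variance
# before the data reaches LogisticRegression.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lr.fit(X_demo, y_demo)

train_accuracy = scaled_lr.score(X_demo, y_demo)
print(train_accuracy)
```

The pipeline exposes the same fit, predict and score methods as the bare model, so it can be swapped into the steps above without other changes.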
③Linear support vector machine, LinearSVC()
Instantiate a linear SVM object.
lsvc = LinearSVC()
Fit the model on the training set.
lsvc.fit(X_train, y_train)  # fit the model
Predict on the test set X_test to get the predictions y_pred.
y_pred = lsvc.predict(X_test)  # model predictions
Compare the predictions y_pred with the true labels y_test and take the mean of the matches to get the accuracy.
accuracy = np.mean(y_pred == y_test)
print(accuracy)
The model's score method returns the same accuracy.
score = lsvc.score(X_test, y_test)
print(score)
Pick one sample at random to test the model:
20,55997.5,17644.5,2799.88,8.58,480178.56,2
Its true grade is 2.
test = np.array([[20, 55997.5, 17644.5, 2799.88, 8.58, 480178.56]])  # a single sample
prediction = lsvc.predict(test)  # predict its grade
print(prediction)
"""
[2]
"""
This sample was taken from the training set, which must not be done in practice. It is used here only as a quick check.
Complete code for the linear support vector machine
from sklearn.svm import LinearSVC
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
fiber = pd.read_csv("./fiber.csv")
# Separate the independent and dependent variables
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
lsvc = LinearSVC()
lsvc.fit(X_train, y_train)  # fit the model
y_pred = lsvc.predict(X_test)  # model predictions
accuracy = np.mean(y_pred == y_test)  # accuracy
score = lsvc.score(X_test, y_test)  # score
print(accuracy)
print(score)
test = np.array([[20, 55997.5, 17644.5, 2799.88, 8.58, 480178.56]])  # a single sample
prediction = lsvc.predict(test)  # predict its grade
print(prediction)
④Support vector machine, SVC()
Instantiate the SVM object (gamma='auto' is used here to match the complete code at the end of this subsection).
svc = SVC(gamma='auto')
Fit the model on the training set.
svc.fit(X_train, y_train)  # fit the model
Predict on the test set X_test to get the predictions y_pred.
y_pred = svc.predict(X_test)  # model predictions
Compare the predictions y_pred with the true labels y_test and take the mean of the matches to get the accuracy.
accuracy = np.mean(y_pred == y_test)
print(accuracy)
The model's score method returns the same accuracy.
score = svc.score(X_test, y_test)
print(score)
Pick one sample at random to test the model:
23,97215.5,22795.5,2613.09,29.72,1786141.62,1
Its true grade is 1.
test = np.array([[23, 97215.5, 22795.5, 2613.09, 29.72, 1786141.62]])  # a single sample
prediction = svc.predict(test)  # predict its grade
print(prediction)
"""
[1]
"""
This sample was taken from the training set, which must not be done in practice. It is used here only as a quick check.
Complete code for the support vector machine
from sklearn.svm import SVC
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
fiber = pd.read_csv("./fiber.csv")
# Separate the independent and dependent variables
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
svc = SVC(gamma='auto')
svc.fit(X_train, y_train)  # fit the model
y_pred = svc.predict(X_test)  # model predictions
accuracy = np.mean(y_pred == y_test)  # accuracy
score = svc.score(X_test, y_test)  # score
print(accuracy)
print(score)
test = np.array([[23, 97215.5, 22795.5, 2613.09, 29.72, 1786141.62]])  # a single sample
prediction = svc.predict(test)  # predict its grade
print(prediction)
⑤Decision tree, DecisionTreeClassifier()
As you may have noticed, the steps for the first four methods are almost identical; only the instantiated object differs. So I won't repeat them from here on.
Pick one sample at random to test the model:
11,99498,5369,9045.27,28.47,3827588.56,4
Its true grade is 4.
Complete code for the decision tree
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
fiber = pd.read_csv("./fiber.csv")
# Separate the independent and dependent variables
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)  # fit the model
y_pred = dtc.predict(X_test)  # model predictions
accuracy = np.mean(y_pred == y_test)  # accuracy
score = dtc.score(X_test, y_test)  # score
print(accuracy)
print(score)
test = np.array([[11, 99498, 5369, 9045.27, 28.47, 3827588.56]])  # a single sample
prediction = dtc.predict(test)  # predict its grade
print(prediction)
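One detail worth knowing: DecisionTreeClassifier() permutes the features randomly at each split, so when several splits are equally good the fitted tree can differ between runs. Passing random_state, which the code above does not do, makes the tree reproducible. A minimal sketch on random stand-in data (X_demo, y_demo and X_new are hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Random stand-in data with the same shape as the article's dataset.
X_demo = rng.normal(size=(48, 6))
y_demo = np.repeat([1, 2, 3, 4], 12)

# Two trees fitted with the same seed on the same data.
tree_a = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
tree_b = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# With a fixed seed, both trees give the same predictions on new points.
X_new = rng.normal(size=(10, 6))
identical = np.array_equal(tree_a.predict(X_new), tree_b.predict(X_new))
print(identical)
```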
⑥Gaussian naive Bayes, GaussianNB()
Pick one sample at random to test the model:
14,160712,3208,3681.25,36.31,1871275.09,3
Its true grade is 3.
Complete code for Gaussian naive Bayes
from sklearn.naive_bayes import GaussianNB
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
fiber = pd.read_csv("./fiber.csv")
# Separate the independent and dependent variables
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
gnb = GaussianNB()
gnb.fit(X_train, y_train)  # fit the model
y_pred = gnb.predict(X_test)  # model predictions
accuracy = np.mean(y_pred == y_test)  # accuracy
score = gnb.score(X_test, y_test)  # score
print(accuracy)
print(score)
test = np.array([[14, 160712, 3208, 3681.25, 36.31, 1871275.09]])  # a single sample
prediction = gnb.predict(test)  # predict its grade
print(prediction)
⑦Bernoulli naive Bayes, BernoulliNB()
Pick one sample at random to test the model:
18,57541.5,10455,2843.36,30.68,1570013.02,2
Its true grade is 2.
Complete code for Bernoulli naive Bayes
from sklearn.naive_bayes import BernoulliNB
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
fiber = pd.read_csv("./fiber.csv")
# Separate the independent and dependent variables
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
bnb = BernoulliNB()
bnb.fit(X_train, y_train)  # fit the model
y_pred = bnb.predict(X_test)  # model predictions
accuracy = np.mean(y_pred == y_test)  # accuracy
score = bnb.score(X_test, y_test)  # score
print(accuracy)
print(score)
test = np.array([[18, 57541.5, 10455, 2843.36, 30.68, 1570013.02]])  # a single sample
prediction = bnb.predict(test)  # predict its grade
print(prediction)
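A caveat about BernoulliNB on data like this: by default it binarizes every feature at the threshold binarize=0.0, so features that are all positive (as in the sample rows shown in this article) all become 1 and the model can no longer tell samples apart. The sketch below illustrates this on random positive stand-in data (X_demo and y_demo are hypothetical, not the fiber data).

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Stand-in data: 48 samples, 6 strictly positive features, 4 grades.
X_demo = rng.uniform(1.0, 100.0, size=(48, 6))
y_demo = np.repeat([1, 2, 3, 4], 12)

# Default binarize=0.0: every value above 0 is mapped to 1 before fitting.
bnb_demo = BernoulliNB()
bnb_demo.fit(X_demo, y_demo)

# After binarization every row is identical, so every prediction is identical too.
pred = bnb_demo.predict(X_demo)
distinct_predictions = set(pred.tolist())
print(distinct_predictions)
```

If BernoulliNB is to be used on continuous features, a meaningful binarize threshold (or a different model) is probably needed.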
⑧Multinomial naive Bayes, MultinomialNB()
Pick one sample at random to test the model:
9,64794,5560,10682.94,38.99,3748367.45,4
Its true grade is 4.
Complete code for multinomial naive Bayes
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
fiber = pd.read_csv("./fiber.csv")
# Separate the independent and dependent variables
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
mnb = MultinomialNB()
mnb.fit(X_train, y_train)  # fit the model
y_pred = mnb.predict(X_test)  # model predictions
accuracy = np.mean(y_pred == y_test)  # accuracy
score = mnb.score(X_test, y_test)  # score
print(accuracy)
print(score)
test = np.array([[9, 64794, 5560, 10682.94, 38.99, 3748367.45]])  # a single sample
prediction = mnb.predict(test)  # predict its grade
print(prediction)
Finally, after parameter tuning and optimization, the decision tree was chosen to predict the grade of the samples.
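Since every model follows the same fit/predict/score steps, the comparison can be written as a loop. The sketch below is one possible way to do it, not the procedure actually used to reach the conclusion above: it scores three of the models with 4-fold cross-validation, which tends to be steadier than a single 12-sample test set. The data are synthetic stand-ins (X_demo, y_demo), and the choice of models and of cv=4 is an assumption.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 48 samples, 4 grades, 6 noisy features that follow the grade.
y_demo = np.repeat([1, 2, 3, 4], 12)
X_demo = y_demo[:, None] + rng.normal(0.0, 0.5, size=(48, 6))

# Same steps for every model; only the estimator object changes.
models = {
    "knn": KNeighborsClassifier(n_neighbors=4),
    "tree": DecisionTreeClassifier(random_state=0),
    "gnb": GaussianNB(),
}

results = {}
for name, model in models.items():
    # cross_val_score fits a fresh copy of the model on each of the 4 folds.
    scores = cross_val_score(model, X_demo, y_demo, cv=4)
    results[name] = scores.mean()
    print(name, round(results[name], 3))
```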
5. Saving and loading the model
Here we take the decision tree as an example.
Save the trained model with joblib.dump(dtc, './dtc.model'), where dtc is the fitted model object and ./dtc.model is the path and file name to save to.
Load the model with dtc_yy = joblib.load('./dtc.model').
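A quick way to confirm that saving and loading worked is to check that the restored model predicts exactly like the original. The sketch below does this on random stand-in data and writes to a temporary directory instead of ./dtc.model; the names dtc_demo, restored and model_path are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Random stand-in data with the same shape as the article's dataset.
X_demo = rng.normal(size=(48, 6))
y_demo = np.repeat([1, 2, 3, 4], 12)

dtc_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "dtc.model")
    joblib.dump(dtc_demo, model_path)
    restored = joblib.load(model_path)

# The restored model should give the same predictions as the original.
round_trip_ok = np.array_equal(dtc_demo.predict(X_demo), restored.predict(X_demo))
print(round_trip_ok)
```

joblib files are pickle-based, so only load model files from sources you trust.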
Complete code
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import joblib
fiber = pd.read_csv("./fiber.csv")
# Separate the independent and dependent variables
X = fiber.drop(['Grade'], axis=1)
Y = fiber['Grade']
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)  # fit the model
joblib.dump(dtc, './dtc.model')  # save the model
y_pred = dtc.predict(X_test)  # model predictions
accuracy = np.mean(y_pred == y_test)  # accuracy
score = dtc.score(X_test, y_test)  # score
print(accuracy)
print(score)
dtc_yy = joblib.load('./dtc.model')  # load the saved model
test = np.array([[11, 99498, 5369, 9045.27, 28.47, 3827588.56]])  # a single sample
prediction = dtc_yy.predict(test)  # predict its grade with the loaded model
print(prediction)
The saved model is as follows: