Decision Tree Algorithm: How to Choose Contact Lenses


From the data in this case, we can see that the final decision, whether contact lenses can be worn at all and which material to choose, depends on the attributes collected beforehand; the material is decided only after screening those attributes layer by layer.

In the decision tree shown in the figure above, the root node of the tree is at the top and the many leaf nodes are at the bottom. By screening step by step we finally decide which material to choose: once one condition has been checked, we move on to filter by the next condition.


Entropy is a measure of the uncertainty of a random variable Y. For a variable taking values with probabilities p_1, ..., p_n, it is calculated as H(Y) = -Σ_i p_i * log2(p_i). If a probability is 0, we adopt the convention 0 * log(0) = 0, so that term contributes nothing.

The greater the uncertainty or disorder of the information, the greater its entropy. If an event is certain, its entropy is 0; if an event has several possible outcomes, its entropy is larger, because its degree of disorder is greater.
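To make the formula concrete, here is a minimal Python sketch (illustrative only, separate from the tutorial code further below) that computes entropy from a list of probabilities:

import numpy as np

def entropy(probs):
    # H(Y) = -sum(p * log2(p)), with the convention 0 * log(0) = 0
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # zero-probability terms contribute nothing
    return -np.sum(p * np.log2(p))

print(entropy([1.0]))            # a certain event: entropy 0
print(entropy([0.5, 0.5]))       # two equally likely outcomes: 1.0
print(entropy([1/3, 1/3, 1/3]))  # three equally likely outcomes: ~1.585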



Here is an example of calculating entropy and conditional entropy. When calculating the entropy directly, there are three classes, and each enters the formula with its overall probability. When calculating the conditional entropy, the samples are first divided into two groups according to the feature X, and then within each group the probability of each "suitable to wear" class is computed separately; the conditional entropy is the probability-weighted average of the group entropies: H(Y|X) = Σ_x p(x) * H(Y|X=x).

For the class variable Y we have a total entropy, and a conditional entropy can also be calculated with respect to a given feature. The information gain is the total entropy minus the conditional entropy: Gain(Y, X) = H(Y) - H(Y|X). It measures how much the chosen condition reduces the total entropy; the larger the reduction, the more effective the condition is.
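To connect these formulas, here is a small illustrative sketch (toy labels, not the actual lens data) that computes H(Y), H(Y|X), and the gain:

import numpy as np

def entropy_from_labels(labels):
    # entropy of a label array, estimated from class frequencies
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):
    # H(Y|X) = sum over values v of P(X = v) * H(Y | X = v)
    x, y = np.asarray(x), np.asarray(y)
    return sum((x == v).mean() * entropy_from_labels(y[x == v])
               for v in np.unique(x))

def information_gain(x, y):
    # gain = total entropy minus conditional entropy
    return entropy_from_labels(y) - conditional_entropy(x, y)

# toy feature (say, astigmatism: 0/1) against toy class labels
x = [0, 0, 0, 1, 1, 1]
y = ['soft', 'soft', 'none', 'hard', 'hard', 'none']
print(entropy_from_labels(y))     # H(Y)   ~ 1.585
print(conditional_entropy(x, y))  # H(Y|X) ~ 0.918
print(information_gain(x, y))     # gain   ~ 0.667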

The basic idea of a decision tree is a greedy algorithm: it constructs the tree in a top-down recursive manner, and at each step it takes the best choice available in the current state, that is, the split with the largest information gain.



The ID3 algorithm uses information gain as its selection criterion. When extending downward from the root node, if you are not sure which feature to split on first, you can compute the information gain of each feature; the feature with the largest information gain is the one to select for the node.
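A minimal sketch of this selection step, reusing the information_gain() helper defined in the sketch above (the tiny table here is illustrative, not the actual lens data):

import pandas as pd

toy = pd.DataFrame({
    'astigmatism': [0, 0, 0, 1, 1, 1],
    'tears':       [0, 1, 1, 0, 1, 1],
    'label':       ['none', 'soft', 'soft', 'none', 'hard', 'hard'],
})

gains = {f: information_gain(toy[f], toy['label'])
         for f in ['astigmatism', 'tears']}
print(gains)                      # gain of each candidate feature
print(max(gains, key=gains.get))  # -> 'tears': the feature to split on first
# ID3 then recurses: it splits the data on the chosen feature and repeats the
# same gain calculation on each subset until the leaves are pure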

Next we move on to the decision tree code walkthrough; there are several things to pay attention to here.

'''step1 import packages'''
import pandas as pd
import numpy as np
import matplotlib as mpl
from sklearn.model_selection import train_test_split  # dataset-splitting function
from sklearn.metrics import accuracy_score  # accuracy-scoring function
from sklearn import tree
# from sklearn.externals.six import StringIO
# StringIO raises an error when imported from sklearn; it should come from io
from io import StringIO
import pydotplus

'''step2 import data'''
data = pd.read_excel('data_lenses.xlsx')

What needs attention here is the StringIO import: it must come from the standard io module, because importing it via sklearn.externals.six fails in current sklearn versions. The data is read in as a pandas DataFrame.

'''step3 data preprocessing'''
Names_class = 'not suitable', 'hard material', 'soft material'
# Re-encode the discrete (string) variables as integers
for feature in data.columns:
    if data[feature].dtype == 'object':
        data[feature] = pd.Categorical(data[feature]).codes
X, y = data.iloc[:, :4], data.iloc[:, 4]

# age
# 0 - pre-senile, 1 - old, 2 - young
# symptom
# 0 - myopia, 1 - high eye pressure
# astigmatism
# 0 - no, 1 - yes
# tears
# 0 - reduced, 1 - normal

# suitability (class label)
# 0 - not suitable, 1 - hard material, 2 - soft material

The categories in the data are converted into numbers and re-encoded in this way: pd.Categorical(...).codes assigns each distinct string value an integer code.
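As a small illustration of what that loop does (a toy column, not the actual file), pd.Categorical assigns integer codes to the string categories in sorted order:

import pandas as pd

col = pd.Series(['no', 'yes', 'yes', 'no'])
print(pd.Categorical(col).codes)       # [0 1 1 0]
print(pd.Categorical(col).categories)  # Index(['no', 'yes'], dtype='object')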

'''step4 divide the dataset'''
# Divide the data and class labels into a training set (80%) and a test set (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, random_state=3)

'''step5 model calculation (training, testing, evaluation)'''
# step5.1 Train the model (ID3-style: entropy criterion)
# Create the DecisionTreeClassifier() class
model_dt = tree.DecisionTreeClassifier(criterion='entropy')

model_dt.fit(X_train, y_train)  # build the decision tree from the training data

# step5.2 Test the model
y_pred = model_dt.predict(X_test)
print('test accuracy:', accuracy_score(y_test, y_pred))

With the dataset divided, model training and testing are carried out together. The more involved part below is the visualization of the decision tree.

# Use Python to call GraphViz, the open-source graph-drawing tool from AT&T Labs,
# to visualize the decision tree
# Reference: https://blog.csdn.net/qq_42700429/article/details/82927961
# graphviz-2.38.msi
# Record its installation path and find the bin directory, e.g. E:\Graphviz2.38\bin
# (note the backslashes in the Windows path; use a raw string in Python)

For visualization, the GraphViz drawing tool is used here. After installing it, find its installation path as noted above and run the code below, remembering to change the path to your own.

import os
os.environ["PATH"] += os.pathsep + r'D:\Program Files\Graphviz\bin'  # this is my path; change to your own
# step5.3 Visualize the decision tree
mpl.rcParams['font.sans-serif'] = ['simHei']
mpl.rcParams['axes.unicode_minus'] = False
dot_data = StringIO()
tree.export_graphviz(model_dt, out_file=dot_data,  # draw the decision tree
                     feature_names=X_train.keys(),
                     class_names=Names_class,
                     filled=True, rounded=True,
                     special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
from IPython.display import Image
Image(graph.create_png())

Executing the final Image line displays the drawn decision tree. An earlier version of this code included a font-conversion step, but it raised an error when executed, so I removed it; you can compare it with the code above.

graph = pydotplus.graph_from_dot_data(
    dot_data.getvalue().replace('helvetica', '"Microsoft YaHei"'))


Besides rendering the tree directly, we can also export it as a PDF for viewing.

graph.write_pdf("Contact lens selection decision tree model.pdf")

Just execute this line, but do not combine it with the font-converted version above!

'''step6 prediction'''
# Once the model is trained and the test results are satisfactory, the last 9 rows
# can be treated as unknown-category samples for prediction (classification)
print('\nPrediction result: (0 - not suitable, 1 - hard material, 2 - soft material)\n',
      model_dt.predict([[0, 0, 0, 1]]))

The last stage is prediction. Given the attribute encodings listed above, we feed in a sample and the model returns the predicted class.
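One small usage note: recent versions of sklearn emit a warning when a model fitted on a DataFrame (with column names) is asked to predict on a bare list. A minimal sketch that avoids this, assuming the code above has already run:

# Optional: wrap the sample in a DataFrame carrying the training column names,
# which silences sklearn's "does not have valid feature names" warning
sample = pd.DataFrame([[0, 0, 0, 1]], columns=X_train.columns)
print(model_dt.predict(sample))  # same predicted class as above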