Decision tree: the path of wisdom from root to leaf

Article directory

  • Preface
  • 1. What is a decision tree?
    • 1. Classification trees and regression trees
    • 2. Basic concepts
  • 2. Basic algorithmic ideas of decision trees
    • How to choose the splitting condition of a node?
      • 1. Information Entropy
      • 2. Information Gain (ID3 algorithm)
      • 3. Gain Ratio (C4.5 algorithm)
      • 4. Gini Index (CART algorithm)
    • When to stop splitting?
  • 3. Decision tree construction process
  • 4. Python implementation of decision tree
  • 5. Application areas of decision trees
  • 6. Advantages and disadvantages of decision trees
    • Advantages
    • Disadvantages
  • 7. Improvements and extensions of decision trees
  • Conclusion

Preface

Decision Tree is a powerful machine learning algorithm based on a tree structure and is widely used in classification and regression problems. It works similarly to the human thought process when making decisions, gradually separating the data through a series of questions and finally producing a prediction or decision. This blog will delve into the basic concepts of decision trees, the algorithmic ideas behind building them, and the metrics used to measure uncertainty, with examples to deepen understanding.

1. What is a decision tree?

A decision tree is a supervised learning algorithm based on a tree structure, used to make decisions or predictions from a set of data points in both classification and regression problems. It splits the data through a series of questions or conditions until a final decision or result is reached. There are two main types of decision trees: classification trees and regression trees.

1. Classification trees and regression trees

  • Classification trees are used to solve classification problems by dividing a data set into different categories. For example, a classification tree can be used to predict whether an email is spam or not spam, or to determine whether a fruit is an apple or an orange.

  • Regression trees are used to solve regression problems and predict a continuous value. For example, a regression tree can be used to predict the sale price of a house based on its characteristics. (A short code sketch contrasting the two types follows this list.)
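To make the distinction concrete, here is a minimal sketch using scikit-learn with tiny synthetic data; the feature values and targets are invented purely for illustration:

# Minimal sketch: a classification tree vs. a regression tree (synthetic data for illustration)
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: predict a discrete label (e.g., 0 = "not spam", 1 = "spam")
X_cls = [[0, 1], [1, 1], [1, 0], [0, 0]]
y_cls = [0, 1, 1, 0]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[1, 1]]))   # outputs a class label

# Regression: predict a continuous value (e.g., a house price from its floor area)
X_reg = [[50], [80], [120], [200]]
y_reg = [150.0, 220.0, 310.0, 480.0]
reg = DecisionTreeRegressor().fit(X_reg, y_reg)
print(reg.predict([[100]]))    # outputs a continuous value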

2. Basic concepts

Before understanding decision trees, let us familiarize ourselves with some basic concepts (a small code sketch illustrating them follows this list):

  • Root Node: The starting node of the decision tree, including the entire data set.

  • Leaf/Terminal Node: A terminal node of the tree that is not split further; it represents the final decision or prediction.

  • Branch: Directed edges connecting nodes, representing the separation of data according to different conditions.

  • Depth: The hierarchical depth of the tree, indicating the number of levels from the root node to the leaf nodes, reflecting the complexity of the tree. The root node has a depth of 0, and the depth increases by 1 for each lower level.
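To make these terms concrete, here is a hypothetical sketch of how a single node might be represented in code; the field names are illustrative and do not come from any particular library:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """A hypothetical decision tree node, for illustration only."""
    feature: Optional[str] = None       # feature tested at this node (None for a leaf)
    threshold: Optional[float] = None   # split threshold for a numeric feature
    prediction: Optional[int] = None    # class label stored at a leaf node
    left: Optional["Node"] = None       # branch taken when the condition holds
    right: Optional["Node"] = None      # branch taken otherwise
    depth: int = 0                      # 0 at the root, increasing by 1 per level

# A root node splitting on one feature, with two leaf children at depth 1.
root = Node(feature="petal length", threshold=2.45, depth=0,
            left=Node(prediction=0, depth=1),
            right=Node(prediction=1, depth=1))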

2. Basic algorithmic ideas of decision trees

The core idea of a decision tree is to reduce uncertainty by recursively splitting the data set into purer subsets, and ultimately achieve the goal of classification or regression. Key issues include how to select the splitting conditions for a node and when to stop splitting.

How to choose the splitting condition of a node?

To choose the splitting condition of a node, we need to measure the uncertainty of the data. The following are several commonly used uncertainty measures:

1. Information Entropy

Information entropy measures how disordered (impure) the data at a node is. For a node, it is calculated as follows:

Ent(D) = -\sum_{i=1}^{c} p_i \log_2(p_i)

where D is the data set at the node, c is the number of categories, and p_i is the proportion of samples in D belonging to category i.
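As a quick numerical illustration, here is a minimal NumPy sketch that computes the entropy of an array of class labels (the helper name is ours, not from a library):

import numpy as np

def entropy(labels):
    """Compute Ent(D) for a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()            # class proportions p_i
    return -np.sum(p * np.log2(p))       # -sum(p_i * log2(p_i))

# A perfectly mixed binary set has entropy 1 bit; a pure set has entropy 0.
print(entropy(np.array([0, 0, 1, 1])))   # 1.0
print(entropy(np.array([1, 1, 1, 1])))   # 0.0 (printed as -0.0)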

2. Information Gain (ID3 algorithm)

Information gain measures the reduction in information entropy after splitting data by a certain feature. For a node, the information gain is calculated as follows:

Gain(D, A) = Ent(D) - \sum_{v \in Values(A)} \frac{|D_v|}{|D|} \cdot Ent(D_v)

where D is the data set at the parent node, A is the candidate feature, and D_v is the subset of D in which feature A takes the value v.
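Continuing the NumPy sketch, information gain for a categorical feature could be computed like this (the entropy helper is repeated so the snippet stands on its own):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """Gain(D, A) = Ent(D) - sum_v |D_v|/|D| * Ent(D_v) for a categorical feature A."""
    weighted = sum(
        (feature_values == v).mean() * entropy(labels[feature_values == v])
        for v in np.unique(feature_values)
    )
    return entropy(labels) - weighted

# A feature that perfectly separates two balanced classes gains 1 bit.
feature = np.array(["a", "a", "b", "b"])
labels = np.array([0, 0, 1, 1])
print(information_gain(feature, labels))   # 1.0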

3. Gain Ratio (C4.5 algorithm)

Gain ratio normalizes information gain by the split information (intrinsic value) of the feature, which counteracts the bias of information gain toward features with many distinct values. Its calculation formula is as follows:

Gain\_ratio(D, A) = \frac{Gain(D, A)}{IV(D, A)}

where the split information (intrinsic value) of feature A is

IV(D, A) = -\sum_{v \in Values(A)} \frac{|D_v|}{|D|} \log_2\left(\frac{|D_v|}{|D|}\right)
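A self-contained sketch of the gain ratio, again for a categorical feature and with helper names of our own choosing:

import numpy as np

def _entropy_from_probs(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gain_ratio(feature_values, labels):
    """Gain_ratio(D, A) = Gain(D, A) / IV(D, A) for a categorical feature A."""
    n = len(labels)
    values, counts = np.unique(feature_values, return_counts=True)

    # Ent(D) of the parent node
    _, label_counts = np.unique(labels, return_counts=True)
    parent_entropy = _entropy_from_probs(label_counts / n)

    # Weighted child entropy for Gain(D, A)
    weighted = 0.0
    for v, c in zip(values, counts):
        _, sub_counts = np.unique(labels[feature_values == v], return_counts=True)
        weighted += c / n * _entropy_from_probs(sub_counts / c)

    gain = parent_entropy - weighted
    iv = _entropy_from_probs(counts / n)   # split information of feature A itself
    return gain / iv

feature = np.array(["a", "a", "b", "b"])
labels = np.array([0, 0, 1, 1])
print(gain_ratio(feature, labels))   # 1.0 (gain = 1 bit, IV = 1 bit)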

4. Gini Index (CART algorithm)

The Gini index is a measure of data impurity. For a node, the calculation formula of the Gini index is as follows:

Gini(D) = \sum_{i=1}^{c} p_i (1 - p_i) = 1 - \sum_{i=1}^{c} (p_i)^2

where D is the data set at the node, c is the number of categories, and p_i is the proportion of samples in D belonging to category i.
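And a matching NumPy sketch for the Gini index:

import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2 for a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array([0, 0, 1, 1])))   # 0.5 (maximally mixed binary set)
print(gini(np.array([1, 1, 1, 1])))   # 0.0 (pure set)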

When to stop splitting?

To prevent the tree from growing too large (overfitting), deciding when to stop splitting is a key issue in decision tree construction. Common strategies include:

  1. The tree reaches a predetermined depth: set a maximum depth and stop splitting once the tree reaches it.
  2. The number of samples in a node falls below a threshold: if a node contains fewer samples than a predetermined threshold, stop splitting that node.
  3. All samples belong to the same category: if the samples at a node all belong to the same category, stop splitting and make the node a leaf.
  4. The impurity (uncertainty measure) reaches a threshold: when the impurity of a node drops below a certain threshold, splitting can stop and the node is considered pure enough.
  5. Early termination: use methods such as cross-validation to decide when to stop splitting based on model performance.

The stopping strategy is usually chosen based on the specific problem and data set. Through these stopping conditions, the growth of the decision tree can be controlled so that it avoids overfitting while retaining predictive power, balancing the model's fitting ability against its generalization ability.
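In scikit-learn, several of these stopping criteria map directly onto hyperparameters of DecisionTreeClassifier; the values below are examples only and should be tuned for your own data:

from sklearn.tree import DecisionTreeClassifier

# Example values only; in practice tune them (e.g., via cross-validation).
clf = DecisionTreeClassifier(
    max_depth=4,                 # strategy 1: cap the depth of the tree
    min_samples_split=10,        # strategy 2: do not split nodes with fewer samples than this
    min_samples_leaf=5,          # also require at least this many samples in each leaf
    min_impurity_decrease=0.01,  # strategy 4: require a minimum impurity reduction per split
)

Strategy 3 (pure nodes) is applied automatically, and strategy 5 corresponds roughly to choosing these values with tools such as cross-validation.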

3. Decision tree construction process

The establishment process of a decision tree mainly includes the following steps:

  1. Select the best feature: At each step, the best feature is selected to split the data set according to some criterion (such as information gain or the Gini index). In general, the chosen feature should make the resulting subsets purer, that is, samples of the same category should be grouped together as much as possible.

  2. Split the node: Based on the selected optimal feature, the current node is split into several child nodes, each corresponding to one value of that feature. The data set is likewise divided into subsets, one per feature value.

  3. Recursive tree building: Repeat steps 1 and 2 for each child node until a stopping condition is met (for example, the tree reaches a predetermined depth or a node contains fewer samples than a threshold). A simplified code sketch of this recursive procedure is given after this list.

  4. Assign categories or values to leaf nodes: Once the structure of the tree is established, each leaf node is assigned a class label (classification) or a numerical value (regression). When a new sample enters the decision tree, it is routed along the tree according to its feature values until it reaches a leaf node, whose label or value becomes the tree's prediction.
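For intuition, here is a deliberately simplified recursive builder for categorical features, written in the ID3 spirit (maximizing information gain); it is a teaching sketch, not production code:

import numpy as np
from collections import Counter

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def build_tree(X, y, depth=0, max_depth=3):
    """Simplified ID3-style builder for categorical features (illustration only)."""
    # Stopping conditions: pure node or maximum depth reached -> return the majority class.
    if len(np.unique(y)) == 1 or depth >= max_depth:
        return Counter(y.tolist()).most_common(1)[0][0]

    # Step 1: choose the feature with the largest information gain.
    def gain(j):
        weighted = sum((X[:, j] == v).mean() * entropy(y[X[:, j] == v])
                       for v in np.unique(X[:, j]))
        return entropy(y) - weighted
    best = max(range(X.shape[1]), key=gain)

    # Steps 2-3: split on each value of the chosen feature and build the subtrees recursively.
    return {(best, v): build_tree(X[X[:, best] == v], y[X[:, best] == v], depth + 1, max_depth)
            for v in np.unique(X[:, best])}

# Tiny categorical example: feature 1 perfectly separates the classes.
X = np.array([["sunny", "hot"], ["sunny", "cool"], ["rain", "cool"], ["rain", "hot"]])
y = np.array([0, 1, 1, 0])
print(build_tree(X, y))   # nested dict mapping (feature index, value) -> subtree or leaf label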

4. Python implementation of decision tree

We can implement a simple decision tree model using Python to understand how the decision tree algorithm works. We will use the scikit-learn library to build and train a decision tree model.

First, we need to load the dataset. In this example we will use the Iris dataset, which contains three different types of iris flowers. We will try to predict the type of iris flower using the length and width of the petals and sepals.

from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

We then split the dataset into training and test sets and use the training set to train our decision tree model. In this example, we build a decision tree with the default parameters (which use Gini impurity as the splitting criterion).

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

Next, we can use the test set to evaluate the performance of the model.

from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Finally, we can visualize the decision tree model to better understand how it makes classification decisions.

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
plot_tree(clf,
          feature_names=iris.feature_names,
          class_names=iris.target_names,
          filled=True)
plt.show()

The resulting visualization of the decision tree is shown in the figure below:
(Figure: decision tree example)

For convenience, the complete code is as follows:

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Load the Iris data set
iris = load_iris()
X = iris.data
y = iris.target

# Split the data set into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a decision tree classifier and train it using the training data
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Use the test set to evaluate model performance
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Visualize the decision tree model
plt.figure(figsize=(12, 8))
plot_tree(clf,
          feature_names=iris.feature_names,
          class_names=iris.target_names,
          filled=True)
plt.show()
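If a graphical plot is inconvenient, scikit-learn can also print the learned rules as plain text; this sketch reuses the clf and iris objects defined in the code above:

from sklearn.tree import export_text

# Print the learned decision rules as indented text, an alternative to plot_tree.
print(export_text(clf, feature_names=list(iris.feature_names)))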

5. Application areas of decision trees

Decision trees are widely used in many fields, including but not limited to:

  • Medical Diagnosis: Used to predict disease or disease risk based on a patient’s symptoms and test results.

  • Finance: Used for credit scoring, fraud detection, and investment decisions.

  • Marketing: Used for customer segmentation, sales forecasting, and product recommendations.

  • Ecology: Used for species classification, ecosystem analysis, and environmental monitoring.

  • Industrial Manufacturing: For quality control, equipment failure detection and production optimization.

6. Advantages and disadvantages of decision trees

Advantages

  • Relatively easy to understand and explain, with good visualization.
  • Ability to handle mixed data types, including numeric and categorical features.
  • Requires little data preprocessing (for example, no feature scaling or normalization is needed).
  • Can handle missing data in some implementations (for example, C4.5).
  • It is suitable for small to medium-sized data sets and has fast training speed.

Disadvantages

  • For some complex problems, it is possible to overfit the data, resulting in poor generalization performance.
  • Sensitive to noise and outliers in the data.
  • Some algorithms (such as ID3) cannot handle continuous features directly and require discretization (binning) first.

7. Improvements and extensions of decision trees

In order to overcome some of the shortcomings of decision trees, ensemble methods such as random forests, gradient boosting trees, and XGBoost can be used. These algorithms combine multiple decision trees to improve accuracy and robustness.
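As a quick illustration, a random forest can be swapped in for the single tree from section 4 with only a small change; this sketch assumes the X_train/X_test split from section 4 is still in scope:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# An ensemble of many decision trees usually generalizes better than a single tree.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))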

Conclusion

Decision trees are a flexible and powerful algorithm suitable for a variety of machine learning problems. By understanding the basic concepts and working principles of decision trees, and deepening our understanding through examples, we can better apply this algorithm to solve practical problems.

If you want to learn more about decision tree algorithms, you can explore related programming libraries and tools such as Scikit-Learn and XGBoost to put this powerful tool into practice. In practical applications, we can also optimize the decision tree by adjusting hyperparameters, pruning, etc., so that it can better adapt to different data sets.

I hope this blog has been helpful in understanding and applying the decision tree algorithm, if you have any questions or need further information, please feel free to ask.