Implementation process of decision trees and ensemble algorithms

Decision trees and ensemble algorithms

The composition of the tree:
Root node: the first split point
Non-leaf nodes and branches: intermediate decision steps
Leaf nodes: the final decision results

Entropy: H(X) = -Σ (p_i * log2(p_i)), i = 1, 2, ..., n

Gini(S) = 1 - Σ (p_k^2). The value lies between 0 and 1, and 0 is the purest.

Initial entropy: the entropy of the result (label) distribution. If the result has several categories, each category contributes one -p_i * log2(p_i) term and the terms are added together.

New entropy: splitting on a feature divides the data into several subsets. The entropy of each subset is calculated in the same way as the initial entropy, and the subset entropies are then added together, weighted by the proportion of samples that fall into each subset.

Information gain: ID3 (if a feature splits the data into very many subsets, the new entropy can drop to 0; ID3 does not account for the number of values a feature takes, so it is biased toward such features):

InformationGain(S, A) = Entropy(S) - H(S|A)

GainRatio(S, A) = InformationGain(S, A) / SplitInformation(A)

SplitInformation(A) = - Σ (p(t) * log2(p(t)))

Information gain ratio: C4.5

CART: uses the Gini index as the measure; it is applied in the same way as information gain.

The higher the information gain ratio, the greater the feature's contribution to classification.
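To make these definitions concrete, a minimal NumPy sketch that computes the entropy, Gini value, information gain, and gain ratio for a made-up toy feature/label pair:

import numpy as np

def entropy(labels):
    # H(X) = -sum(p_i * log2(p_i)) over the label distribution
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(S) = 1 - sum(p_k^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Toy example: one feature column and the class labels
feature = np.array(['sunny', 'sunny', 'rain', 'rain', 'rain'])
labels  = np.array(['no', 'no', 'yes', 'yes', 'no'])

initial_entropy = entropy(labels)

# New (conditional) entropy: weighted sum of the subset entropies after splitting on the feature
values, counts = np.unique(feature, return_counts=True)
new_entropy = sum((c / len(labels)) * entropy(labels[feature == v])
                  for v, c in zip(values, counts))

info_gain = initial_entropy - new_entropy
split_info = -np.sum((counts / len(labels)) * np.log2(counts / len(labels)))
gain_ratio = info_gain / split_info
print(initial_entropy, new_entropy, info_gain, gain_ratio, gini(labels))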

Decision tree pruning: the risk of over-fitting in a decision tree is very high (if the tree is allowed to grow large enough, each leaf node could end up holding just one sample).

Pre-pruning: limit the depth, the number of leaf nodes, the minimum number of samples per leaf, the minimum information gain, etc.
Post-pruning: after the tree is built, prune according to a measurement criterion (the more leaf nodes, the larger the penalty term in the loss). If pruning a node's subtree lowers the loss, replace the node with a leaf whose category is the majority class of the samples in that subtree. This improves generalization ability.
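Both forms of pruning are exposed as parameters of scikit-learn's DecisionTreeClassifier; a minimal sketch (the iris data and the specific thresholds are placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruning: constrain tree growth up front
pre_pruned = DecisionTreeClassifier(max_depth=3,                 # limit depth
                                    max_leaf_nodes=8,            # limit number of leaf nodes
                                    min_samples_leaf=5,          # minimum samples per leaf
                                    min_impurity_decrease=0.01,  # minimum gain required to split
                                    random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: cost-complexity pruning penalizes the number of leaves via ccp_alpha
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=42)
post_pruned.fit(X_train, y_train)
print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))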

Ensemble algorithms: Bagging, Stacking, Boosting

Bagging (bootstrap aggregation) averages the results of multiple classifiers trained in parallel.

First, the training data set is sampled (with replacement) multiple times to ensure that the sample obtained each time is different.

Multiple models, such as tree models, are then trained separately on these samples.

At prediction time, the results of all the models are collected and then aggregated.

oob_score_ is the out-of-bag evaluation score of a bagging-based ensemble. Each tree is trained on a bootstrap sample, so the samples it never saw (the out-of-bag samples, a bit like a built-in test set, or validation inside the training data) can be predicted by that tree; these predictions are used to evaluate the performance of the classifier.
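A minimal, self-contained bagging sketch showing where oob_score_ comes from (make_moons and the parameter values are placeholders):

from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

# 500 trees, each trained on a bootstrap sample of 100 points;
# oob_score=True evaluates each tree on the samples it never saw
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=500,
                            max_samples=100,
                            bootstrap=True,
                            oob_score=True,
                            random_state=42)
bag_clf.fit(X, y)
print(bag_clf.oob_score_)  # out-of-bag accuracy estimate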

Adaboost (Adaptive Boosting Algorithm)

Increase the weight of wrongly classified samples (what was wrong this round is weighted more heavily so it is less likely to be wrong next round).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

sample_weights = np.ones(len(X_train))   # start with equal weights
learning_rate = 1.0
for i in range(5):
    svm_clf = SVC(kernel='rbf', C=0.05, random_state=42)
    svm_clf.fit(X_train, y_train, sample_weight=sample_weights)
    y_pred = svm_clf.predict(X_train)
    sample_weights[y_pred != y_train] *= (1 + learning_rate)  # boost the weights of misclassified samples
    plot_decision_boundary(svm_clf, X, y, alpha=0.2)          # custom plotting helper defined elsewhere
    plt.title('learning_rate = {}'.format(learning_rate))

It is important to point out that the AdaBoost algorithm is an iterative process that gradually improves the overall model by repeatedly training the base classifier and adjusting the sample weights. By gradually adjusting the sample weights, AdaBoost pays more attention to misclassified samples and therefore fits the difficult-to-classify samples in the data set more effectively.
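The same idea is available directly in scikit-learn's AdaBoostClassifier; a minimal sketch (the stump depth, n_estimators, and learning_rate values are placeholders, and X_train / y_train are assumed to be the split used above):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 200 depth-1 trees (decision stumps); each round re-weights the misclassified samples
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=200,
                             learning_rate=0.5,
                             random_state=42)
ada_clf.fit(X_train, y_train)
print(ada_clf.score(X_train, y_train))  # training accuracy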

Gradient Boosting

from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)
y2 = y - tree_reg1.predict(X)    # residuals of the first tree
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)
y3 = y2 - tree_reg2.predict(X)   # residuals of the second tree
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)
# Compute the residuals repeatedly, then add up the predictions of all the trees
y_pred = sum(tree.predict(X) for tree in (tree_reg1, tree_reg2, tree_reg3))
This really does have the flavor of gradient descent: for squared-error loss the residual is the negative gradient, so each new tree takes a step in the direction that reduces the loss.

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2,
                                 n_estimators=120,
                                 learning_rate=1.0,
                                 random_state=42)
gbrt.fit(X_train, y_train)  # fit on the training split so it can be evaluated on the validation set below

Plotting the individual trees and their cumulative predictions overlaid shows how each new tree corrects the residual errors of the previous ones.

Stacking (Stacking Integration)

Stacking uses a multi-layer structure to combine models. First, the original training set is used to train a set of base learners. Then the predictions of the base learners are used as input features to train another learner (the blender, or meta-learner) that produces the final prediction.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Base learners (random_forest_clf, extra_trees_clf, svm_clf, mlp_clf are assumed
# to have been instantiated earlier)
estimators = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train, y_train)

# Each base learner's predictions on the hold-out set become one input feature for the blender
X_val_predictions = np.empty((len(X_val), len(estimators)), dtype=np.float32)
for index, estimator in enumerate(estimators):
    X_val_predictions[:, index] = estimator.predict(X_val)

# Train the blender (meta-learner) on the base learners' predictions
rnd_forest_blender = RandomForestClassifier(n_estimators=200, random_state=42)
rnd_forest_blender.fit(X_val_predictions, y_val)
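scikit-learn also provides a ready-made StackingClassifier that implements the same pattern; a minimal sketch (the choice of base learners, the logistic-regression blender, and cv=5 are placeholder assumptions):

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stack_clf = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=42)),
                ('svc', SVC(probability=True, random_state=42))],
    final_estimator=LogisticRegression(),  # the blender
    cv=5)                                  # base-learner predictions come from cross-validation
stack_clf.fit(X_train, y_train)
print(stack_clf.score(X_val, y_val))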

Early Stopping Strategy

from sklearn.metrics import mean_squared_error

# Validation error after each boosting stage (1 tree, 2 trees, ...)
errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1   # staged_predict is 0-indexed, tree counts start at 1
min_error = np.min(errors)
plt.plot([bst_n_estimators, bst_n_estimators], [0, min_error], 'k--')   # mark the best stage
plt.plot([0, 120], [min_error, min_error], 'k--')                       # and the minimum validation error

Training should stop at this point; beyond it the model starts to overfit and the validation error goes back up.

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)
# warm_start=True lets fit() add trees incrementally when n_estimators is increased

min_val_error = float('inf')   # the initial value of min_val_error must be very large
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break   # early stopping

If the validation error fails to improve five times in a row, training stops.
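Newer versions of scikit-learn can run this loop internally: when n_iter_no_change is set, a validation_fraction of the training data is held out and training stops once the score stops improving. A minimal sketch (the parameter values are placeholders):

# Built-in early stopping: hold out a validation fraction and stop when the
# score does not improve for n_iter_no_change consecutive iterations
gbrt_es = GradientBoostingRegressor(max_depth=2,
                                    n_estimators=500,
                                    validation_fraction=0.1,
                                    n_iter_no_change=5,
                                    tol=1e-4,
                                    random_state=42)
gbrt_es.fit(X_train, y_train)
print(gbrt_es.n_estimators_)   # number of trees actually trained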

Random forest implementation logic

First think about the purpose: classify the data and build a predictive function

What’s out there currently: datasets, features, labels

What is the method: calculate the information entropy for each candidate feature, find the feature with the largest information gain (the difference between the original entropy and the entropy after splitting), and use this feature as the first non-leaf node; then repeat.

This is the most basic component of the decision tree: {'F2-WORK': {0: 'no', 1: 'yes'}}, i.e. what the result looks like when F2-WORK takes the values 0 and 1.

{'F3-HOME': {0: {'F2-WORK': {0: 'no', 1: 'yes'}}, 1: 'yes'}}

1. Calculate the initial entropy value and calculate the information gain rate of each feature separately. The largest one is the root node.

2. Recurse, find the next non-leaf node

3. Stop when only one feature remains, when the information gain is no longer positive, or when the node is classified completely correctly. Return the prediction result or category label; typically this is the category with the largest number of samples at the current node.

4. Get this decision tree

def createTree(dataset, labels, featLabels):            # featLabels starts as an empty list
    classList = [example[-1] for example in dataset]    # the class labels (last column)

    if classList.count(classList[0]) == len(classList):
        return classList[0]                              # stop: all samples belong to one class
    if len(dataset[0]) == 1:
        return majorityCnt(classList)                    # stop: no features left, return majority class

    bestFeat = chooseBestFeatureToSplit(dataset)         # index of the best feature
    bestFeatLabel = labels[bestFeat]                     # record the chosen feature in featLabels
    featLabels.append(bestFeatLabel)
    myTree = {bestFeatLabel: {}}                         # the feature label becomes a node of the tree

    del labels[bestFeat]                                 # remove the chosen feature from labels
    featValue = [example[bestFeat] for example in dataset]
    uniqueVals = set(featValue)                          # all distinct values of the best feature

    for value in uniqueVals:
        sublabels = labels[:]
        # Split the dataset on each value, drop the bestFeat column, and recurse downwards
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataset, bestFeat, value),
                                                  sublabels, featLabels)
    return myTree
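createTree relies on several helpers that are not shown above (majorityCnt, chooseBestFeatureToSplit, splitDataSet). A minimal sketch of one possible implementation, assuming dataset is a list of rows whose last column is the label and using ID3-style information gain:

import math
import operator

def calcShannonEnt(dataset):
    # Entropy of the label column (last column of each row)
    counts = {}
    for example in dataset:
        counts[example[-1]] = counts.get(example[-1], 0) + 1
    ent = 0.0
    for count in counts.values():
        p = count / len(dataset)
        ent -= p * math.log2(p)
    return ent

def splitDataSet(dataset, axis, value):
    # Rows where feature `axis` equals `value`, with that column removed
    return [example[:axis] + example[axis + 1:]
            for example in dataset if example[axis] == value]

def chooseBestFeatureToSplit(dataset):
    # Index of the feature with the largest information gain (ID3)
    baseEntropy = calcShannonEnt(dataset)
    bestGain, bestFeat = 0.0, 0
    for i in range(len(dataset[0]) - 1):          # last column is the label
        newEntropy = 0.0
        for value in set(example[i] for example in dataset):
            subset = splitDataSet(dataset, i, value)
            newEntropy += len(subset) / len(dataset) * calcShannonEnt(subset)
        gain = baseEntropy - newEntropy
        if gain > bestGain:
            bestGain, bestFeat = gain, i
    return bestFeat

def majorityCnt(classList):
    # The class label that occurs most often
    counts = {}
    for label in classList:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.items(), key=operator.itemgetter(1))[0]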

5. Integrate multiple decision trees

Random forest combines the results of multiple decision trees through ensemble learning. Specifically, Random Forest uses the following two main methods to integrate the predictions of multiple decision trees:

1. Voting: under the majority voting strategy, each decision tree predicts the sample and the most common category among the votes is selected as the final prediction. This generally applies to classification problems (a sketch of both voting modes follows this list).

Hard voting: a simple vote is taken over the predictions of the individual decision trees, i.e. the category with the most votes becomes the final prediction.

Soft voting: each decision tree gives a probability or confidence for each category (computed from Gini impurity, leaf class proportions, etc.). The predicted probabilities of all decision trees are averaged, and the category with the highest average probability is chosen as the final prediction.

2. Averaging: For regression problems, each decision tree will give a continuous prediction value, and the random forest takes the average of the prediction values of all decision trees as the final prediction result.

(Application of decision trees in regression problems)
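A minimal, self-contained sketch of the two voting rules using the individual trees inside a scikit-learn RandomForestClassifier (make_moons and the parameter values are placeholders; note that sklearn's forest itself averages class probabilities, i.e. the soft rule):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
forest = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)
X_new = X[:5]

# Hard voting: each tree casts one vote, the majority class wins
tree_votes = np.array([tree.predict(X_new) for tree in forest.estimators_]).astype(int)
hard_vote = np.array([np.bincount(votes).argmax() for votes in tree_votes.T])

# Soft voting: average the per-class probabilities of all trees, take the argmax
tree_probas = np.array([tree.predict_proba(X_new) for tree in forest.estimators_])
soft_vote = tree_probas.mean(axis=0).argmax(axis=1)

print(hard_vote, soft_vote, forest.predict(X_new))  # the forest itself uses the averaged (soft) rule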

6. Model evaluation and optimization: The performance of random forest can be evaluated through methods such as cross-validation. Hyperparameters, such as the number of decision trees and the maximum depth of each decision tree, can be adjusted as needed to further optimize the performance of the random forest model.
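A minimal sketch of this evaluation step with scikit-learn (the parameter grid is a placeholder, and X_train / y_train are assumed to be the training split used in the earlier examples):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV

# 5-fold cross-validation of a default forest
scores = cross_val_score(RandomForestClassifier(random_state=42), X_train, y_train, cv=5)
print(scores.mean())

# Grid search over the number of trees and the maximum depth
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)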
