Industrial steam volume forecast (Part 3)

  • Feature optimization
    • 1 Feature optimization method
      • 1.1 Synthetic features
      • 1.2 Simple transformation of features
      • 1.3 Use decision trees to create new features
      • 1.4 Feature combination
  • Model fusion
    • 1 Model optimization
      • 1.1 Model learning curve
      • 1.2 Model fusion improvement technology
      • 1.3 Prediction result fusion strategy
      • 1.4 Other improvement methods

Feature optimization

1 Feature optimization method

Features can be optimized by synthesizing features, performing simple transformations on features, using decision trees to create new features, and feature combinations.

1.1 Synthetic Features

Synthetic features refer to features that are not included in the input features, but are derived from one or more input features. Features created individually through normalization or scaling are not synthetic features.

Synthetic features include the following types:

(1) Multiply a feature by itself or by other features (known as a feature combination or feature cross).

(2) Divide one feature by another.

(3) Bucket (bin) continuous features, dividing them into multiple intervals.

1.2 Simple transformation of features

Transformation and combination of numerical features:

Monotonic transformations of a single feature column (such as taking a logarithm) are not useful for decision tree algorithms, because trees are invariant to monotonic transformations of individual features.

Linear combinations of features are only useful for decision trees and tree-based ensemble algorithms (such as gradient boosting and random forest), because tree models are not good at capturing correlations between different features; models such as SVM, linear regression, and neural networks can linearly combine features on their own.

Commonly used transformations and combinations of numerical features are as follows:
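A minimal sketch of such transformations, assuming a pandas DataFrame whose numeric columns include the hypothetical names V0 and V1:

# Sketch of common numerical transformations (V0 and V1 are assumed column names)
import numpy as np

def add_numeric_transforms(df, col_a='V0', col_b='V1'):
    df = df.copy()
    df[col_a + '_squared'] = df[col_a] ** 2                       # polynomial term
    df[col_a + '_log1p'] = np.log1p(df[col_a].abs())              # log transform (abs keeps it defined)
    df[col_a + '_plus_' + col_b] = df[col_a] + df[col_b]          # linear combination
    df[col_a + '_div_' + col_b] = df[col_a] / (df[col_b] + 1e-5)  # ratio with a small epsilon
    return df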

Combination of categorical and numerical features:
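A minimal sketch of this kind of combination, assuming a DataFrame with hypothetical categorical and numerical columns; group-by statistics of the numerical feature within each category are a common choice:

# Sketch: combine a categorical feature with a numerical feature via group-by statistics
# (cat_col and num_col are hypothetical column names)
def add_group_stats(df, cat_col, num_col):
    df = df.copy()
    stats = df.groupby(cat_col)[num_col].agg(['mean', 'max', 'std'])
    stats.columns = ['{}_{}_by_{}'.format(num_col, s, cat_col) for s in stats.columns]
    return df.merge(stats, how='left', left_on=cat_col, right_index=True)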


Simply by combining the existing categorical and numerical features effectively, as above, a large number of useful new features can be added. Combining this with basic feature-engineering methods such as linear combinations (useful only for decision trees) yields even more meaningful features, as follows:

# Feature construction: pairwise arithmetic combinations of the selected columns
epsilon = 1e-5
func_dict = {
    'add': lambda x, y: x + y,
    'mins': lambda x, y: x - y,
    'div': lambda x, y: x / (y + epsilon),  # epsilon avoids division by zero
    'multi': lambda x, y: x * y,
}

def auto_features_make(train_data, test_data, func_dict, col_list):
    train_data, test_data = train_data.copy(), test_data.copy()
    for col_i in col_list:
        for col_j in col_list:
            for func_name, func in func_dict.items():
                for data in [train_data, test_data]:
                    # Apply the same derived feature to both the train and test sets
                    func_features = func(data[col_i], data[col_j])
                    col_func_features = '-'.join([col_i, func_name, col_j])
                    data[col_func_features] = func_features
    return train_data, test_data
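A possible call, assuming the train/test DataFrames exist and that V0, V1, V2 are names of numeric columns (the column list is an assumption):

# Hypothetical usage: build pairwise combination features for a few selected columns
train_data_fe, test_data_fe = auto_features_make(
    train_data, test_data, func_dict, col_list=['V0', 'V1', 'V2'])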

1.3 Use decision trees to create new features
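One common version of this idea is to train a tree ensemble and use the index of the leaf each sample falls into as a new (one-hot encoded) feature. A minimal sketch with scikit-learn, assuming prepared X_train and y_train arrays:

# Sketch: derive new features from the leaves of a gradient boosting model
# (X_train / y_train are assumed to be prepared numeric arrays)
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import OneHotEncoder

gbdt = GradientBoostingRegressor(n_estimators=50, max_depth=3)
gbdt.fit(X_train, y_train)

leaf_index = gbdt.apply(X_train)                    # shape: (n_samples, n_estimators)
encoder = OneHotEncoder()
leaf_features = encoder.fit_transform(leaf_index)   # sparse one-hot leaf features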

1.4 Feature Combination

Feature combinations (feature crosses) are composite features formed by combining individual features, by multiplication or Cartesian product; they help represent non-linear relationships.

Encoding non-linear patterns:
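A minimal sketch, assuming a numeric feature matrix X: pairwise product features let a linear model express interaction patterns it cannot capture from the raw columns alone.

# Sketch: generate pairwise product (interaction) features with scikit-learn
from sklearn.preprocessing import PolynomialFeatures

cross = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_crossed = cross.fit_transform(X)   # original columns plus their pairwise products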


Combined one-hot vectors:
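A rough sketch, assuming a DataFrame with hypothetical categorical columns cat_a and cat_b: crossing them amounts to one-hot encoding their Cartesian product.

# Sketch: cross two categorical columns, then one-hot encode the combined feature
# (cat_a and cat_b are hypothetical column names)
import pandas as pd

df['cat_a_x_cat_b'] = df['cat_a'].astype(str) + '_' + df['cat_b'].astype(str)
crossed_onehot = pd.get_dummies(df['cat_a_x_cat_b'], prefix='cross')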

Train the model using bucketed feature columns:

Bucket feature: continuous numerical features are divided into different buckets (bins) in a certain way; this can be understood as a discretization method for continuous features.

For example, we can divide the population feature of a certain place into the following three buckets:

bucket_0 (< 5000): corresponds to neighborhoods with a small population.
bucket_1 (5000~25000): corresponds to neighborhoods with a moderate population.
bucket_2 (> 25000): corresponds to neighborhoods with a large population.
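A minimal sketch of this bucketing with pandas, assuming a hypothetical population column:

# Sketch: bucket a continuous population feature into the three intervals above
import pandas as pd

df['population_bucket'] = pd.cut(
    df['population'],                        # hypothetical column name
    bins=[0, 5000, 25000, float('inf')],
    labels=['bucket_0', 'bucket_1', 'bucket_2'])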

Model fusion

1 Model optimization

After selecting a model and training it, how do we judge the model's quality and optimize its performance?

Generally, optimization can be carried out from the following aspects:

(1) Study the model's learning curve to judge whether the model is overfitting or underfitting, and adjust accordingly.

(2) Analyze the model's weight parameters; for features with particularly high or low absolute weights, work on them in more detail or build feature combinations.

(3) Perform Bad-Case analysis to determine whether the mis-predicted examples still leave room for correction and further mining.

(4) Perform model fusion.

1.1 Model Learning Curve

High bias indicates that the model is too simple and cannot learn the underlying patterns in the data; in this case the accuracy on both the training set and the validation set is low.

High variance indicates that the model is too complex and has learned too much: the accuracy on the training set is good, but generalization to the validation set is poor, the validation accuracy is low, and the gap between the two accuracies is relatively large.
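A minimal sketch of drawing such a learning curve with scikit-learn, assuming prepared X_train and y_train and using Ridge regression purely as a placeholder model:

# Sketch: learning curve to diagnose high bias vs. high variance
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import Ridge

train_sizes, train_scores, valid_scores = learning_curve(
    Ridge(), X_train, y_train, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='training score')
plt.plot(train_sizes, valid_scores.mean(axis=1), 'o-', label='validation score')
plt.xlabel('training set size')
plt.ylabel('score')
plt.legend()
plt.show()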

1.2 Model Fusion Improvement Technology

Model fusion first generates a group of individual learners and then combines them with a certain strategy to enhance the overall model effect.

According to the relationship between individual learners, model fusion techniques can be divided into two categories:

(1) Parallel methods, where there is no strong dependency between individual learners and they can be generated simultaneously; represented by the Bagging method and random forest.

(2) Sequential methods, where there is a strong dependency between individual learners and they must be generated serially; represented by the Boosting method.

Bagging methods and random forests:

The Bagging method samples the sub-training set required by each base model from the training set, and then aggregates the prediction results of all base models to produce the final prediction, as shown in Figure 1-7-3.

The Bagging method uses Bootstrap sampling: from the original training set of m samples, one sample is drawn at random, added to the sampling set, and then put back, so that the same sample may still be drawn in later rounds. Repeating this m times yields a sampling set of m samples. Because the sampling is random, each sampling set differs from the original training set and from the other sampling sets, so multiple different weak learners can be obtained.

Random forest is an improvement on the Bagging method, with two changes: the base learner is restricted to decision trees, and in addition to the sample perturbation of Bagging, perturbation is also added to the attributes, which is equivalent to introducing random attribute selection into decision tree learning. For each node of a base decision tree, a subset of k attributes is first randomly selected from that node's attribute set, and the optimal attribute for splitting is then chosen from this subset.
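As a rough sketch (X_train and y_train assumed), both ideas are available directly in scikit-learn:

# Sketch: Bagging (default base learner is a decision tree) vs. random forest
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

bagging = BaggingRegressor(n_estimators=100)                            # bootstrap-sampled trees
forest = RandomForestRegressor(n_estimators=100, max_features='sqrt')   # adds random attribute selection

bagging.fit(X_train, y_train)
forest.fit(X_train, y_train)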

# Fuse the three models LinearRegression, LGB, and RandomForestRegressor by searching
# over integer weights for a weighted average of their predictions.
# test_target and the three prediction arrays are assumed to be defined earlier.
import pandas as pd
from sklearn.metrics import mean_squared_error

def model_mix(pred_1, pred_2, pred_3):
    rows = []
    for a in range(10):
        for b in range(10):
            for c in range(1, 10):   # c >= 1 keeps the denominator non-zero
                test_pred = (a * pred_1 + b * pred_2 + c * pred_3) / (a + b + c)

                mse = mean_squared_error(test_target, test_pred)

                rows.append({'LinearRegression': a,
                             'LGB': b,
                             'RandomForestRegressor': c,
                             'Combine': mse})
    return pd.DataFrame(rows, columns=['LinearRegression', 'LGB',
                                       'RandomForestRegressor', 'Combine'])


model_combine = model_mix(linear_predict, LGB_predict, RandomForest_predict)

model_combine.sort_values(by='Combine', inplace=True)
print(model_combine.head())

Boosting method:

The training process of the Boosting method is stepwise: the base models are trained one by one in sequence (although the implementation can achieve some parallelism), the training set for each base model is transformed according to a certain strategy at every step, and the prediction results of all base models are then linearly combined to produce the final prediction, as shown in Figure 1-7-4.

The well-known algorithms in the Boosting family include the AdaBoost algorithm and the boosting tree series of algorithms. Among the boosting tree algorithms, the most widely used is the gradient boosting tree. These are briefly introduced below.

(1) AdaBoost algorithm: the binary classification algorithm obtained when the model is an additive model, the loss function is the exponential function, and the learning algorithm is the forward stagewise algorithm.

(2) Boosting tree: the algorithm obtained when the model is an additive model, the learning algorithm is the forward stagewise algorithm, and the base learner is restricted to a decision tree. For binary classification problems the loss function is the exponential function, which amounts to the AdaBoost algorithm with the base learner restricted to a binary decision tree; for regression problems the loss function is the squared error, and what is fitted is the residual of the current model.

(3) Gradient boosting tree: an improvement on the boosting tree algorithm. The boosting tree algorithm only suits exponential and squared-error loss functions; for a general loss function, the gradient boosting tree uses the value of the negative gradient of the loss function at the current model as an approximation of the residual.

# Grid-search a GradientBoostingRegressor and an XGBRegressor.
# train_model, metal_models, score_models, metal_x_train, metal_y_train and splits
# are assumed to be defined earlier in the workflow.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

model = 'GradientBoosting'
metal_models[model] = GradientBoostingRegressor()

param_grid = {'n_estimators': [150, 250, 350],
              'max_depth': [1, 2, 3],
              'min_samples_split': [5, 6, 7]}

metal_models[model], cv_score, grid_results = train_model(metal_models[model], param_grid=param_grid,
                                                          X=metal_x_train, y=metal_y_train,
                                                          splits=splits, repeats=1)

cv_score.name = model
score_models = pd.concat([score_models, cv_score.to_frame().T])  # record this model's CV score as a new row

model = 'XGB'
metal_models[model] = XGBRegressor()

param_grid = {'n_estimators': [100, 200, 300, 400, 500],
              'max_depth': [1, 2, 3],
              }

metal_models[model], cv_score, grid_results = train_model(metal_models[model], param_grid=param_grid,
                                                          X=metal_x_train, y=metal_y_train,
                                                          splits=splits, repeats=1)

cv_score.name = model
score_models = pd.concat([score_models, cv_score.to_frame().T])  # record this model's CV score as a new row

1.3 Prediction result fusion strategy

Voting:

Voting (the voting mechanism) comes in two forms: hard voting and soft voting. It follows the principle of majority rule and can be used for classification problems.

(1) Hard voting: the multiple models vote directly, and the class receiving the most votes is the final prediction.

(2) Soft voting: the same principle as hard voting, but with the ability to set weights, so that different models can be given different weights and thus different importance.
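A minimal sketch with scikit-learn's VotingClassifier, assuming prepared X_train and y_train for some classification task (the estimators and weights are arbitrary choices for illustration):

# Sketch: hard vs. soft voting over three classifiers
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

estimators = [('lr', LogisticRegression()),
              ('rf', RandomForestClassifier()),
              ('svc', SVC(probability=True))]   # probability=True is needed for soft voting

hard_vote = VotingClassifier(estimators, voting='hard')
soft_vote = VotingClassifier(estimators, voting='soft', weights=[1, 2, 1])  # per-model weights

hard_vote.fit(X_train, y_train)
soft_vote.fit(X_train, y_train)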

Averaging and Ranking:

The principle of Averaging is to take the mean of the model results as the final prediction; a weighted average can also be used. A problem remains: if the prediction results of the different regression models fluctuate over very different ranges, the results with small fluctuations play a relatively small role in the fusion.

The idea of Ranking is consistent with Averaging. Because the simple average above has this problem, the average of the rankings is used instead: if weights are given, the final result is the weighted sum of the rank ratios of the n models.
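A rough sketch of both strategies, assuming three prediction arrays pred_1, pred_2, pred_3 of equal length and arbitrary illustrative weights:

# Sketch: weighted averaging vs. rank averaging of three regression predictions
import pandas as pd

preds = [pred_1, pred_2, pred_3]          # assumed prediction arrays
weights = [0.3, 0.3, 0.4]

# Weighted average of the raw predictions
avg_pred = sum(w * p for w, p in zip(weights, preds))

# Rank averaging: convert each model's predictions to rank ratios, then combine
ranks = [pd.Series(p).rank(pct=True).values for p in preds]
rank_pred = sum(w * r for w, r in zip(weights, ranks))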

Blending:

Blending first divides the original training set into two parts, for example 70% of the data as the new training set and the remaining 30% as a hold-out set.

In the first layer, the 70% of the data is used to train multiple models, which then predict the labels of the remaining 30%. In the second layer, the predictions on that 30% are used as new features to train the second-layer model.

Advantages of Blending: it is simpler than Stacking (no k-fold cross-validation is needed to obtain the stacker features), and it avoids some information leakage because the generalizers and the stacker use different data.

Disadvantages of Blending:
(1) Very little data is used (the second-stage blender only sees the hold-out portion of the training set, 30% in the split above).
(2) The blender may overfit.

Note: In terms of practical results, Stacking and Blending have similar effects.
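A rough sketch of the blending procedure described above, assuming X and y are the full training data and using two arbitrary base models:

# Sketch of blending (X, y assumed to be the full training data)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# First layer: train base models on 70% of the data
X_base, X_hold, y_base, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)
base_models = [RandomForestRegressor(), GradientBoostingRegressor()]
hold_preds = []
for m in base_models:
    m.fit(X_base, y_base)
    hold_preds.append(m.predict(X_hold))

# Second layer: the hold-out predictions become the new features
blend_features = np.column_stack(hold_preds)
blender = LinearRegression().fit(blend_features, y_hold)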

Stacking:

The basic principle of Stacking is to use all the trained base models to predict the training set: the prediction of the j-th base model for the i-th training sample becomes the j-th feature of the i-th sample in the new training set, and a model is then trained on this new training set. Likewise, at prediction time the test set must first be passed through all base models to form a new test set, on which the final prediction is made, as shown in Figure 1-7-5.

Stacking is a hierarchical model ensembling framework. Taking two layers as an example: the first layer consists of multiple base learners whose input is the original training set; the second-layer model is trained using the outputs of the first-layer base learners as its training set, which yields the complete Stacking model.

Both layers of the Stacking model use the entire training set.

Note: during the Stacking process, if the first-layer model's predictions are merged with the original features to train the second-layer model, the model can be more effective and over-fitting can be prevented.
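A minimal sketch with scikit-learn's StackingRegressor, assuming prepared X_train and y_train; passthrough=True corresponds to the note above about merging the original features into the second layer:

# Sketch: two-layer stacking with scikit-learn
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

stack = StackingRegressor(
    estimators=[('rf', RandomForestRegressor()),
                ('gbdt', GradientBoostingRegressor())],
    final_estimator=LinearRegression(),
    passthrough=True,   # also feed the original features to the second layer
    cv=5)               # out-of-fold predictions form the new training set

stack.fit(X_train, y_train)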

1.4 Other improvement methods

By analyzing model weights or feature importances, you can accurately locate the important data fields and the related feature directions, keep refining in those directions, look for more data there, and build related feature combinations, all of which can improve model performance.
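A minimal sketch, assuming a fitted tree-based model forest and a list feature_names from earlier steps:

# Sketch: rank features by importance to decide where to keep digging
import pandas as pd

importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))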

Through Bad-Case analysis, you can effectively find the sample points with inaccurate predictions, trace the data back to analyze the causes, and thus find ways to improve the model's accuracy.
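A rough sketch of such an analysis, assuming validation targets y_valid and predictions valid_pred:

# Sketch: Bad-Case analysis, listing the samples with the largest prediction errors
import pandas as pd

bad_cases = pd.DataFrame({'target': y_valid, 'pred': valid_pred})
bad_cases['abs_error'] = (bad_cases['target'] - bad_cases['pred']).abs()
print(bad_cases.sort_values('abs_error', ascending=False).head(10))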