Demonstrated on a 64-bit Windows 10 system
1. Preface
In this issue, we introduce CatBoost regression.
Once again, we use the following data:
We demonstrate with public data from a 2015 article in “PLoS One” entitled “Comparison of Two Hybrid Models for Forecasting the Incidence of Hemorrhagic Fever with Renal Syndrome in Jiangsu Province, China”. The data are the monthly incidence rates of hemorrhagic fever with renal syndrome in Jiangsu Province from January 2004 to December 2012. Data from January 2004 to December 2011 are used to predict the incidence for the 12 months of 2012.
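As a rough illustration of the data layout assumed by the code later in this article (the column names 'time' and 'incidence' come from that code; the values below are synthetic, not the real PLoS One data), the monthly series can be loaded and split by date like this:

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for data.csv: monthly dates in '%b-%y' style plus a fake incidence rate
dates = pd.date_range('2004-01-01', '2012-12-01', freq='MS')
data = pd.DataFrame({
    'time': dates.strftime('%b-%y'),  # e.g. 'Jan-04'
    'incidence': np.random.default_rng(0).uniform(0.01, 0.2, len(dates)),
})

# Parse the 'Mon-YY' strings back into datetimes, as the article's code does
data['time'] = pd.to_datetime(data['time'], format='%b-%y')

# Train on 2004-2011, validate on the 12 months of 2012
train = data[data['time'] <= '2011-12-31']
valid = data[data['time'] >= '2012-01-01']
print(len(train), len(valid))  # 96 12
```

The split yields 96 training months and 12 validation months, matching the design described above.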
2. CatBoost regression
(1) Parameter interpretation
Whether the task is regression or classification, most of CatBoost’s parameters are shared; however, because the tasks differ in nature, some parameters are only meaningful for one of them.
Here is a brief overview of the key parameters:
(a) General parameters:
learning_rate: The learning rate, which determines the step size of each boosting step. Commonly used values are 0.01, 0.03, 0.1, etc.
iterations: The number of trees (boosting iterations).
depth: The depth of each tree.
l2_leaf_reg: Coefficient of the L2 regularization term.
cat_features: List of column indices of categorical features.
loss_function: The loss function. For classification, common choices are Logloss (binary) or MultiClass (multiclass); for regression, RMSE is common.
border_count: The number of splits (bins) for numerical features. Higher values may lead to overfitting, lower values to underfitting.
verbose: Controls how much of the training log is displayed.
(b) Parameters specific to classification:
classes_count: The number of classes in multiclass tasks.
class_weights: Weights for the individual classes, used for imbalanced classification tasks.
auto_class_weights: Method for computing class weights automatically, for handling class imbalance.
(c) Parameters specific to regression:
CatBoost has few regression-only parameters; in practice the task is mainly distinguished by the choice of loss_function (e.g., RMSE, MAE, Quantile). Note that scale_pos_weight, although sometimes listed here, is actually the weight of the positive class in binary classification and does not apply to regression.
(d) Similarities and differences:
Similarities: Most parameters (such as learning_rate, depth, l2_leaf_reg, etc.) are shared between regression and classification tasks, with the same meaning and effect.
Differences: The loss_function is determined by the task (regression or classification). Furthermore, parameters such as classes_count and class_weights are only meaningful in classification, while scale_pos_weight applies to binary classification rather than regression.
Also, when using CatBoost, it is recommended to consult the official documentation, as the library is updated frequently and new parameters or features may be added. The URL is as follows:
https://catboost.ai/docs/
(2) Single-step rolling prediction
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from catboost import CatBoostRegressor
from sklearn.model_selection import GridSearchCV

# Read data
data = pd.read_csv('data.csv')

# Convert the time column to date format
data['time'] = pd.to_datetime(data['time'], format='%b-%y')

# Create lag features
lag_period = 6
for i in range(lag_period, 0, -1):
    data[f'lag_{i}'] = data['incidence'].shift(lag_period - i + 1)

# Drop rows containing NaN
data = data.dropna().reset_index(drop=True)

# Split into training and validation sets
train_data = data[(data['time'] >= '2004-01-01') & (data['time'] <= '2011-12-31')]
validation_data = data[(data['time'] >= '2012-01-01') & (data['time'] <= '2012-12-31')]

# Define features and target variable
X_train = train_data[['lag_1', 'lag_2', 'lag_3', 'lag_4', 'lag_5', 'lag_6']]
y_train = train_data['incidence']
X_validation = validation_data[['lag_1', 'lag_2', 'lag_3', 'lag_4', 'lag_5', 'lag_6']]
y_validation = validation_data['incidence']

# Initialize the CatBoostRegressor model
catboost_model = CatBoostRegressor(verbose=0)

# Define the parameter grid
param_grid = {
    'iterations': [50, 100, 150],
    'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
    'depth': [4, 6, 8],
    'loss_function': ['RMSE']
}

# Initialize and run the grid search
grid_search = GridSearchCV(catboost_model, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Get the best parameters and refit with them on the training set
best_params = grid_search.best_params_
best_catboost_model = CatBoostRegressor(**best_params, verbose=0)
best_catboost_model.fit(X_train, y_train)

# On the validation set, predict each point iteratively:
# shift the lag window by one and feed the previous prediction
# in as the newest feature
y_validation_pred = []
for i in range(len(X_validation)):
    if i == 0:
        pred = best_catboost_model.predict([X_validation.iloc[0]])
    else:
        new_features = list(X_validation.iloc[i, 1:]) + [pred[0]]
        pred = best_catboost_model.predict([new_features])
    y_validation_pred.append(pred[0])

y_validation_pred = np.array(y_validation_pred)

# MAE, MAPE, MSE and RMSE on the validation set
mae_validation = mean_absolute_error(y_validation, y_validation_pred)
mape_validation = np.mean(np.abs((y_validation - y_validation_pred) / y_validation))
mse_validation = mean_squared_error(y_validation, y_validation_pred)
rmse_validation = np.sqrt(mse_validation)

# MAE, MAPE, MSE and RMSE on the training set
y_train_pred = best_catboost_model.predict(X_train)
mae_train = mean_absolute_error(y_train, y_train_pred)
mape_train = np.mean(np.abs((y_train - y_train_pred) / y_train))
mse_train = mean_squared_error(y_train, y_train_pred)
rmse_train = np.sqrt(mse_train)

print("Train Metrics:", mae_train, mape_train, mse_train, rmse_train)
print("Validation Metrics:", mae_validation, mape_validation, mse_validation, rmse_validation)
See the results:
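The rolling logic in the code above (each new prediction is fed back in as the newest lag for the next step) can be isolated with a stub model. This is a fully recursive variant for illustration; the article's loop feeds back only the newest lag and keeps the remaining lags from the actual validation data. The predict_stub function is a hypothetical stand-in, not CatBoost:

```python
import numpy as np

def predict_stub(features):
    # Hypothetical stand-in for best_catboost_model.predict: the mean of the lags
    return np.array([np.mean(features)])

# The last 6 observed values form the initial lag window (oldest -> newest)
window = [0.10, 0.12, 0.11, 0.13, 0.12, 0.14]
preds = []
for step in range(12):  # 12 validation months
    pred = predict_stub(window)
    preds.append(pred[0])
    # Slide the window: drop the oldest lag, append the new prediction as the newest
    window = window[1:] + [pred[0]]

print(len(preds))  # 12
```

After the first step the window no longer contains only observed values, which is why errors can compound over a long rolling horizon.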
(3) Multi-step rolling prediction-vol. 1
For CatBoost regression, the target variable y_train cannot be a multi-column DataFrame: CatBoostRegressor expects a one-dimensional target, so this variant does not apply directly.
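Because the target must be one-dimensional, a direct multi-step strategy trains m separate models, one per horizon, each with its own 1-D shifted target. A minimal sketch of building those targets on a toy series (names and values are illustrative, not from the article's data):

```python
import pandas as pd

series = pd.Series(range(10), dtype=float, name='incidence')  # toy series 0..9
n, m = 3, 2  # n lags per row, m forecast horizons

# Direct strategy: one 1-D target per horizon. The model for horizon h maps
# the n lags ending at time t to the observed value at t + h.
targets = [series.shift(-(n + h)).dropna().reset_index(drop=True) for h in range(m)]

print([t.tolist()[:3] for t in targets])  # [[3.0, 4.0, 5.0], [4.0, 5.0, 6.0]]
```

Each element of targets is a 1-D Series, which is exactly the shape CatBoostRegressor accepts; this is the idea behind the per-horizon models in variant 3 below.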
(4) Multi-step rolling prediction-vol. 2
Same as above.
(5) Multi-step rolling prediction-vol. 3
import pandas as pd
import numpy as np
from catboost import CatBoostRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Data reading and preprocessing
data = pd.read_csv('data.csv')
data_y = pd.read_csv('data.csv')
data['time'] = pd.to_datetime(data['time'], format='%b-%y')
data_y['time'] = pd.to_datetime(data_y['time'], format='%b-%y')

n = 6
for i in range(n, 0, -1):
    data[f'lag_{i}'] = data['incidence'].shift(n - i + 1)
data = data.dropna().reset_index(drop=True)

train_data = data[(data['time'] >= '2004-01-01') & (data['time'] <= '2011-12-31')]
X_train = train_data[[f'lag_{i}' for i in range(1, n + 1)]]

# Build one shifted 1-D target per forecast horizon
m = 3
X_train_list = []
y_train_list = []
for i in range(m):
    X_temp = X_train
    y_temp = data_y['incidence'].iloc[n + i:len(data_y) - m + 1 + i]
    X_train_list.append(X_temp)
    y_train_list.append(y_temp)

for i in range(m):
    X_train_list[i] = X_train_list[i].iloc[:-(m - 1)]
    y_train_list[i] = y_train_list[i].iloc[:len(X_train_list[i])]

# Model training: one grid-searched CatBoostRegressor per horizon
param_grid = {
    'iterations': [50, 100, 150],
    'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
    'depth': [4, 6, 8]
}
best_catboost_models = []
for i in range(m):
    grid_search = GridSearchCV(CatBoostRegressor(verbose=0), param_grid,
                               cv=5, scoring='neg_mean_squared_error')
    grid_search.fit(X_train_list[i], y_train_list[i])
    best_catboost_model = CatBoostRegressor(**grid_search.best_params_, verbose=0)
    best_catboost_model.fit(X_train_list[i], y_train_list[i])
    best_catboost_models.append(best_catboost_model)

# Predict on the validation period with each horizon model
validation_start_time = train_data['time'].iloc[-1] + pd.DateOffset(months=1)
validation_data = data[data['time'] >= validation_start_time]
X_validation = validation_data[[f'lag_{i}' for i in range(1, n + 1)]]
y_validation_pred_list = [model.predict(X_validation) for model in best_catboost_models]
y_train_pred_list = [model.predict(X_train_list[i])
                     for i, model in enumerate(best_catboost_models)]

# Interleave the m per-horizon predictions back into a single series
def concatenate_predictions(pred_list):
    concatenated = []
    for j in range(len(pred_list[0])):
        for i in range(m):
            concatenated.append(pred_list[i][j])
    return concatenated

y_validation_pred = np.array(concatenate_predictions(y_validation_pred_list))[:len(validation_data['incidence'])]
y_train_pred = np.array(concatenate_predictions(y_train_pred_list))[:len(train_data['incidence']) - m + 1]

# MAE, MAPE, MSE and RMSE on the validation set
mae_validation = mean_absolute_error(validation_data['incidence'], y_validation_pred)
mape_validation = np.mean(np.abs((validation_data['incidence'] - y_validation_pred) / validation_data['incidence']))
mse_validation = mean_squared_error(validation_data['incidence'], y_validation_pred)
rmse_validation = np.sqrt(mse_validation)
print("Validation set:", mae_validation, mape_validation, mse_validation, rmse_validation)

# MAE, MAPE, MSE and RMSE on the training set
mae_train = mean_absolute_error(train_data['incidence'][:-(m-1)], y_train_pred)
mape_train = np.mean(np.abs((train_data['incidence'][:-(m-1)] - y_train_pred) / train_data['incidence'][:-(m-1)]))
mse_train = mean_squared_error(train_data['incidence'][:-(m-1)], y_train_pred)
rmse_train = np.sqrt(mse_train)
print("Training set:", mae_train, mape_train, mse_train, rmse_train)
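The concatenate_predictions helper above stitches the m per-horizon outputs back into a single series by taking one value from each horizon model in turn. With toy lists (illustrative values only), the effect looks like this:

```python
m = 3
# Toy per-horizon outputs: model i holds the predictions for horizons i, i+m, ...
pred_list = [[10, 40], [20, 50], [30, 60]]

def concatenate_predictions(pred_list):
    # For each row j, take one value from each of the m horizon models
    concatenated = []
    for j in range(len(pred_list[0])):
        for i in range(m):
            concatenated.append(pred_list[i][j])
    return concatenated

print(concatenate_predictions(pred_list))  # [10, 20, 30, 40, 50, 60]
```

The trailing slices in the main code then trim this interleaved series to the length of the observed data before computing the error metrics.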
Results:
3. Data
Link: https://pan.baidu.com/s/1EFaWfHoG14h15KCEhn1STg?pwd=q41n
Extraction code: q41n