Demonstration on a 64-bit Windows 10 system
1. Foreword
In this issue, we introduce LightGBM regression.
As before, we use the same data:
The public data come from a 2015 article in “PLoS One” entitled “Comparison of Two Hybrid Models for Forecasting the Incidence of Hemorrhagic Fever with Renal Syndrome in Jiangsu Province, China”. The data are the monthly incidence rate of hemorrhagic fever with renal syndrome in Jiangsu Province from January 2004 to December 2012. Data from January 2004 to December 2011 are used to predict the monthly incidence for the 12 months of 2012.
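For reference, the code in this post assumes data.csv has two columns: time (month strings like 'Jan-04', parsed with format='%b-%y') and incidence. A minimal sanity check, under that assumption, might look like:

import pandas as pd

# data.csv is assumed to have columns 'time' (e.g. 'Jan-04') and 'incidence'
data = pd.read_csv('data.csv')
data['time'] = pd.to_datetime(data['time'], format='%b-%y')
print(data.head())                              # inspect the first rows
print(data['time'].min(), data['time'].max())   # should span 2004-01 to 2012-12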
2. LightGBM regression
(1) Parameter interpretation
LightGBM can handle both classification and regression tasks, and most parameters are common between the two tasks, but some are specific to the task type. The following are the similarities and differences in parameter settings between the two tasks:
(a) Similarities:
Core parameters: such as boosting_type, num_boost_round, learning_rate, etc.
Learning Control Parameters: These control how the decision tree is structured and fitted. For example, max_depth, num_leaves, min_data_in_leaf, feature_fraction, bagging_fraction, lambda_l1, lambda_l2, etc.
IO parameters: Parameters that control input and output, such as max_bin and min_data_in_bin.
Other parameters: such as verbosity, boost_from_average, etc.
(b) Differences:
Objective function (objective or application parameter):
Classification: you can choose binary (two-class) or multiclass (multi-class). For multi-class classification, the num_class parameter must also be set to the number of classes.
Regression: the regression objective (L2 loss) is usually chosen; other regression objectives include regression_l1, huber, fair, etc.
(c) Evaluation metric (metric parameter):
Classification: for example binary_logloss, binary_error, multi_logloss, multi_error, etc.
Regression: For example, l2 (MSE), l1 (MAE), mape, etc.
(d) Class weight (class_weight parameter):
Classification: When the categories of the data set are unbalanced, you can use this parameter to set different weights for each category.
Regression: This parameter is generally not applicable.
(e) Other specific parameters:
Classification: For example, scale_pos_weight can be used to handle very unbalanced binary classification problems.
Regression: These specific parameters are usually not required.
In summary, although classification and regression share most LightGBM parameters, their objective functions and evaluation metrics differ and must be set according to the task at hand.
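To make the comparison concrete, here is a minimal sketch using LightGBM's scikit-learn API (the same API used in the code below). The parameter values are illustrative, not tuned; note that the sklearn wrapper uses aliases for some native parameter names (n_estimators for num_boost_round, min_child_samples for min_data_in_leaf, reg_alpha/reg_lambda for lambda_l1/lambda_l2):

from lightgbm import LGBMClassifier, LGBMRegressor

# Learning-control parameters shared by both tasks (illustrative values, not tuned)
common = dict(
    boosting_type='gbdt',
    n_estimators=100,
    learning_rate=0.05,
    num_leaves=31,
    min_child_samples=20,
)

# Classification: binary objective; class_weight addresses unbalanced classes
clf = LGBMClassifier(objective='binary', class_weight='balanced', **common)

# Regression: the default 'regression' objective (L2 loss); class weights do not apply
reg = LGBMRegressor(objective='regression', **common)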
(2) Single-step rolling prediction
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV

# Read data
data = pd.read_csv('data.csv')

# Convert time column to date format
data['time'] = pd.to_datetime(data['time'], format='%b-%y')

# Create lag features
lag_period = 6
for i in range(lag_period, 0, -1):
    data[f'lag_{i}'] = data['incidence'].shift(lag_period - i + 1)

# Delete rows containing NaN
data = data.dropna().reset_index(drop=True)

# Divide training set and validation set
train_data = data[(data['time'] >= '2004-01-01') & (data['time'] <= '2011-12-31')]
validation_data = data[(data['time'] >= '2012-01-01') & (data['time'] <= '2012-12-31')]

# Define features and target variables
X_train = train_data[['lag_1', 'lag_2', 'lag_3', 'lag_4', 'lag_5', 'lag_6']]
y_train = train_data['incidence']
X_validation = validation_data[['lag_1', 'lag_2', 'lag_3', 'lag_4', 'lag_5', 'lag_6']]
y_validation = validation_data['incidence']

# Initialize LGBMRegressor model
lgbm_model = LGBMRegressor()

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
    'num_leaves': [31, 50, 75],
    'boosting_type': ['gbdt', 'dart', 'goss']
}

# Initialize grid search
grid_search = GridSearchCV(lgbm_model, param_grid, cv=5, scoring='neg_mean_squared_error')

# Perform grid search
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_

# Initialize the LGBMRegressor model with the optimal parameters
best_lgbm_model = LGBMRegressor(**best_params)

# Train the model on the training set
best_lgbm_model.fit(X_train, y_train)

# For the validation set, iteratively predict each data point
y_validation_pred = []
for i in range(len(X_validation)):
    if i == 0:
        pred = best_lgbm_model.predict([X_validation.iloc[0]])
    else:
        new_features = list(X_validation.iloc[i, 1:]) + [pred[0]]
        pred = best_lgbm_model.predict([new_features])
    y_validation_pred.append(pred[0])

y_validation_pred = np.array(y_validation_pred)

# Calculate MAE, MAPE, MSE and RMSE on the validation set
mae_validation = mean_absolute_error(y_validation, y_validation_pred)
mape_validation = np.mean(np.abs((y_validation - y_validation_pred) / y_validation))
mse_validation = mean_squared_error(y_validation, y_validation_pred)
rmse_validation = np.sqrt(mse_validation)

# Calculate MAE, MAPE, MSE and RMSE on the training set
y_train_pred = best_lgbm_model.predict(X_train)
mae_train = mean_absolute_error(y_train, y_train_pred)
mape_train = np.mean(np.abs((y_train - y_train_pred) / y_train))
mse_train = mean_squared_error(y_train, y_train_pred)
rmse_train = np.sqrt(mse_train)

print("Train Metrics:", mae_train, mape_train, mse_train, rmse_train)
print("Validation Metrics:", mae_validation, mape_validation, mse_validation, rmse_validation)
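A note on the rolling loop above: because the lag columns are built with lag_i = shift(lag_period - i + 1), lag_6 holds last month's value and lag_1 the value from six months back. At each validation step, list(X_validation.iloc[i, 1:]) drops the oldest lag and the previous prediction pred[0] is appended as the newest lag, so from the second validation month onward the model's own output is fed back into the feature window in place of the most recent observation. This feedback is what makes the scheme a rolling single-step prediction.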
See the results (screenshot omitted).
(3) Multi-step rolling prediction-vol. 1
For LGBMRegressor, the target variable y_train cannot be a multi-column DataFrame: the model is single-output, so this multi-output variant is skipped.
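If a multi-column target is nevertheless wanted, one common workaround (not used in this series; shown only as a sketch, with y_train_multi a hypothetical DataFrame of per-horizon targets) is scikit-learn's MultiOutputRegressor, which internally fits one LGBMRegressor per target column:

from lightgbm import LGBMRegressor
from sklearn.multioutput import MultiOutputRegressor

# y_train_multi is hypothetical: a DataFrame with one column per forecast
# horizon (e.g. t+1, t+2, t+3), which LGBMRegressor alone rejects
multi_model = MultiOutputRegressor(LGBMRegressor())
# multi_model.fit(X_train, y_train_multi)     # one internal model per column
# y_pred = multi_model.predict(X_validation)  # shape (n_samples, n_horizons)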
(4) Multi-step rolling prediction-vol. 2
Same as above: the single-output restriction rules out this variant as well.
(5) Multi-step rolling prediction-vol. 3
import pandas as pd
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Data reading and preprocessing
data = pd.read_csv('data.csv')
data_y = pd.read_csv('data.csv')
data['time'] = pd.to_datetime(data['time'], format='%b-%y')
data_y['time'] = pd.to_datetime(data_y['time'], format='%b-%y')

# Create lag features
n = 6
for i in range(n, 0, -1):
    data[f'lag_{i}'] = data['incidence'].shift(n - i + 1)

data = data.dropna().reset_index(drop=True)

# Training set: January 2004 to December 2011
train_data = data[(data['time'] >= '2004-01-01') & (data['time'] <= '2011-12-31')]
X_train = train_data[[f'lag_{i}' for i in range(1, n + 1)]]

# Build one target series per forecast horizon (m steps ahead)
m = 3
X_train_list = []
y_train_list = []
for i in range(m):
    X_temp = X_train
    y_temp = data_y['incidence'].iloc[n + i:len(data_y) - m + 1 + i]
    X_train_list.append(X_temp)
    y_train_list.append(y_temp)

# Align feature and target lengths
for i in range(m):
    X_train_list[i] = X_train_list[i].iloc[:-(m - 1)]
    y_train_list[i] = y_train_list[i].iloc[:len(X_train_list[i])]

# Model training: grid search, then one model per horizon
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
    'boosting_type': ['gbdt', 'dart', 'goss'],
    'num_leaves': [31, 63, 127]
}

best_lgbm_models = []
for i in range(m):
    grid_search = GridSearchCV(LGBMRegressor(), param_grid, cv=5, scoring='neg_mean_squared_error')
    grid_search.fit(X_train_list[i], y_train_list[i])
    best_lgbm_model = LGBMRegressor(**grid_search.best_params_)
    best_lgbm_model.fit(X_train_list[i], y_train_list[i])
    best_lgbm_models.append(best_lgbm_model)

# Validation set starts the month after the training set ends
validation_start_time = train_data['time'].iloc[-1] + pd.DateOffset(months=1)
validation_data = data[data['time'] >= validation_start_time]
X_validation = validation_data[[f'lag_{i}' for i in range(1, n + 1)]]

# Predict with each horizon model
y_validation_pred_list = [model.predict(X_validation) for model in best_lgbm_models]
y_train_pred_list = [model.predict(X_train_list[i]) for i, model in enumerate(best_lgbm_models)]

# Interleave the per-horizon predictions into one series
def concatenate_predictions(pred_list):
    concatenated = []
    for j in range(len(pred_list[0])):
        for i in range(m):
            concatenated.append(pred_list[i][j])
    return concatenated

y_validation_pred = np.array(concatenate_predictions(y_validation_pred_list))[:len(validation_data['incidence'])]
y_train_pred = np.array(concatenate_predictions(y_train_pred_list))[:len(train_data['incidence']) - m + 1]

# Validation metrics: MAE, MAPE, MSE, RMSE
mae_validation = mean_absolute_error(validation_data['incidence'], y_validation_pred)
mape_validation = np.mean(np.abs((validation_data['incidence'] - y_validation_pred) / validation_data['incidence']))
mse_validation = mean_squared_error(validation_data['incidence'], y_validation_pred)
rmse_validation = np.sqrt(mse_validation)
print("Validation set:", mae_validation, mape_validation, mse_validation, rmse_validation)

# Training metrics
mae_train = mean_absolute_error(train_data['incidence'][:-(m - 1)], y_train_pred)
mape_train = np.mean(np.abs((train_data['incidence'][:-(m - 1)] - y_train_pred) / train_data['incidence'][:-(m - 1)]))
mse_train = mean_squared_error(train_data['incidence'][:-(m - 1)], y_train_pred)
rmse_train = np.sqrt(mse_train)
print("Training set:", mae_train, mape_train, mse_train, rmse_train)
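Unlike the single-step scheme in (2), this is a direct multi-step strategy: with m = 3, three separate LGBMRegressor models are trained, where model i is fitted against the target i + 1 months ahead of the feature window (y_temp starts at index n + i). concatenate_predictions then interleaves the three models' outputs into a single forecast series, and the trailing slice truncates it to the length of the actual data. One practical consequence: the grid search runs once per horizon, so tuning cost grows linearly with m.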
Result (screenshot omitted).
3. Data
Link: https://pan.baidu.com/s/1EFaWfHoG14h15KCEhn1STg?pwd=q41n
Extraction code: q41n