CatBoost vs. LightGBM vs. XGBoost: differences in structure, practice, and performance

Boosting algorithms are a class of machine learning algorithms that build a strong classifier by iteratively training a series of weak classifiers (usually decision trees). In each round of iterations, new classifiers are designed to correct the errors of previous rounds of classifiers, thereby gradually improving the overall classification performance.

Despite the rise and popularity of neural networks, boosting algorithms remain very practical: they still perform well with limited training data, short training time, and little parameter-tuning expertise.

Boosting algorithms include AdaBoost, CatBoost, LightGBM, XGBoost, etc.

In this article, we will focus on CatBoost, LightGBM, and XGBoost, covering:

  • Structural differences;

  • How each algorithm handles categorical variables;

  • Understanding the parameters;

  • Practice on a dataset;

  • The performance of each algorithm.

Article from: https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db


Since XGBoost (often referred to as GBM Killer) has been around in machine learning for a long time and there are many articles dedicated to it, this article will focus more on CatBoost and LGBM.

1. Structural differences between LightGBM and XGBoost

LightGBM uses a novel Gradient-based One-Side Sampling (GOSS) technique to filter data instances when looking for split values, while XGBoost uses a pre-sorted algorithm and a histogram-based algorithm to compute the optimal split.

The instances above refer to observations/samples.

First, let’s understand how XGBoost’s pre-sorted splitting works (a toy sketch follows the list):

  • For each node, enumerate all features;

  • For each feature, sort the instances by feature value;

  • Use a linear scan to determine the best split on the feature based on information gain;

  • Select the best splitting solution among all features.
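To make these steps concrete, here is a toy NumPy sketch of the pre-sorted exact split search. It is not XGBoost’s actual implementation: the gain formula follows XGBoost’s second-order approximation, and the function name and reg_lambda term are illustrative choices.

import numpy as np

def best_split_presorted(X, grad, hess, reg_lambda=1.0):
    """Toy pre-sorted (exact greedy) split search.

    X: (n_samples, n_features) array; grad/hess: per-sample gradient and
    hessian of the loss, as in a second-order boosting approximation.
    Returns (best_gain, best_feature, best_threshold)."""
    n, d = X.shape
    G, H = grad.sum(), hess.sum()
    best = (0.0, None, None)
    for j in range(d):                       # enumerate all features
        order = np.argsort(X[:, j])          # sort instances by feature value
        g_left = h_left = 0.0
        for rank in range(n - 1):            # linear scan over the sorted values
            i = order[rank]
            g_left += grad[i]
            h_left += hess[i]
            # identical consecutive values allow no threshold between them
            if X[order[rank], j] == X[order[rank + 1], j]:
                continue
            g_right, h_right = G - g_left, H - h_left
            gain = (g_left**2 / (h_left + reg_lambda)
                    + g_right**2 / (h_right + reg_lambda)
                    - G**2 / (H + reg_lambda))
            if gain > best[0]:
                thr = (X[order[rank], j] + X[order[rank + 1], j]) / 2
                best = (gain, j, thr)
    return best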

In simple terms, histogram-based algorithms bucket all the values of a feature into discrete bins and use these bins to find the split value. Although this is more efficient in training speed than the pre-sorted algorithm, which must enumerate all possible split points over the pre-sorted feature values, it still lags behind GOSS in terms of speed.
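For comparison, here is a minimal sketch of the histogram idea on a single feature. It assumes equal-frequency bins purely for illustration and reuses the same second-order gain formula as the sketch above.

import numpy as np

def best_split_histogram(x, grad, hess, n_bins=255, reg_lambda=1.0):
    # Bucket the feature values into discrete bins (equal-frequency here).
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x)             # bin index per instance
    # Accumulate gradient/hessian statistics per bin in one pass over the data.
    g_hist = np.bincount(bins, weights=grad, minlength=n_bins)
    h_hist = np.bincount(bins, weights=hess, minlength=n_bins)
    G, H = g_hist.sum(), h_hist.sum()
    best_gain, best_bin = 0.0, None
    g_left = h_left = 0.0
    for b in range(n_bins - 1):                  # scan bin boundaries, not raw values
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = G - g_left, H - h_left
        gain = (g_left**2 / (h_left + reg_lambda)
                + g_right**2 / (h_right + reg_lambda)
                - G**2 / (H + reg_lambda))
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_gain, best_bin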

So, what makes the GOSS method efficient?

In AdaBoost, sample weights serve as a good indicator of sample importance. However, gradient boosted decision trees (GBDT) have no native sample weights, so AdaBoost's sampling method cannot be applied directly. This motivates gradient-based sampling.

The gradient represents the slope of the tangent to the loss function, so in a sense, if the gradient of the data points is large, these points are important for finding the best split point because they have a higher error.

GOSS keeps all instances with large gradients and randomly samples the instances with small gradients. For example, suppose I have 500,000 rows of data, 10,000 of which have large gradients. The algorithm will then select the 10k rows with large gradients plus x% of the remaining 490k rows chosen at random. Assuming x is 10%, 59k rows are selected in total, and the split value is found based on them.

The basic assumption here is that training instances with smaller gradients have smaller training errors and are already trained well. In order to maintain the same data distribution, when computing information gain, GOSS introduces a constant multiplier for data instances with smaller gradients. Therefore, GOSS achieves a good balance between reducing the number of data instances and maintaining the accuracy of learning decision trees.
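Here is a hedged sketch of the GOSS sampling step described above. LightGBM exposes the corresponding fractions as top_rate and other_rate in its GOSS boosting mode; the helper below only illustrates the idea and is not LightGBM's implementation.

import numpy as np

def goss_sample(grad, top_rate=0.2, other_rate=0.1, seed=0):
    """Keep the top_rate fraction of largest-gradient rows, randomly sample an
    other_rate fraction of the rest, and up-weight the sampled small-gradient
    rows by (1 - top_rate) / other_rate."""
    rng = np.random.default_rng(seed)
    n = len(grad)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    order = np.argsort(np.abs(grad))[::-1]       # sort by |gradient|, descending
    top_idx = order[:n_top]                      # always kept
    rest = order[n_top:]
    other_idx = rng.choice(rest, size=n_other, replace=False)  # random sample
    idx = np.concatenate([top_idx, other_idx])
    weights = np.ones(len(idx))
    # Constant multiplier on the small-gradient rows so the information gain
    # estimate stays (approximately) unbiased.
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return idx, weights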

In short, LGBM keeps growing the tree on the leaves with large gradients/errors.

2. How does each model handle categorical variables?

2.1 CatBoost

CatBoost offers the flexibility of specifying the indexes of categorical columns so that they can be one-hot encoded, controlled by the one_hot_max_size parameter (one-hot encoding is used for all features with at most that many distinct values).

If nothing is passed in the cat_features parameter, CatBoost will treat all columns as numeric variables.

Note: CatBoost will throw an error if a column containing string values is not listed in cat_features. In addition, columns of int type are treated as numeric by default; if you want them treated as categorical, you must specify them in cat_features.
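To illustrate that note, here is a minimal hedged sketch (the toy data and column names are my own) of declaring an int column as categorical via cat_features:

import pandas as pd
from catboost import CatBoostClassifier

# Toy frame: "store_id" is an int column that CatBoost would otherwise treat as numeric.
X = pd.DataFrame({"store_id": [1, 2, 1, 3, 2, 3],
                  "city": ["NY", "SF", "NY", "LA", "SF", "LA"],
                  "price": [9.5, 3.2, 8.1, 4.4, 3.9, 5.0]})
y = [1, 0, 1, 0, 0, 1]

# Without cat_features, the string column "city" would raise an error and
# "store_id" would be used as a plain number; listing both fixes that.
model = CatBoostClassifier(iterations=50, one_hot_max_size=5, verbose=0)
model.fit(X, y, cat_features=["store_id", "city"])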

For the remaining categorical columns, those whose number of unique categories exceeds one_hot_max_size, CatBoost uses an efficient encoding method similar to mean encoding but designed to reduce overfitting. The process is as follows:

  • Permute the set of input observations in a random order, generating multiple random permutations;

  • Convert label values from floats or categories to integers;

  • Convert all categorical feature values to numeric values using the following formula:

Here, countInClass is the number of times the label was equal to “1” among the preceding objects with the current categorical feature value; prior is a preliminary value for the numerator, determined by the starting parameters; and totalCount is the total number of objects before the current one that have a categorical feature value matching the current one.

Mathematically, it can be expressed by the following equation:

    encodedValue = (countInClass + prior) / (totalCount + 1)
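Here is a toy sketch of this ordered, permutation-driven encoding for a binary target. It uses a single permutation and an illustrative prior value, so it only approximates what CatBoost does internally.

import numpy as np

def ordered_target_encode(cat_values, labels, prior=0.05, seed=0):
    """Encode each row using only the rows that precede it in a random
    permutation: (countInClass + prior) / (totalCount + 1)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(cat_values))
    count_in_class = {}   # value -> number of preceding rows with label == 1
    total_count = {}      # value -> number of preceding rows with this value
    encoded = np.empty(len(cat_values))
    for pos in perm:
        v = cat_values[pos]
        cic = count_in_class.get(v, 0)
        tot = total_count.get(v, 0)
        encoded[pos] = (cic + prior) / (tot + 1)
        count_in_class[v] = cic + labels[pos]
        total_count[v] = tot + 1
    return encoded

# Tiny example
cats = ["a", "b", "a", "a", "b"]
y = [1, 0, 1, 0, 1]
print(ordered_target_encode(cats, y))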

2.2 LightGBM

Similar to CatBoost, LightGBM can also handle categorical features when given the feature names. It does not convert them to one-hot encodings and is much faster than one-hot encoding. LGBM uses a special algorithm to find split values for categorical features.

Note: before building the LGBM Dataset, categorical features should be converted to integer type. LGBM does not accept string values even if they are passed through the categorical_feature parameter.
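A short illustrative sketch of that conversion (toy data; Section 4 below does the same thing on the flights data with pandas category codes):

import lightgbm as lgb
import pandas as pd

df = pd.DataFrame({"carrier": ["AA", "DL", "AA", "UA"],
                   "dist":    [100.0, 250.0, 90.0, 400.0]})
y = [0, 1, 0, 1]

# Strings must be turned into integer codes first; passing raw strings
# through categorical_feature would raise an error.
df["carrier"] = df["carrier"].astype("category").cat.codes

dtrain = lgb.Dataset(df, label=y, categorical_feature=["carrier"])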

2.3 XGBoost

Unlike CatBoost or LGBM, XGBoost itself cannot handle categorical features, it only accepts numerical data similar to Random Forest. Therefore, various encodings such as label encoding, mean encoding, or one-hot encoding need to be performed before feeding categorical data to XGBoost.
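For example, here is a small hedged sketch of one-hot and label encoding with pandas before handing the data to XGBoost (the toy data and column names are illustrative):

import pandas as pd
import xgboost as xgb

df = pd.DataFrame({"airport": ["JFK", "LAX", "JFK", "SFO"],
                   "dist":    [2475.0, 0.0, 2475.0, 337.0]})
y = [1, 0, 1, 0]

# Option 1: one-hot encoding
X_onehot = pd.get_dummies(df, columns=["airport"], dtype=int)

# Option 2: label encoding (integer codes)
X_label = df.copy()
X_label["airport"] = X_label["airport"].astype("category").cat.codes

model = xgb.XGBClassifier(n_estimators=20)
model.fit(X_onehot, y)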

3. Understanding parameters

All of these models have many tunable parameters, but we only discuss the important ones. Keyed to the parameters tuned in Section 4, they map across the three models (XGBoost / LightGBM / CatBoost) roughly as follows:

  • Learning rate: learning_rate / learning_rate / learning_rate;

  • Tree depth: max_depth / max_depth / depth;

  • Number of trees: n_estimators / n_estimators / iterations;

  • Overfitting control: min_child_weight / num_leaves / l2_leaf_reg;

  • Categorical features: not supported natively / categorical_feature / cat_features and one_hot_max_size.

4. Implementation on the dataset

I used the Kaggle dataset of flight delays in 2015 because it contains both categorical and numerical features. With approximately 5 million rows of data, this dataset is good for evaluating the performance of each type of boosting model in terms of speed and accuracy. I’ll be using a 10% subset of this data, about 500,000 rows.

The following features are used for modeling:

  • MONTH, DAY, DAY_OF_WEEK: data type int

  • AIRLINE and FLIGHT_NUMBER: data type int

  • ORIGIN_AIRPORT and DESTINATION_AIRPORT: data type string

  • DEPARTURE_TIME: data type float

  • ARRIVAL_DELAY: This will be the target variable and converted to a boolean representing a delay of more than 10 minutes

  • DISTANCE and AIR_TIME: data type float

import pandas as pd, numpy as np, time
from sklearn.model_selection import train_test_split

data = pd.read_csv("./data/flights.csv")
data = data.sample(frac=0.1, random_state=10)  # use a 10% subset (~500k rows)

data = data[["MONTH", "DAY", "DAY_OF_WEEK", "AIRLINE", "FLIGHT_NUMBER", "DESTINATION_AIRPORT",
             "ORIGIN_AIRPORT", "AIR_TIME", "DEPARTURE_TIME", "DISTANCE", "ARRIVAL_DELAY"]]
data.dropna(inplace=True)

# Target: 1 if the arrival delay is more than 10 minutes, else 0
data["ARRIVAL_DELAY"] = (data["ARRIVAL_DELAY"] > 10) * 1

# Integer-encode the categorical columns
cols = ["AIRLINE", "FLIGHT_NUMBER", "DESTINATION_AIRPORT", "ORIGIN_AIRPORT"]
for item in cols:
    data[item] = data[item].astype("category").cat.codes + 1

train, test, y_train, y_test = train_test_split(data.drop(["ARRIVAL_DELAY"], axis=1),
                                                data["ARRIVAL_DELAY"],
                                                random_state=10, test_size=0.25)

4.1 XGBoost

import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

def auc(m, train, test):
    return (metrics.roc_auc_score(y_train,m.predict_proba(train)[:,1]),
                            metrics.roc_auc_score(y_test,m.predict_proba(test)[:,1]))

# Parameter Tuning
model = xgb.XGBClassifier()
param_dist = {"max_depth": [10,30,50],
              "min_child_weight" : [1,3,6],
              "n_estimators": [200],
              "learning_rate": [0.05, 0.1, 0.16],}
grid_search = GridSearchCV(model, param_grid=param_dist, cv=3,
                           verbose=10, n_jobs=-1)
grid_search.fit(train, y_train)

grid_search.best_estimator_

# Refit with the best parameters found by the grid search
model = xgb.XGBClassifier(max_depth=50, min_child_weight=1, n_estimators=200,
                          n_jobs=-1, verbosity=1, learning_rate=0.16)
model.fit(train, y_train)

auc(model, train, test)

4.2 LightGBM

import lightgbm as lgb
from sklearn import metrics

def auc2(m, train, test):
    return (metrics.roc_auc_score(y_train,m.predict(train)),
                            metrics.roc_auc_score(y_test,m.predict(test)))

lg = lgb.LGBMClassifier(verbose=0)
param_dist = {"max_depth": [25,50, 75],
              "learning_rate" : [0.01,0.05,0.1],
              "num_leaves": [300,900,1200],
              "n_estimators": [200]
             }
grid_search = GridSearchCV(lg, n_jobs=-1, param_grid=param_dist, cv=3,
                           scoring="roc_auc", verbose=5)
grid_search.fit(train, y_train)
grid_search.best_estimator_

d_train = lgb.Dataset(train, label=y_train)
params = {"max_depth": 50, "learning_rate": 0.1, "num_leaves": 900, "n_estimators": 300}

# Without categorical features
model2 = lgb.train(params, d_train)
auc2(model2, train, test)

# With categorical features
cate_features_name = ["MONTH", "DAY", "DAY_OF_WEEK", "AIRLINE", "DESTINATION_AIRPORT",
                      "ORIGIN_AIRPORT"]
model2 = lgb.train(params, d_train, categorical_feature=cate_features_name)
auc2(model2, train, test)

4.3 CatBoost

When tuning the parameters of CatBoost, it is difficult to pass the index of the categorical features. So I tuned the parameters without passing categorical features and evaluated two models – one with categorical features and one without. I tuned one_hot_max_size alone since it doesn’t affect other parameters.

import catboost as cb
cat_features_index = [0, 1, 2, 3, 4, 5, 6]

def auc(m, train, test):
    return (metrics.roc_auc_score(y_train, m.predict_proba(train)[:, 1]),
            metrics.roc_auc_score(y_test, m.predict_proba(test)[:, 1]))

params = {'depth': [4, 7, 10],
          'learning_rate': [0.03, 0.1, 0.15],
          'l2_leaf_reg': [1, 4, 9],
          'iterations': [300]}
cbc = cb.CatBoostClassifier()   # keep the module alias `cb` intact for later use
cb_model = GridSearchCV(cbc, params, scoring="roc_auc", cv=3)
cb_model.fit(train, y_train)

# Without categorical features
clf = cb.CatBoostClassifier(eval_metric="AUC", depth=10, iterations=500,
                            l2_leaf_reg=9, learning_rate=0.15)
clf.fit(train, y_train)
auc(clf, train, test)

# With categorical features
clf = cb.CatBoostClassifier(eval_metric="AUC", one_hot_max_size=31,
                            depth=10, iterations=500, l2_leaf_reg=9, learning_rate=0.15)
clf.fit(train, y_train, cat_features=cat_features_index)
auc(clf, train, test)

5. Conclusion

When evaluating a model, we should consider its performance from two aspects: speed and accuracy.

With this in mind, CatBoost is the winner, with the highest accuracy on the test set (0.816), the least overfitting (close accuracy on the training and test sets), and the shortest prediction time and tuning time. But this is only because we considered categorical variables and adjusted one_hot_max_size. If we do not take advantage of these features of CatBoost, its accuracy is only 0.752, which is the worst performance. Therefore, we conclude that CatBoost only performs well when there are categorical variables in the data and we adjust them correctly.

The next model to perform well is XGBoost. Even ignoring the fact that we had categorical variables in the data and converted them to numerical variables for XGBoost to use, its accuracy was still quite close to CatBoost’s. However, the only problem with XGBoost is that it is too slow. It’s really frustrating to tune its parameters, especially with GridSearchCV (it took me 6 hours to run GridSearchCV, very bad idea!). A better approach is to tune the parameters individually instead of using GridSearchCV, as sketched below. Read this blog post to learn how to fine-tune parameters.
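As a rough illustration of tuning one parameter at a time instead of a full grid search, here is a hedged sketch reusing the train/y_train split from Section 4; the candidate grids and the cross_val_score setup are my own choices, not a recipe from the original post.

from sklearn.model_selection import cross_val_score
import xgboost as xgb

best_params = {"n_estimators": 200, "learning_rate": 0.1, "max_depth": 10}

# Tune one parameter at a time, keeping the others fixed at their current best.
for name, candidates in [("max_depth", [10, 30, 50]),
                         ("min_child_weight", [1, 3, 6]),
                         ("learning_rate", [0.05, 0.1, 0.16])]:
    scores = {}
    for value in candidates:
        params = {**best_params, name: value}
        model = xgb.XGBClassifier(n_jobs=-1, **params)
        scores[value] = cross_val_score(model, train, y_train,
                                        scoring="roc_auc", cv=3).mean()
    best_params[name] = max(scores, key=scores.get)
    print(name, "->", best_params[name])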

Finally, LightGBM ranks last. One thing to note here is that when using categorical_feature it does not perform well in terms of speed or accuracy. I think the reason it performs poorly is that it uses some kind of modified mean encoding on the categorical data, which leads to overfitting (very high training accuracy of 0.999 compared to a low test accuracy). However, if used normally like XGBoost, it can achieve similar (or even higher) accuracy (LGBM 0.785 vs. XGBoost 0.789) much faster than XGBoost.

Finally, I must say that these observations apply to this particular dataset and may or may not hold for other datasets. In general, however, it does hold that XGBoost is slower than the other two algorithms.