A complete multi-class classification code example using CatBoost and SHAP

CatBoost is one of the top gradient boosting libraries: with sensible defaults and built-in functionality, it can produce very good models with little effort. SHAP (SHapley Additive exPlanations) is designed to explain the output of machine learning models with clear visual and quantitative summaries. Together, CatBoost and SHAP form a powerful combination that produces accurate and interpretable results.

This article will show how to use them together to interpret the results of a multi-class model.

Dataset

The dataset is a 12-column by 13,393-row collection obtained from Kaggle. It contains physical measurements as well as the results of physical performance tests. The target is a multi-class grade from A to D.

Dependencies

We need to import the following packages:

 import pandas as pd
 import numpy as np
 import matplotlib.pyplot as plt
 import seaborn as sns
 from sklearn.model_selection import train_test_split
 from sklearn import metrics
 import plotly.express as px
 
 
 from sklearn.pipeline import Pipeline
 
 #models
 from sklearn.linear_model import LogisticRegression
 from sklearn.tree import DecisionTreeClassifier
 from sklearn.ensemble import GradientBoostingClassifier
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.neighbors import KNeighborsClassifier
 from sklearn.naive_bayes import BernoulliNB
 from sklearn.ensemble import BaggingClassifier
 from sklearn.ensemble import AdaBoostClassifier
 from sklearn.naive_bayes import GaussianNB
 from sklearn.neural_network import MLPClassifier
 from sklearn.linear_model import SGDClassifier
 from xgboost import XGBClassifier
 from catboost import CatBoostClassifier
 import xgboost as xgb
 import catboost
 
 
 #scoring
 from sklearn.metrics import confusion_matrix
 from sklearn.metrics import accuracy_score, precision_score, recall_score, average_precision_score, roc_auc_score, precision_recall_curve, roc_curve, auc
 
 from sklearn.model_selection import cross_val_score
 from sklearn.model_selection import GridSearchCV
 
 import shap
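
With the packages imported, we read the Kaggle dataset into a pandas DataFrame. This is a minimal sketch; the CSV file name below is an assumption and should match wherever the downloaded file is saved:

 #Load the body performance dataset (file name assumed)
 df = pd.read_csv('bodyPerformance.csv')
 print(df.shape)
 df.head()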

Data cleaning/EDA

The dataset has no missing values, so we go straight to EDA, looking at the distribution of each feature and checking for outliers, for example with the quick checks sketched below.
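
A few quick checks are enough here; this is just a sketch, and any of the usual pandas/seaborn calls would do:

 #Missing values and summary statistics
 print(df.isnull().sum())
 print(df.describe())
 
 #Histograms of every numeric feature to spot skew and extreme values
 df.hist(figsize=(14, 10), bins=30)
 plt.tight_layout()
 plt.show()
 
 #Boxplots make the outliers in height and weight easy to see
 sns.boxplot(data=df[['height_cm', 'weight_kg']])
 plt.show()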

Then we apply a simple filter to drop the extreme values:

 df = df[df['height_cm'] > 130]
 df = df[df['weight_kg'] < 120]

Train/test split and model comparison

Next we create the train/test split, build a pipeline for each candidate model, and compare them all with 5-fold cross-validation.

 X = df.drop('class', axis=1)
 y = df[['class']]
 y = y.values.ravel()
 
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=46)

The following are common scikit-learn models (plus XGBoost and CatBoost), used with their default settings for comparison:

 #Build pipeline for models
 
 pipeline_lr = Pipeline([('lr_classifier',LogisticRegression())])
 
 pipeline_dt = Pipeline([('dt_classifier',DecisionTreeClassifier())])
 
 pipeline_gbcl = Pipeline([('gbcl_classifier',GradientBoostingClassifier())])
 
 pipeline_rf = Pipeline([('rf_classifier',RandomForestClassifier())])
 
 pipeline_knn = Pipeline([('knn_classifier',KNeighborsClassifier())])
 
 pipeline_bnb = Pipeline([('bnb_classifier',BernoulliNB())])
 
 pipeline_bag = Pipeline([('bag_classifier',BaggingClassifier())])
 
 pipeline_ada = Pipeline([('ada_classifier',AdaBoostClassifier())])
 
 pipeline_gnb = Pipeline([('gnb_classifier',GaussianNB())])
 
 pipeline_mlp = Pipeline([('mlp_classifier',MLPClassifier())])
 
 pipeline_sgd = Pipeline([('sgd_classifier',SGDClassifier())])
 
 pipeline_xgb = Pipeline([('xgb_classifier',XGBClassifier())])
 
 pipeline_cat = Pipeline([('cat_classifier',CatBoostClassifier(verbose=False))])
 
 # List of all the pipelines
 pipelines = [pipeline_lr, pipeline_dt, pipeline_gbcl, pipeline_rf, pipeline_knn, pipeline_bnb, pipeline_bag, pipeline_ada, pipeline_gnb, pipeline_mlp, pipeline_sgd, pipeline_xgb, pipeline_cat]
 
 # Dictionary of pipelines and classifier types for ease of reference
 pipe_dict = {0: 'Logistic Regression', 1: 'Decision Tree', 2: 'Gradient Boost', 3:'RandomForest', 4: 'KNN', 5: 'BN', 6:'Bagging', 7: 'Ada Boost', 8:'GaussianNB', 9:'MLP Classifier', 10:'SGD Classifier', 11:'XG Boost', 12:'Cat Boost'}
 
 
 # Fitting the pipelines
 for pipe in pipelines:
     pipe.fit(X_train, y_train)

Listing all cross-validation results lets us determine the best model:

 cv_results_accuracy = []
 for i, model in enumerate(pipelines):
     cv_score = cross_val_score(model, X_train, y_train, cv=5)
     cv_results_accuracy.append(cv_score)
     print("%s: %f " % (pipe_dict[i], cv_score.mean()))

Although CatBoost does not have the highest CV score (it comes in slightly below XGBoost), it trains much faster, so we use it for this project. A quick way to check the speed difference is to time the two fits, as in the sketch below.
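
Timing both fits on the training set gives a rough sense of the difference. This is a minimal sketch; wall-clock times depend on hardware and library versions:

 import time
 
 #Rough wall-clock comparison of XGBoost and CatBoost on the training data
 for name, clf in [('XGBoost', XGBClassifier()), ('CatBoost', CatBoostClassifier(verbose=False))]:
     start = time.time()
     clf.fit(X_train, y_train)
     print('%s: %.1f s' % (name, time.time() - start))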

Another takeaway is that the tree-based models all score very well, which supports the view that tree ensembles are still the strongest choice for tabular data.

Model results

 model = CatBoostClassifier(verbose=False)
 model.fit(X_train, y_train)
 
 #Print scores for Multiclass
 y_test_pred = model.predict(X_test)
 y_test_prob = model.predict_proba(X_test)
 
 print(metrics.classification_report(y_test, y_test_pred, digits=3))
 print('Accuracy score: ', accuracy_score(y_test, y_test_pred))
 print('Roc auc score : ', roc_auc_score(y_test, y_test_prob, multi_class='ovr'))

Confusion matrix:

 confusion = confusion_matrix(y_test, y_test_pred)
 
 fig = px.imshow(confusion, labels=dict(x="Predicted Value", y="Actual Value"), x=['A','B','C','D'], y=['A','B','C','D'], text_auto=True, title='Confusion Matrix')
 fig.update_layout(title_x=0.5)
 fig.show()

ROC curve for classification

 from itertools import cycle
 from sklearn.preprocessing import label_binarize
 from sklearn.metrics import RocCurveDisplay
 
 #Binarize the test labels for the one-vs-rest ROC curves
 target_names = model.classes_
 n_classes = len(target_names)
 y_onehot_test = label_binarize(y_test, classes=target_names)
 
 fpr, tpr, roc_auc = dict(), dict(), dict()
 
 #Micro-average: pool all classes into one binary problem
 fpr["micro"], tpr["micro"], _ = roc_curve(y_onehot_test.ravel(), y_test_prob.ravel())
 roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
 
 #Macro-average: average the per-class curves over a common FPR grid
 for i in range(n_classes):
     fpr[i], tpr[i], _ = roc_curve(y_onehot_test[:, i], y_test_prob[:, i])
 fpr_grid = np.linspace(0.0, 1.0, 1000)
 mean_tpr = np.zeros_like(fpr_grid)
 for i in range(n_classes):
     mean_tpr += np.interp(fpr_grid, fpr[i], tpr[i])
 mean_tpr /= n_classes
 fpr["macro"], tpr["macro"] = fpr_grid, mean_tpr
 roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
 
 fig, ax = plt.subplots(figsize=(6, 6))
 
 plt.plot(
     fpr["micro"],
     tpr["micro"],
     label=f"micro-average ROC curve (AUC = {roc_auc['micro']:.2f})",
     color="deeppink",
     linestyle=":",
     linewidth=4,
 )
 
 plt.plot(
     fpr["macro"],
     tpr["macro"],
     label=f"macro-average ROC curve (AUC = {roc_auc['macro']:.2f})",
     color="navy",
     linestyle=":",
     linewidth=4,
 )
 
 #One curve per class
 colors = cycle(["aqua", "darkorange", "cornflowerblue", "mediumseagreen"])
 for class_id, color in zip(range(n_classes), colors):
     RocCurveDisplay.from_predictions(
         y_onehot_test[:, class_id],
         y_test_prob[:, class_id],
         name=f"ROC curve for {target_names[class_id]}",
         color=color,
         ax=ax,
     )
 
 plt.plot([0, 1], [0, 1], "k--", label="ROC curve for chance level (AUC = 0.5)")
 plt.axis("square")
 plt.xlabel("False Positive Rate")
 plt.ylabel("True Positive Rate")
 plt.title("Extension of Receiver Operating Characteristic to One-vs-Rest multiclass")
 plt.legend()
 plt.show()

It can be seen that the CatBoost model scores very well. That covers the basic modeling steps; next, we add SHAP.

SHAP

To get the most out of SHAP, we build a binary version of the model so the explanations have a single clear direction. We therefore add a new target column that collapses the A to D grades into 0 and 1.

 #Collapse the A-D grades into a binary target: A or B -> 1, C or D -> 0
 def class2(score):
     if score in ('A', 'B'):
         return 1
     else:
         return 0
 
 df['class2'] = df['class'].apply(class2)

Next, create a new train/test split and a CatBoost model for this new binary target.
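
The SHAP snippets below refer to the binary model as model2 and its test set as X_test2; a minimal sketch of this step, reusing the 70/30 split from above, could look like this:

 #Features stay the same; the target is the new binary column
 X2 = df.drop(['class', 'class2'], axis=1)
 y2 = df['class2']
 
 X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.30, random_state=46)
 
 #Fit a second CatBoost model on the binary target
 model2 = CatBoostClassifier(verbose=False)
 model2.fit(X_train2, y_train2)
 
 #Score the binary model
 y_test_pred2 = model2.predict(X_test2)
 y_test_prob2 = model2.predict_proba(X_test2)[:, 1]
 print(metrics.classification_report(y_test2, y_test_pred2, digits=3))
 print('Accuracy score: ', accuracy_score(y_test2, y_test_pred2))
 print('Roc auc score : ', roc_auc_score(y_test2, y_test_prob2))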

The binary model scores noticeably better than the multi-class model.

Let's start analyzing the model. The first view is feature importance, which shows how strongly each feature influences the model.

 #Create list for cat features
 cat_features = list(range(0, X.shape[1]))
 print(cat_features)
 
 #Create feature importance
 featurep = model.get_feature_importance(prettified=True)

The feature importance results are stored in featurep.
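
One simple way to visualize this table is a bar chart. This is a sketch; the column names 'Feature Id' and 'Importances' follow CatBoost's prettified output:

 #Bar chart of the prettified feature importance table
 plt.figure(figsize=(10, 6))
 sns.barplot(data=featurep, x='Importances', y='Feature Id')
 plt.title('CatBoost feature importance')
 plt.tight_layout()
 plt.show()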

The visualization makes it very clear which features have the most impact on the model.

Feature importance alone, however, does not say in which direction a feature pushes the predictions: a feature can raise the score over part of its range and lower it over another.

Next, let's look at the beeswarm plot:

 #Create explainer and SHAP values from the binary model
 explainer = shap.Explainer(model2)
 shap_values = explainer(X_test2)
 
 #Plot SHAP beeswarm
 shap.plots.beeswarm(shap_values)

The beeswarm plot shows the distribution of SHAP values for every feature. Each point is one observation: its horizontal position shows whether, and how strongly, the feature pushed that prediction up or down, and the color scale on the right shows whether the feature's value was high or low. This lets us see how each feature affects the score, and to what extent, in each direction.

We can also create SHAP decision plots.

 #Plot SHAP decision plot for the first 10 test predictions
 expected_values = explainer.expected_value
 shap_array = explainer.shap_values(X_test2)
 
 shap.decision_plot(expected_values, shap_array[0:10], feature_names=list(X.columns))

The SHAP decision plot shows individual predictions and how each feature moves them away from the model's base value. Reading each line from bottom to top, the prediction deviates left or right as successive features are applied, which lets us see the impact of each feature on the final prediction.

Near the bottom the paths barely deviate, but toward the top the last few (most important) features move each prediction sharply in one direction or the other. This is a great way to see how individual features drive a prediction or score.

We can also display waterfall plots for individual predictions. Below we show two, one with a positive score and one with a negative score.
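
A minimal sketch of those two plots; the row indices here are placeholders, and in practice you would pick one row predicted as 1 and one predicted as 0:

 #Waterfall plots for two individual predictions (indices are illustrative)
 shap.plots.waterfall(shap_values[0])
 shap.plots.waterfall(shap_values[1])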

These two single-prediction waterfall plots give us more insight into how each feature affects the prediction score: they show each feature's SHAP value, its direction and magnitude, and list the feature's actual value on the left. This lets us decompose the impact of every feature on a single score or prediction.

To get a better understanding of each feature, we can also create a scatter plot using the SHAP values for each feature.

 #Create SHAP scatter plots for important features
 fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 8))
 shap.plots.scatter(shap_values[:, "sit and bend forward_cm"], ax=ax[0], show=False)
 shap.plots.scatter(shap_values[:, "sit-ups counts"], ax=ax[1])
 
 #Second pair of features on a new pair of axes
 fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 8))
 shap.plots.scatter(shap_values[:, "weight_kg"], ax=ax[0], show=False)
 shap.plots.scatter(shap_values[:, "gender"], ax=ax[1])

A SHAP scatter plot shows a feature's value on the x-axis and its SHAP value on the y-axis, so we can see how the feature's value relates to the size and direction of its contribution to the prediction.

These scatter plots can look very different from feature to feature, and they give us many different kinds of insight into how each attribute contributes to the overall score.

Summary

The examples in this article demonstrate the power of CatBoost, which makes it easy to build a well-scoring model with very little tuning. More importantly, they demonstrate the power of SHAP for analyzing model features: it lets us examine features from many more angles than plain EDA and correlation analysis allow. SHAP lives up to its name by providing additive explanations of the model's predictions and, through them, insight into the features themselves.

Full code:
https://avoid.overfit.cn/post/e0b9e5e6712048dba754cae5c1269b9b

By: lochie links