CatBoost is one of the top machine learning models. With its gradient boosting technique and built-in functions, it is possible to generate some really good models without much work. SHAP (SHapley Additive exPlanation) is designed to explain the output of machine learning models with unique visual and performance value. CatBoost and SHAP together form a powerful combination that produces some very accurate and interpretable results.
This article will show how to use them together to interpret results with multiclass datasets.
Dataset
The dataset is a 12 column by 13393 row collection obtained from Kaggle. It contains the physical results as well as the performance results of the physical tests. Objective Scoring is an A-D based multi-category system.
Dependency package
We need to import the following packages:
import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns from sklearn.model_selection import train_test_split import seaborn as sns from sklearn import metrics import plotly.express as px from sklearn. pipeline import Pipeline #models from sklearn.linear_model import LogisticRegression from sklearn.tree import DecisionTreeClassifier from sklearn.ensemble import GradientBoostingClassifier from sklearn.ensemble import RandomForestClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.naive_bayes import BernoulliNB from sklearn.ensemble import BaggingClassifier from sklearn.ensemble import AdaBoostClassifier from sklearn.naive_bayes import GaussianNB from sklearn.neural_network import MLPClassifier from sklearn.linear_model import SGDClassifier from xgboost import XGBClassifier from catboost import CatBoostClassifier import xgboost as xgb import catboost from sklearn.model_selection import train_test_split #scoring from sklearn.metrics import confusion_matrix from sklearn.metrics import accuracy_score, precision_score, recall_score, average_precision_score, roc_auc_score, precision_recall_curve, roc_curve, auc from sklearn.model_selection import cross_val_score from sklearn.model_selection import GridSearchCV import shape
Data cleaning/EDA
The dataset has no missing values, so we directly do EDA to look at the distribution of the features and check for outliers.
Then do a simple process to delete extreme values:
df = df[df['height_cm'] > 130] df = df[df['weight_kg'] < 120]
Data segmentation and testing
Then created the train/test split and built a pipeline and compared all the models selected in the 5-split cross-validation.
X = df.drop('class', axis=1) y = df[['class']] y = y.values.ravel() X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=46)
The following are some common models in sklearn, which we have used directly for comparison:
#Build pipeline for models pipeline_lr = Pipeline([('lr_classifier',LogisticRegression())]) pipeline_dt = Pipeline([('dt_classifier',DecisionTreeClassifier())]) pipeline_gbcl = Pipeline([('gbcl_classifier',GradientBoostingClassifier())]) pipeline_rf = Pipeline([('rf_classifier',RandomForestClassifier())]) pipeline_knn = Pipeline([('knn_classifier',KNeighborsClassifier())]) pipeline_bnb = Pipeline([('bnb_classifer',BernoulliNB())]) pipeline_bag = Pipeline([('bag_classifer',BaggingClassifier())]) pipeline_ada = Pipeline([('bnb_classifer',AdaBoostClassifier())]) pipeline_gnb = Pipeline([('gnb_classifer',GaussianNB())]) pipeline_mlp = Pipeline([('mlp_classifer',MLPClassifier())]) pipeline_sgd = Pipeline([('sgd_classifer',SGDClassifier())]) pipeline_xgb = Pipeline([('xgb_classifer',XGBClassifier())]) pipeline_cat = Pipeline([('cat_classifier',CatBoostClassifier(verbose=False))]) # List of all the pipelines pipelines = [pipeline_lr, pipeline_dt, pipeline_gbcl, pipeline_rf, pipeline_knn, pipeline_bnb, pipeline_bag, pipeline_ada, pipeline_gnb, pipeline_mlp, pipeline_sgd, pipeline_xgb, pipeline_cat] # Dictionary of pipelines and classifier types for ease of reference pipe_dict = {0: 'Logistic Regression', 1: 'Decision Tree', 2: 'Gradient Boost', 3:'RandomForest', 4: 'KNN', 5: 'BN', 6:'Bagging', 7: 'Ada Boost', 8:'GaussianNB', 9:'MLP Classifier', 10:'SGD Classifier', 11:'XG Boost', 12:'Cat Boost'} # Fitting the pipelines for pipe in pipelines: pipe. fit(X_train, y_train)
List all results to determine the best model
cv_results_accuracy = [] for i, model in enumerate(pipelines): cv_score = cross_val_score(model, X_train, y_train, cv=5) cv_results_accuracy.append(cv_score) print("%s: %f " % (pipe_dict[i], cv_score.mean()))
It can be seen that although CatBoost is not the highest score in the CV comparison, although CatBoost is lower than XGB, its speed is much faster than XGB, so we use it in this project.
Another piece of information here is that basically the tree model scores very high, which also shows that the current tree model is still the best choice for processing tabular data.
Model results
model = CatBoostClassifier(verbose=False) model. fit(X_train, y_train) #Print scores for Multiclass y_test_pred = model. predict(X_test) y_test_prob = model.predict_proba(X_test) print(metrics.classification_report(y_test, y_test_pred, digits=3)) print('Accuracy score: ', accuracy_score(y_test, y_test_pred)) print('Roc auc score : ', roc_auc_score(y_test, y_test_prob, multi_class='ovr'))
confusion matrix:
confusion = confusion_matrix(y_test, y_test_pred) fig = px.imshow(confusion, labels=dict(x="Predicted Value", y="Actual Vlaue"), x=[1,2,3,4],y=[1,2,3,4] ,text_auto=True, title='Confusion Matrix') fig.update_layout(title_x=0.5) fig. show()
ROC curve for classification
from itertools import cycle fig, ax = plt.subplots(figsize=(6, 6)) plt.plot( fpr["micro"], tpr["micro"], label=f"micro-average ROC curve (AUC = {roc_auc['micro']:.2f})", color="deeppink", linestyle=":", linewidth=4, ) plt.plot( fpr["macro"], tpr["macro"], label=f"macro-average ROC curve (AUC = {roc_auc['macro']:.2f})", color="navy", linestyle=":", linewidth=4, ) colors = cycle(["aqua", "darkorange", "cornflowerblue"]) for class_id, color in zip(range(n_classes), colors): RocCurveDisplay.from_predictions( y_onehot_test[:, class_id], y_test_prob[:, class_id], name=f"ROC curve for {target_names[class_id]}", color=color, ax=ax, ) plt.plot([0, 1], [0, 1], "k--", label="ROC curve for chance level (AUC = 0.5)") plt. axis("square") plt.xlabel("False Positive Rate") plt.ylabel("True Positive Rate") plt.title("Extension of Receiver Operating Characteristic\ to One-vs-Rest multiclass") plt. legend() plt. show()
It can be seen that the score of the CatBoost model is very high. The above are the basic operations of our modeling. Next, we start to add SHAP
SHAP
In order to take advantage of SHAP, we need to create a binary model so they can be given a clear direction. So write a new result column that changes the scores from a – d to 0 and 1.
def class2(score): if score > 1: return 1 else: return 0 df['class2'] = df['class']. apply(class2)
Next, create a new train/test split and CatBoost model for this new binary score.
Below are the results of the binary model
It can be seen that the result is better than the multi-class scoring model.
Let’s start using SHAP below. The first is feature importance, which shows the strength of each feature on the model.
#Create list for cat features cat_features = list(range(0, X. shape[1])) print(cat_features) #Create feature importance featurep = model.get_feature_importance(prettified=True)
The results of SHAP feature importance are as follows:
Visualizations make it very clear which values have the most impact on the model
Although it is not as simple as every feature having importance in one direction, its importance can be directly distributed at a certain stage in each direction.
Let’s look at the beeswarm diagram again:
#Create explainer and shape values from model explainer = shap. Explainer(model2) shap_values = explainer(X_test2) #Plot shap beesworm shap.plots.beeswarm(shap_values)
Here is the distribution map of beeswarm. It can show plots of each feature and its impact on the model from two directions (see figure below). And it also shows the influence by color and scale to the right, and the volume of influence by size. This allows us to see how each feature affects the score, and to what extent in each particular direction.
We can also create SHAPs decision tree diagrams.
#Plot shape decision tree expected_values = explainer. expected_value shap_array = explainer. shap_values(X_test2) shap.decision_plot(expected_values, shap_array[0:10],feature_names=list(X.columns))
Waterfall plots for SHAPs show individual predictions and how they are affected by each feature and its score. This waterfall chart shows how each feature score deviates in each direction as they are applied. This allows us to see the impact of each feature on the prediction.
The bottom is not deviated in all predictions, but when we look up we can see that the last few features move significantly in each direction. This is a great way to see how each feature affects the prediction/score.
We can also display waterfall charts of individual forecasts. Below we show 2 predictions, one with a positive score and one with a negative score.
These two independent prediction waterfall plots allow us to gain more insight into how each feature affects the prediction score. It gives us the SHAP value and range and direction of each feature. It also shows the score for each feature on the left. This allows us to decompose the impact of each feature on a single score or prediction.
To get a better understanding of each feature, we can also create a scatter plot using the SHAP values for each feature.
#Create shape scatterplots for important features fig, ax = plt.subplots(nrows=1, ncols=2,figsize=(16,8)) #SHAP scatter plots shap.plots.scatter(shap_values[:,"sit and bend forward_cm"],ax=ax[0],show=False) shap.plots.scatter(shap_values[:,"sit-ups counts"],ax=ax[1])
shap.plots.scatter(shap_values[:,"weight_kg"],ax=ax[0],show=False) shap.plots.scatter(shap_values[:,"gender"],ax=ax[1])
A scatterplot of SHAP values shows a feature’s score on the x-axis and its SHAP value on the y-axis. This allows us to see the feature’s score in each direction of its SHAP value.
We can see that a scatterplot of SHAP values can look very different and can show us many different types of insights about how each attribute contributes to the overall score.
Summary
The examples in this article demonstrate the power of CatBoost, which makes it easy to create a good scoring model. But more importantly we demonstrate the power of SHAP in analyzing model features. It allows us to look at features from many different angles than we can explore with plain EDA and correlation. It really lives up to its name with additional explanations that allow predictive modeling through the model, giving us insight into the features themselves.
Full code:
https://avoid.overfit.cn/post/e0b9e5e6712048dba754cae5c1269b9b
By: lochie links