Binary classification on the Titanic data set with XGBoost: one-hot/label encoding, exporting the model to a JSON file, and loading it for inference

# 1. Define the data set

# 2. Data preprocessing

# 2.1 Fill missing values

# 2.2 Construct features

# 2.3 Encode features

# 2.4 Separate features and labels

# 3. Model training and evaluation

# 3.1 Split the data set into training and test sets

# 3.2 Train and evaluate the model

# 3.3 Export the model as a JSON file

# Get the parameters of the model

# 4. Model inference

# 4.1 Load the model file

# 4.2 Create a model and load the JSON parameters

# 4.3 Run inference

# 4.3.1 Load a new sample

# 4.3.2 Preprocess the new sample

# 4.3.3 Retrain the model from the JSON parameters, then predict

# 1. Define the data set

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 # Column Non-Null Count Dtype
--- ------ -------------- -----
 0 PassengerId 891 non-null int64
 1 Survived 891 non-null int64
 2 Pclass 891 non-null int64
 3 Name 891 non-null object
 4 Sex 891 non-null object
 5 Age 714 non-null float64
 6 SibSp 891 non-null int64
 7 Parch 891 non-null int64
 8 Ticket 891 non-null object
 9 Fare 891 non-null float64
 10 Cabin 204 non-null object
 11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
   PassengerId  Survived  Pclass  ...     Fare  Cabin  Embarked
0            1         0       3  ...   7.2500    NaN         S
1            2         1       1  ...  71.2833    C85         C
2            3         1       3  ...   7.9250    NaN         S
3            4         1       1  ...  53.1000   C123         S
4            5         0       3  ...   8.0500    NaN         S

[5 rows x 12 columns]

# 2. Data preprocessing

# 2.1 Fill missing values
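
The original code for this step is not shown. A minimal sketch of how the missing values could be filled, assuming the median for `Age`, the most frequent value for `Embarked`, and dropping `Cabin` (only 204 of 891 values are present). The four-row frame is a hypothetical stand-in for the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in for the Titanic frame (the real one has 891 rows).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, "C123"],
})

# Assumption: fill Age with the median and Embarked with the most frequent value.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Cabin is mostly empty, so this sketch drops it instead of filling it.
df = df.drop(columns=["Cabin"])

print(df)
```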

# 2.2 Construct features
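
A minimal sketch of the two constructed features, `FamilySize` and `IsAlone`. The formulas are an assumption, chosen because they agree with the sample row in section 4.3.2 (`SibSp=1`, `Parch=0`, `FamilySize=2`, `IsAlone=0`). The three-row frame is hypothetical:

```python
import pandas as pd

# Hypothetical rows; only the two columns needed for the new features.
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Assumed formula: siblings/spouses + parents/children + the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# A passenger is alone when the family consists of just that passenger.
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

print(df)
```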

after fillna and FE
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 # Column Non-Null Count Dtype
--- ------ -------------- -----
 0 Survived 891 non-null int64
 1 Pclass 891 non-null int64
 2 Sex 891 non-null object
 3 Age 891 non-null float64
 4 SibSp 891 non-null int64
 5 Parch 891 non-null int64
 6 Fare 891 non-null float64
 7 Embarked 891 non-null object
 8 FamilySize 891 non-null int64
 9 IsAlone 891 non-null int32
dtypes: float64(2), int32(1), int64(5), object(2)
memory usage: 66.3+ KB

# 2.3 Encode features
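
The label-encoding step can be sketched as follows. It assumes scikit-learn's `LabelEncoder` applied to `Sex` and `Embarked`, the two `object` columns left after the previous steps. The `encoders` dictionary is a hypothetical helper that keeps each fitted encoder so the same mapping can be reused at inference time:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows covering every category of the two text columns.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

encoders = {}  # hypothetical helper: column name -> fitted encoder
for col in ["Sex", "Embarked"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])  # classes are sorted, then numbered from 0
    encoders[col] = enc

print(df)
print({col: list(enc.classes_) for col, enc in encoders.items()})
```

`LabelEncoder` sorts the classes before numbering them, so here `female`/`male` become 0/1 and `C`/`Q`/`S` become 0/1/2.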

after LabelEncoder
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 # Column Non-Null Count Dtype
--- ------ -------------- -----
 0 Survived 891 non-null int64
 1 Pclass 891 non-null int64
 2 Sex 891 non-null int32
 3 Age 891 non-null float64
 4 SibSp 891 non-null int64
 5 Parch 891 non-null int64
 6 Fare 891 non-null float64
 7 Embarked 891 non-null int32
 8 FamilySize 891 non-null int64
 9 IsAlone 891 non-null int32
dtypes: float64(2), int32(3), int64(5)
memory usage: 59.3 KB

# 2.4 Separate features and labels

# 3. Model training and evaluation

# 3.1 Split the data set into training and test sets
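
A minimal sketch of separating the label from the features and splitting the rows. The `test_size=0.2` and `random_state=42` values are assumptions; a 20% hold-out of 891 rows gives 179 test rows, which is at least consistent with the reported accuracy (0.8435754... = 151/179). The frame is filled with random placeholder values, not real Titanic data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with 891 rows of random values (not the real data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Survived": rng.integers(0, 2, size=891),
    "Pclass": rng.integers(1, 4, size=891),
    "Fare": rng.uniform(0, 100, size=891),
})

# Features are every column except the label.
X = df.drop(columns=["Survived"])
y = df["Survived"]

# Assumed split: 80% training, 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```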

# 3.2 Train and evaluate the model

Accuracy: 0.8435754189944135
F1: 0.7812500000000001
AUC: 0.8275978407557355
XGBoost 0.8435754189944135 0.7812500000000001 0.8275978407557355

Model                            Accuracy     F1        AUC
XGBoost                          0.832402235  0.765625  0.815519568
XGBoost + FamilySize             0.843575419  0.78125   0.827597841
XGBoost + FamilySize + IsAlone   0.843575419  0.78125   0.827597841
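
The three metrics above can be computed with scikit-learn as sketched below. The label and score arrays are hypothetical; in the real pipeline `y_pred` would come from `model.predict(X_test)` and `y_prob` from `model.predict_proba(X_test)[:, 1]`. Whether the original AUC was computed from probabilities or from hard labels is not stated, so passing probabilities here is an assumption:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hypothetical ground truth, hard predictions, and positive-class probabilities.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]

acc = accuracy_score(y_true, y_pred)  # share of correct predictions
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)   # ranking quality of the scores

print(f"Accuracy: {acc}  F1: {f1}  AUC: {auc}")
```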

# 3.3 Export the model as a JSON file

# Get the parameters of the model

model.json {'objective': 'binary:logistic', 'use_label_encoder': None, 'base_score': None, 'booster': None, 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': None, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': None, 'feature_types': None, 'gamma': None, 'gpu_id': None, 'grow_policy': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': None, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 'max_depth': None, 'max_leaves': None, 'min_child_weight': None, 'missing': nan, 'monotone_constraints': None, 'n_estimators': 100, 'n_jobs': None, 'num_parallel_tree': None, 'predictor': None, 'random_state': None, 'reg_alpha': None, 'reg_lambda': None, 'sampling_method': None, 'scale_pos_weight': None, 'subsample': None, 'tree_method': None, 'validate_parameters': None, 'verbosity': None}

# 4. Model inference

# 4.1 Load the model file

# 4.2 Create a model and load the JSON parameters

# 4.3 Run inference

# 4.3.1 Load a new sample

# 4.3.2 Preprocess the new sample

raw test data
   Pclass   Sex  Age  SibSp  Parch  Fare  Embarked  FamilySize  IsAlone
0       3  male   25      1      0  7.25         S           2        0
test data after LabelEncoder
   Pclass  Sex  Age  SibSp  Parch  Fare  Embarked  FamilySize  IsAlone
0       3    0   25      1      0  7.25         0           2        0
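
In the output above both `male` and `S` are encoded as 0. That is what a `LabelEncoder` fitted on a single row returns for any value, and it may not match the codes used during training (a `LabelEncoder` fitted on the full columns sorts the classes, giving `male` = 1 and `S` = 2). A minimal sketch of preprocessing the new sample with encoders fitted on the training columns instead; the small `train` frame and the `encoders` dictionary are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training columns containing every category.
train = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})
encoders = {col: LabelEncoder().fit(train[col]) for col in ["Sex", "Embarked"]}

# The new passenger from the example above, before feature construction.
sample = pd.DataFrame({
    "Pclass": [3], "Sex": ["male"], "Age": [25], "SibSp": [1],
    "Parch": [0], "Fare": [7.25], "Embarked": ["S"],
})

# Same constructed features as in training (assumed formulas).
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1
sample["IsAlone"] = (sample["FamilySize"] == 1).astype(int)

# transform() reuses the training mapping instead of learning a new one.
for col, enc in encoders.items():
    sample[col] = enc.transform(sample[col])

print(sample)
```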

# 4.3.3 Retrain the model from the JSON parameters, then predict

Model inference
   Pclass  Sex  Age  SibSp  Parch  Fare  Embarked  FamilySize  IsAlone
0       3    0   25      1      0  7.25         0           2        0
Inference result: [0]

