Marketing gain based on an uplift model
Little O: Little H, we are running a promotion and issuing coupons to some user groups. The boss said that some users will not buy whether or not they receive a coupon, so we should not waste coupons on them. Is there a way to identify these users?
Little H: This is a typical marketing gain (uplift) problem. Users are generally divided into four categories: Treatment Responders (TR), who buy when issued a discount; Treatment Non-Responders (TN), who do not buy even when issued a discount; Control Responders (CR), who buy without being issued a discount; and Control Non-Responders (CN), who do not buy when no discount is issued. Among them, Treatment Responders are called the marketing-sensitive group.
We need to identify the TR group and target it with marketing stimulation. For the TN and CR groups, coupons can be withheld to reduce costs. The CN group is a special case: to cut costs as far as possible, withhold coupons; to increase user conversion, issue coupons as a stimulus.
The core is how to calculate uplift. This article adopts the premise of issuing coupons for CN groups:
$$uplift = P_{TR} + P_{CN} - P_{TN} - P_{CR}$$
This article refers to the Uplift model to improve user growth.
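As a quick sanity check of the formula, here is a minimal sketch that evaluates it for two made-up users. The function name `uplift_from_probs` and all probabilities are assumptions for illustration only; they do not come from the dataset.

```python
def uplift_from_probs(p_cn, p_cr, p_tn, p_tr):
    """Uplift score as defined above: P_TR + P_CN - P_TN - P_CR."""
    return p_tr + p_cn - p_tn - p_cr

# Hypothetical class probabilities for two users (each set sums to 1).
user_a = uplift_from_probs(p_cn=0.40, p_cr=0.05, p_tn=0.10, p_tr=0.45)
user_b = uplift_from_probs(p_cn=0.05, p_cr=0.45, p_tn=0.40, p_tr=0.10)

print(round(user_a, 2))  # mass mostly on TR and CN gives a positive score
print(round(user_b, 2))  # mass mostly on CR and TN gives a negative score
```

Because the four probabilities sum to 1, the score always falls between -1 and 1.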
Data exploration
import pandas as pd
import numpy as np
import toad
import matplotlib.pyplot as plt  # used by the plotting function later on
import seaborn as sns  # used by the plotting function later on
from sklearn.model_selection import train_test_split  # data partition library
import xgboost as xgb

# Import custom modules
import sys
sys.path.append("/Users/heinrich/Desktop/Heinrich-blog/Data Analysis Manual")
from keyIndicatorMapping import *
Readers who need the data used below can follow the official account HsuHeinrich and reply [data mining-uplift] to receive it automatically~
# Read data
raw_data = pd.read_csv('data.csv')
raw_data.head()

# View data information
toad.detector.detect(raw_data)
Feature Engineering
# Replace string categories with integer codes
zip_code_dic = {
    'Surburban': 1,
    'Urban': 2,
    'Rural': 3
}
channel_dic = {
    'Web': 1,
    'Phone': 2,
    'Multichannel': 3
}
offer_dic = {
    'Buy One Get One': 1,
    'Discount': 2,
    'No Offer': 0
}
raw_data = raw_data.replace({'zip_code': zip_code_dic, 'channel': channel_dic, 'offer': offer_dic})
# Compute the conversion uplift of each offer
def calc_uplift(df):
    # Conversion rate for each offer type (0 = No Offer, 1 = BOGO, 2 = Discount)
    base_conv = df[df.offer == 0]['conversion'].mean()
    disc_conv = df[df.offer == 2]['conversion'].mean()
    bogo_conv = df[df.offer == 1]['conversion'].mean()
    # Uplift of each offer relative to the no-offer baseline
    disc_conv_uplift = disc_conv - base_conv
    bogo_conv_uplift = bogo_conv - base_conv
    print('Discount Conversion Uplift: {0}%'.format(np.round(disc_conv_uplift*100,2)))
    if len(df[df.offer == 1]['conversion']) > 0:
        print('-'*60)
        print('BOGO Conversion Uplift: {0}%'.format(np.round(bogo_conv_uplift*100,2)))
# Classify according to the calculation method of uplift score
# Set control / experimental group
raw_data['campaign_group'] = 1
raw_data.loc[raw_data.offer == 0, 'campaign_group'] = 0
# Classification
raw_data['target_class'] = 0  # CN
raw_data.loc[(raw_data.campaign_group == 0) & (raw_data.conversion == 1), 'target_class'] = 1  # CR
raw_data.loc[(raw_data.campaign_group == 1) & (raw_data.conversion == 0), 'target_class'] = 2  # TN
raw_data.loc[(raw_data.campaign_group == 1) & (raw_data.conversion == 1), 'target_class'] = 3  # TR
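Because both flags are binary, the four labels assigned above are equivalent to `2 * campaign_group + conversion`. The sketch below checks this on a small made-up frame; the `toy` rows and the `label` column are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy rows (hypothetical), one per combination of group and outcome.
toy = pd.DataFrame({
    'campaign_group': [0, 0, 1, 1],
    'conversion':     [0, 1, 0, 1],
})

# 0 = CN, 1 = CR, 2 = TN, 3 = TR, matching the labels assigned above.
toy['target_class'] = 2 * toy['campaign_group'] + toy['conversion']
toy['label'] = toy['target_class'].map({0: 'CN', 1: 'CR', 2: 'TN', 3: 'TR'})
print(toy)
```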
# Model data: drop the columns that the label was derived from
df_model = raw_data.drop(['offer','campaign_group','conversion'], axis=1)
df_model.head()
Data modeling
# Sample split
X = df_model.drop(['target_class'], axis=1)
y = df_model.target_class
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=56)
# XGB classification model training
param_dist = {'objective': 'multi:softmax', 'eval_metric': 'logloss', 'use_label_encoder': False}
model_xgb = xgb.XGBClassifier(**param_dist)
model_xgb.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
              gamma=0, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=8,
              num_parallel_tree=1, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)
# Calculate uplift_score
y_prob = model_xgb.predict_proba(X)
uplift_score = []
for i in y_prob:
    # Columns follow target_class: 0 = CN, 1 = CR, 2 = TN, 3 = TR
    us = i[0] + i[3] - i[1] - i[2]
    uplift_score.append(us)
raw_data['uplift_score'] = uplift_score
raw_data.head()
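The row-by-row loop can also be written as a single matrix product. A minimal sketch, assuming the probability columns are ordered as classes 0 to 3 (CN, CR, TN, TR); `y_prob_demo`, `weights`, and `uplift_demo` are hypothetical names, and the probabilities are made up.

```python
import numpy as np

# Hypothetical predict_proba output: columns are [CN, CR, TN, TR],
# and each row sums to 1.
y_prob_demo = np.array([
    [0.2, 0.1, 0.3, 0.4],
    [0.1, 0.5, 0.3, 0.1],
])

# +1 for CN and TR, -1 for CR and TN: the same formula as the loop,
# applied to every row at once.
weights = np.array([1, -1, -1, 1])
uplift_demo = y_prob_demo @ weights
print(uplift_demo)
```

On a real frame the result could be assigned directly, for example `raw_data['uplift_score'] = y_prob @ weights`.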
Model evaluation
# Model evaluation
# Record the baseline uplift of the campaign
calc_uplift(raw_data)

Discount Conversion Uplift: 7.66%
------------------------------------------------------------
BOGO Conversion Uplift: 4.52%
# High uplift score: customers whose uplift score > the 3/4 quantile
raw_data_lift = raw_data.copy()
uplift_q_75 = raw_data_lift.uplift_score.quantile(0.75)
raw_data_lift = raw_data_lift[raw_data_lift.uplift_score > uplift_q_75].reset_index(drop=True)
# Calculate the uplift for the top 1/4 of users
calc_uplift(raw_data_lift)

Discount Conversion Uplift: 30.29%
------------------------------------------------------------
BOGO Conversion Uplift: 26.18%

For the top 25% of users by uplift score, the conversion uplift (30.29% for Discount, 26.18% for BOGO) is well above the baseline (7.66% and 4.52%).
# Low uplift score: customers whose uplift score < the 1/2 quantile
raw_data_lift = raw_data.copy()
uplift_q_50 = raw_data_lift.uplift_score.quantile(0.5)
raw_data_lift = raw_data_lift[raw_data_lift.uplift_score < uplift_q_50].reset_index(drop=True)
# Calculate the uplift for the bottom 1/2 of users
calc_uplift(raw_data_lift)

Discount Conversion Uplift: -3.87%
------------------------------------------------------------
BOGO Conversion Uplift: -6.03%

For the bottom 50% of users by uplift score, the conversion uplift is negative (-3.87% for Discount, -6.03% for BOGO), well below the baseline.
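The two cuts above (top 25% and bottom 50%) can be generalized to any number of score buckets. A minimal sketch, assuming a frame with `uplift_score`, `campaign_group`, and `conversion` columns as in `raw_data`; the function name `uplift_by_bucket` and the 16-row toy frame are made up for illustration, and the sketch pools all offers into one treated group rather than separating Discount and BOGO.

```python
import pandas as pd

def uplift_by_bucket(df, score_col='uplift_score', n_buckets=4):
    """Treated-minus-control conversion rate within each score quantile bucket."""
    out = df.copy()
    # Bucket 0 holds the lowest scores; duplicates='drop' tolerates tied edges.
    out['bucket'] = pd.qcut(out[score_col], q=n_buckets, labels=False, duplicates='drop')
    # Mean conversion per (bucket, group), with one column per group.
    rates = (out.groupby(['bucket', 'campaign_group'])['conversion']
                .mean()
                .unstack('campaign_group'))
    rates['uplift'] = rates[1] - rates[0]
    return rates

# Hypothetical toy data: 16 users, alternating control (0) and treated (1).
toy = pd.DataFrame({
    'uplift_score':   list(range(16)),
    'campaign_group': [i % 2 for i in range(16)],
    'conversion':     [1, 0, 1, 0,  1, 1, 0, 0,  0, 1, 0, 0,  0, 1, 0, 1],
})
result = uplift_by_bucket(toy)
print(result)
```

If the score ranks users well, the `uplift` column should rise from the lowest bucket to the highest, as it does in this toy frame by construction.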
Result display
# Score distribution plot
def plot_score_hist(df, y_col, score_col, cutoff=None):
    """
    df: data set (including y_col and score_col columns)
    y_col: the field name of the target variable
    score_col: the field name of the score
    cutoff: optional score threshold, drawn as a vertical line
    return: score distribution plot for the different types of users
    """
    # Preprocessing: split the scores by class
    x1 = df[df[y_col]==0][score_col]
    x2 = df[df[y_col]==1][score_col]
    x3 = df[df[y_col]==2][score_col]
    x4 = df[df[y_col]==3][score_col]
    # Drawing
    plt.title('Uplift Score Hist')
    sns.kdeplot(x1, shade=True, label='CN')
    sns.kdeplot(x2, shade=True, label='CR')
    sns.kdeplot(x3, shade=True, label='TN')
    sns.kdeplot(x4, shade=True, label='TR')
    if cutoff is not None:
        plt.axvline(x=cutoff)
    plt.legend()
    return plt

plot_score_hist(raw_data, 'target_class', 'uplift_score')
plt.show()
TR and CN users tend to have higher uplift scores, but the overall separation between the classes is only moderate.
Summary
In essence, the approach trains a multi-class classifier and then computes uplift from the predicted class probabilities according to the formula. Marketing stimulation can then be targeted at users with a high uplift_score, with the threshold chosen based on business needs or the shape of the data.
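As one possible way to turn the score into a decision, the sketch below flags the top share of users by quantile, mirroring the 0.75-quantile cut used in the evaluation. The names `select_for_coupon`, `top_share`, and `send_coupon`, and the scores themselves, are hypothetical; the right share would depend on the business constraints.

```python
import pandas as pd

def select_for_coupon(df, score_col='uplift_score', top_share=0.25):
    """Flag users whose score is above the (1 - top_share) quantile."""
    threshold = df[score_col].quantile(1 - top_share)
    flagged = df.assign(send_coupon=df[score_col] > threshold)
    return flagged, threshold

# Hypothetical scores for eight users.
scores = pd.DataFrame({'uplift_score': [0.9, -0.2, 0.1, 0.5, -0.6, 0.3, 0.7, 0.0]})
flagged, threshold = select_for_coupon(scores, top_share=0.25)
print(threshold)
print(flagged[flagged.send_coupon])
```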
encourage each other~