Marketing gain based on an uplift model

Xiao O: Xiao H, we have been running a promotion recently and issuing coupons to some user groups. The boss says some users will not buy whether or not they get a coupon, so we should not waste coupons on them. Is there a way to identify these users?

Xiao H: This is a typical marketing gain problem. Users are generally divided into four categories: Treatment Responders (TR, who receive a discount and buy), Treatment Non-Responders (TN, who receive a discount and do not buy), Control Responders (CR, who receive no discount and buy), and Control Non-Responders (CN, who receive no discount and do not buy). Among them, Treatment Responders are called the marketing-sensitive group.

We need to identify the TR group and target it with marketing stimulation. For the TN and CR groups, coupons can be withheld to reduce costs. The CN group is a special case: to keep costs as low as possible, you can choose not to issue coupons; to increase user conversion, you can choose to issue coupons as a stimulus.

The core question is how to calculate uplift. This article assumes that coupons are issued to the CN group, so the CN probability counts positively in the score:

uplift = P_{TR} + P_{CN} - P_{TN} - P_{CR}

This article refers to "Uplift model to improve user growth".
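
To make the formula concrete, here is a minimal sketch that evaluates it for a single user. The function name `uplift_from_probs` and the example probabilities are illustrative assumptions, not values from the data set used below.

```python
def uplift_from_probs(p_cn, p_cr, p_tn, p_tr):
    """Uplift score from the four class probabilities (assumed to sum to 1)."""
    return p_tr + p_cn - p_tn - p_cr

# Hypothetical user: likely to buy with a coupon, unlikely to buy without one
score = uplift_from_probs(p_cn=0.40, p_cr=0.05, p_tn=0.15, p_tr=0.40)
print(round(score, 2))
```

Because the four probabilities sum to 1, the score always lies between -1 and 1; a user whose four probabilities are equal gets a score of 0.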

Data exploration

import pandas as pd
import numpy as np
import toad
from sklearn.model_selection import train_test_split # train/test split
import xgboost as xgb
import matplotlib.pyplot as plt # used by the plotting code in the results section
import seaborn as sns

# Import custom modules
import sys
sys.path.append("/Users/heinrich/Desktop/Heinrich-blog/Data Analysis Manual")
from keyIndicatorMapping import *

Readers who need the data used below can follow the official account HsuHeinrich and reply [data mining-uplift] to receive it automatically.

# Read data
raw_data=pd.read_csv('data.csv')
raw_data.head()

# View data information
toad.detector.detect(raw_data)

Feature engineering

# Map string categories to integer codes
# ('Surburban' is kept as spelled, since the key must match the value in the data)
zip_code_dic={
    'Surburban': 1,
    'Urban': 2,
    'Rural': 3
}
channel_dic={
    'Web': 1,
    'Phone': 2,
    'Multichannel': 3
}
offer_dic={
    'Buy One Get One': 1,
    'Discount': 2,
    'No Offer': 0
}

raw_data=raw_data.replace({'zip_code': zip_code_dic,
                          'channel': channel_dic,
                          'offer':offer_dic})

# Compute campaign uplift
def calc_uplift(df):
    # Conversion rate for each offer type (0 = No Offer, 1 = Buy One Get One, 2 = Discount)
    base_conv = df[df.offer == 0]['conversion'].mean()
    disc_conv = df[df.offer == 2]['conversion'].mean()
    bogo_conv = df[df.offer == 1]['conversion'].mean()

    # Uplift of each offer: its conversion rate minus the no-offer baseline
    disc_conv_uplift = disc_conv - base_conv
    bogo_conv_uplift = bogo_conv - base_conv

    print('Discount Conversion Uplift: {0}%'.format(np.round(disc_conv_uplift*100,2)))

    # Report BOGO only when the subset contains BOGO users
    if len(df[df.offer == 1]['conversion']) > 0:
        print('-'*60)
        print('BOGO Conversion Uplift: {0}%'.format(np.round(bogo_conv_uplift*100,2)))

# Classify according to the calculation method of uplift score

# Set control-experimental group
raw_data['campaign_group'] = 1
raw_data.loc[raw_data.offer == 0, 'campaign_group'] = 0
# Classification
raw_data['target_class'] = 0 # CN
raw_data.loc[(raw_data.campaign_group == 0) & (raw_data.conversion == 1),'target_class'] = 1 # CR
raw_data.loc[(raw_data.campaign_group == 1) & (raw_data.conversion == 0),'target_class'] = 2 # TN
raw_data.loc[(raw_data.campaign_group == 1) & (raw_data.conversion == 1),'target_class'] = 3 # TR

# model data
df_model = raw_data.drop(['offer','campaign_group','conversion'], axis=1)
df_model.head()
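
The four-class labelling above follows a simple pattern: the class code equals 2 * campaign_group + conversion. The sketch below shows this on a toy frame; the rows are made up for illustration, and `label_names` is a hypothetical helper mapping that does not appear in the original code.

```python
import pandas as pd

# Toy frame with the same column names as the article; the rows are made up
toy = pd.DataFrame({
    'offer':      [0, 0, 1, 2, 1, 2],
    'conversion': [0, 1, 0, 0, 1, 1],
})

# 1 if the user received any offer, 0 for the control group
toy['campaign_group'] = (toy['offer'] != 0).astype(int)

# 0 = CN, 1 = CR, 2 = TN, 3 = TR
toy['target_class'] = 2 * toy['campaign_group'] + toy['conversion']

# Readable names, handy when inspecting class counts
label_names = {0: 'CN', 1: 'CR', 2: 'TN', 3: 'TR'}
toy['target_name'] = toy['target_class'].map(label_names)
print(toy)
```

Running `toy['target_name'].value_counts()` on the real data is a quick way to check how imbalanced the four classes are before training.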

Data modeling

# Train/test split
X = df_model.drop(['target_class'],axis=1)
y = df_model.target_class
X_train, X_test, y_train, y_test = train_test_split(X,
            y, test_size=0.2, random_state=56)

# XGB classification model training
# Note: for multi-class targets XGBClassifier switches the objective to
# 'multi:softprob' (see the printed model below), so predict_proba returns class probabilities
param_dist = {'objective':'multi:softmax', 'eval_metric':'logloss', 'use_label_encoder':False}
model_xgb = xgb.XGBClassifier(**param_dist)
model_xgb.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
              gamma=0, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=8,
              num_parallel_tree=1, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

# Calculate uplift_score = P(CN) + P(TR) - P(CR) - P(TN)
# predict_proba columns follow the class order 0=CN, 1=CR, 2=TN, 3=TR
y_prob = model_xgb.predict_proba(X)
raw_data['uplift_score'] = y_prob[:, 0] + y_prob[:, 3] - y_prob[:, 1] - y_prob[:, 2]
raw_data.head()

Model evaluation

# Model evaluation
# Baseline: campaign uplift over all users
calc_uplift(raw_data)

Discount Conversion Uplift: 7.66%
------------------------------------------------------------
BOGO Conversion Uplift: 4.52%

# High uplift score: users whose uplift score is above the 75th percentile
raw_data_lift = raw_data.copy()
uplift_q_75 = raw_data_lift.uplift_score.quantile(0.75)
raw_data_lift = raw_data_lift[raw_data_lift.uplift_score > uplift_q_75].reset_index(drop=True)
# Calculate the uplift for the top 25% of users
calc_uplift(raw_data_lift)

Discount Conversion Uplift: 30.29%
------------------------------------------------------------
BOGO Conversion Uplift: 26.18%

The conversion uplift for the top 25% of users by uplift score is significantly higher than the baseline.

# Low uplift score: users whose uplift score is below the median
raw_data_lift = raw_data.copy()
uplift_q_50 = raw_data_lift.uplift_score.quantile(0.5)
raw_data_lift = raw_data_lift[raw_data_lift.uplift_score < uplift_q_50].reset_index(drop=True)
# Calculate the uplift for the bottom 50% of users
calc_uplift(raw_data_lift)

Discount Conversion Uplift: -3.87%
------------------------------------------------------------
BOGO Conversion Uplift: -6.03%

The conversion uplift for the bottom 50% of users by uplift score is significantly lower than the baseline, and is in fact negative for both offers.
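
The top-quarter and bottom-half checks above can be generalized into a table of observed uplift per score bin. The following is a minimal sketch under stated assumptions: it uses a binary treatment flag `campaign_group` (any offer versus no offer) instead of separating Discount and BOGO, the function name `uplift_by_bin` is hypothetical, and the data is synthetic, generated so that the treatment effect grows with the score.

```python
import numpy as np
import pandas as pd

def uplift_by_bin(df, score_col='uplift_score', n_bins=4):
    """Observed conversion uplift (treated minus control) per score quantile bin."""
    out = df.copy()
    # Bin 0 holds the lowest scores, bin n_bins - 1 the highest
    out['score_bin'] = pd.qcut(out[score_col], q=n_bins, labels=False, duplicates='drop')
    rates = (out.groupby(['score_bin', 'campaign_group'])['conversion']
                .mean()
                .unstack('campaign_group'))
    rates = rates.rename(columns={0: 'control_conv', 1: 'treated_conv'})
    rates['uplift'] = rates['treated_conv'] - rates['control_conv']
    return rates

# Synthetic data: the treatment effect grows with the score by construction
rng = np.random.default_rng(0)
n = 20000
demo = pd.DataFrame({
    'uplift_score': rng.uniform(-1, 1, n),
    'campaign_group': rng.integers(0, 2, n),
})
p_buy = 0.10 + 0.10 * demo['campaign_group'] * (demo['uplift_score'] + 1)
demo['conversion'] = (rng.uniform(0, 1, n) < p_buy).astype(int)

print(uplift_by_bin(demo))
```

If the score is informative, the `uplift` column should rise from the lowest bin to the highest. Applied to the article's data, the same function could be called on `raw_data` once `uplift_score` has been added.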

Results

# Score distribution plot
def plot_score_hist(df, y_col, score_col, cutoff=None):
    """
    df: data set (must contain y_col and score_col)
    y_col: column name of the target class (0=CN, 1=CR, 2=TN, 3=TR)
    score_col: column name of the score
    cutoff: optional score threshold, drawn as a vertical line

    return: plt with the score density of each user class drawn
    """
    # Split the scores by user class
    x1 = df[df[y_col]==0][score_col]
    x2 = df[df[y_col]==1][score_col]
    x3 = df[df[y_col]==2][score_col]
    x4 = df[df[y_col]==3][score_col]
    # Draw one kernel density curve per class
    # (fill=True replaces the deprecated shade=True; requires seaborn >= 0.11)
    plt.title('Uplift Score Hist')
    sns.kdeplot(x1, fill=True, label='CN')
    sns.kdeplot(x2, fill=True, label='CR')
    sns.kdeplot(x3, fill=True, label='TN')
    sns.kdeplot(x4, fill=True, label='TR')
    if cutoff is not None:
        plt.axvline(x=cutoff)
    plt.legend()
    return plt

plot_score_hist(raw_data, 'target_class', 'uplift_score')
plt.show()

[Figure: density of uplift_score for each user class (CN, CR, TN, TR)]

TR and CN users tend to have higher uplift scores, but the overall separation between the classes is only moderate.

Summary

In essence, the approach trains a multi-class classifier and then computes uplift from the predicted class probabilities using the formula. Marketing stimulation can then be targeted at users with a high uplift_score, with the threshold chosen according to business needs or the shape of the score distribution.

Let's encourage each other~