Causal Inference: Principle and Application of the Uplift Model (3)

Table of Contents

1. Application Scenarios of the Uplift Model

2. Principle and Modeling Method of the Uplift Model

2.1 Modeling Goals

2.2 Modeling Methods

1. Dual Model (Differential Response Model)

2. Label Transformation (Class Transformation Method)

2.3 Model Evaluation

1. Uplift histogram

2. Gini curve

3. How to Implement It in Python

3.1 Data Input and Simple Descriptive Analysis

3.2 Modeling: Dual Model

3.3 Uplift Histogram

3.4 Gini Curve and AUUC


1. Application Scenarios of the Uplift Model

Refined operations have now spread to every industry. How to spend marketing budget on the users who are genuinely swayed by an operational strategy, rather than wasting it on users who would convert anyway, is a central problem in precision marketing and an important lever for improving return on investment. Below, we use the marketing four-quadrant theory to explain how to approach this problem:

According to this theory, the marketing population can be divided into four categories, based on how users behave with and without intervention:

Marketing-sensitive users: convert with marketing intervention, and do not convert without it.

Naturally converting users: convert with or without marketing intervention.

Indifferent users: do not convert, with or without marketing intervention.

Marketing-averse users: convert without marketing intervention, but do not convert with it.

For this problem, what we need to identify is the marketing-sensitive group: investing the marketing budget in these users is what brings an “incremental” effect. The Uplift Model is a relatively mature industry method for solving this problem. It is a gain model whose prediction target is the increase in a user’s conversion rate under a strategic intervention; the larger the increase, the more marketing-sensitive the user.

2. Principle and Modeling Method of the Uplift Model

2.1 Modeling Goals

The Uplift Model predicts the causal effect of an intervention on an individual’s state or behavior, namely the ITE (individual treatment effect). As a formula:

p(Y_{i}|X_{i},T_{i} = 1) - p(Y_{i}|X_{i},T_{i} = 0)

Here, Y is the outcome with/without intervention, X denotes user features, and T indicates whether the user receives the intervention. The modeling target is the difference between a user’s conversion probabilities with and without the intervention. The difference from response models such as LR/RF/XGB/LGB is that those models predict the probability of an event, while the Uplift Model predicts the increase in that probability. The difficulty lies in the sample data: we can never observe the same user’s conversion under both the intervention and non-intervention conditions. An AB experiment solves this: through random traffic splitting, the individual causal effect can be replaced by the group-level causal effect, the CATE:

\tau(X_{i})=E(Y_{i}(1)-Y_{i}(0)|X_{i})
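For intuition, here is a quick worked example with assumed numbers: if a user’s predicted conversion probability is 0.30 with the intervention and 0.10 without it, then

\tau(X_{i}) = 0.30 - 0.10 = 0.20

so the intervention lifts this user’s conversion probability by 20 percentage points, and users can be ranked by exactly this quantity.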

2.2 Modeling Methods

1. Dual Model (Differential Response Model)

That is, build one model on the experimental-group users and one on the control-group users, each predicting a user’s conversion probability with/without the intervention. A user to be scored is run through both models, and the difference between the two predicted probabilities is the Uplift Score.

The advantage of this approach is simplicity: two ordinary classification models are trained separately, and the Uplift Score is just the difference of their outputs.

The disadvantages: errors from the two models accumulate easily, and the training objective is the response rather than the uplift, so the ability to recognize uplift is weak.
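To make the idea concrete, here is a minimal sketch of the dual-model approach on synthetic data (the toy data-generating process and names such as model_t/model_c are illustrative, not from the original article):

# A minimal sketch of the dual model (T-learner) on synthetic data
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size = (1000, 5)) # toy user features
t = rng.integers(0, 2, size = 1000) # toy random treatment assignment
y = (rng.random(1000) < 0.1 + 0.1 * t * (X[:, 0] > 0)).astype(int) # toy outcome

# One response model per group
model_t = RandomForestClassifier(random_state = 0).fit(X[t == 1], y[t == 1])
model_c = RandomForestClassifier(random_state = 0).fit(X[t == 0], y[t == 0])

# Uplift Score = P(Y = 1 | treated) - P(Y = 1 | control)
uplift_score = model_t.predict_proba(X)[:, 1] - model_c.predict_proba(X)[:, 1]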

2. Label Transformation (Class Transformation Method)

This method unifies the data and models of the experimental and control groups and predicts the Uplift Score \tau(X_{i}) directly. Define a variable Z with the following values:

Z = \begin{cases} 1 & \text{if } G = T \text{ and } Y = 1 \\ 1 & \text{if } G = C \text{ and } Y = 0 \\ 0 & \text{otherwise} \end{cases}

Predicting \tau(X) = P^{T}(Y=1|X) - P^{C}(Y=1|X) can then be transformed into predicting P(Z = 1|X).

The derivation of this transformation is as follows (optional reading). In an AB experiment, user features and the intervention assignment are independent of each other, so P(T|X) = P(T) and P(C|X) = P(C), and the prediction target can be rewritten as:

\begin{aligned} P(Z = 1|X) &= P(Z = 1|X,T)P(T|X) + P(Z = 1|X,C)P(C|X) \\ &= P(Y = 1|X,T)P(T|X) + P(Y = 0|X,C)P(C|X) \\ &= P^{T}(Y = 1|X)P(T) + P^{C}(Y = 0|X)P(C) \end{aligned}

If, in addition, the AB traffic split between the experimental group and the control group is even, then P(T) = P(C) = \frac{1}{2}: every user is equally likely to be assigned to either group, and the formula becomes:

\begin{aligned} P(Z = 1|X) &= P^{T}(Y = 1|X)P(T) + P^{C}(Y = 0|X)P(C) \\ &= \frac{1}{2}P^{T}(Y = 1|X) + \frac{1}{2}P^{C}(Y = 0|X) \\ &= \frac{1}{2}P^{T}(Y = 1|X) + \frac{1}{2}\left(1 - P^{C}(Y = 1|X)\right) \end{aligned}

Therefore, our target variable can be transformed into the following form:

\begin{aligned} \tau(X) &= P^{T}(Y = 1|X) - P^{C}(Y = 1|X) \\ &= 2P(Z = 1|X) - 1 \end{aligned}
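To make the method concrete, here is a minimal sketch under the assumptions above (a 50/50 AB split; df, feature_cols, 'treatment' with 1 = experimental / 0 = control, and 'target' are illustrative names, not from the code in section 3):

# A minimal sketch of the class transformation method, assuming a 50/50 AB split
from sklearn.ensemble import RandomForestClassifier

def class_transform_uplift(df, feature_cols, treatment_col = 'treatment', target_col = 'target'):
    # Z = 1 for treated converters and untreated non-converters, else 0
    z = (((df[treatment_col] == 1) & (df[target_col] == 1)) |
         ((df[treatment_col] == 0) & (df[target_col] == 0))).astype(int)
    clf = RandomForestClassifier(random_state = 0).fit(df[feature_cols], z)
    # tau(X) = 2 * P(Z = 1 | X) - 1
    return 2 * clf.predict_proba(df[feature_cols])[:, 1] - 1

For brevity the sketch fits and scores on the same frame; in practice the model would be fit on a training split and scored on held-out users.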

2.3 Model Evaluation

For a response model, we have predicted values and true values on the test set, so evaluation can use AUC, recall, precision, and so on. For the Uplift Model, we cannot observe a user’s conversion both with and without the intervention, so there is no per-user ground truth and it must be constructed. The construction uses the predicted Uplift Score as a bridge: users whose scores fall in the same range are aligned, and since that range contains both experimental-group and control-group users, the real conversion rate with intervention and the real conversion rate without intervention can each be computed. Their difference is the real gain for that score range.

1. Uplift histogram

Sort the predicted Uplift Scores on the test set in descending order and split them into ten equal bins (deciles). In each decile, compute the conversion rates of the experimental group and the control group; their difference is the real gain for that decile. Plotting these values gives the uplift histogram, from which the marketing-sensitive groups can be selected, and which also shows for which top-N share of users a marketing intervention brings an incremental effect.
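As a compact sketch of this computation (scored, 'uplift_score', 'treatment' with 1/0, and 'target' are illustrative names; section 3.3 below walks through the full version on the Kaggle data):

# A compact sketch of the uplift histogram table
import pandas as pd

def uplift_by_decile(scored):
    scored = scored.copy()
    # decile 1 = highest predicted uplift; rank() breaks ties so the qcut bins stay even
    scored['decile'] = 10 - pd.qcut(scored['uplift_score'].rank(method = 'first'), q = 10, labels = False)
    rates = scored.groupby(['decile', 'treatment'])['target'].mean().unstack()
    rates['real_uplift'] = rates[1] - rates[0] # treated rate minus control rate
    return rates

Plotting the real_uplift column as a bar chart gives the uplift histogram.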

2. Gini curve

The gini curve is another way to measure the effect of an uplift model; the area under it gives the AUUC (analogous to AUC). It is computed as follows:

  • As with the uplift histogram above, prepare the decile-level results for the experimental group and the control group
  • Calculate Q(i) = \frac{N_{t}(i)}{N_{t}} - \frac{N_{c}(i)}{N_{c}}, where i denotes the top i% of users ranked by predicted uplift; N_{t} and N_{c} are the total numbers of users in the experimental and control groups, N_{t}(i) is the number of converted users among the top i% of experimental-group users, and N_{c}(i) is the number of converted users among the top i% of control-group users
  • The curve with i% on the x-axis and Q(i) on the y-axis is the gini curve, and the area it encloses is the AUUC; a compact sketch follows this list
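Here is that sketch, at user granularity (column names again illustrative; section 3.4 below implements the decile version):

# A compact sketch of Q(i) and AUUC
import numpy as np

def auuc_score(scored):
    df = scored.sort_values('uplift_score', ascending = False) # highest predicted uplift first
    treated = (df['treatment'] == 1).to_numpy()
    y = df['target'].to_numpy()
    n_t, n_c = treated.sum(), (~treated).sum()
    # Q(i) = N_t(i)/N_t - N_c(i)/N_c over the top-i ranked users
    q = np.cumsum(y * treated) / n_t - np.cumsum(y * ~treated) / n_c
    frac = np.arange(1, len(df) + 1) / len(df) # i as a fraction of all users
    gini_area = np.trapz(np.r_[0.0, q], np.r_[0.0, frac]) # area under the gini curve
    random_area = q[-1] / 2 # triangle under the straight "random" line
    return gini_area - random_area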

3. How to Implement It in Python

Next, let’s see how to implement the gain model in Python. The dataset used by the code below comes from Kaggle: Marketing Promotion Campaign Uplift Modeling | Kaggle

The following code performs well on real data at work, but its performance on this public dataset is poor; treat the code as a reference implementation.

3.1 Data Input and Simple Descriptive Analysis

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# 1. Read data
# The dataset comes from https://www.kaggle.com/datasets/davinwijaya/customer-retention
data = pd.read_csv('./data.csv')
data.head()


# 2. View discrete variables and their value counts
for column in data.drop(columns = ['recency', 'history']):
    print(column, '\n', data[column].value_counts(), '\n')

# 3. Look at the distributions of the continuous variables
con_columns = ['recency', 'history']

fig = plt.figure(figsize = (16, 10))
ax1 = fig.add_subplot(121)
data['recency'].hist(ax = ax1)
ax1.set_title('recency distribution')

ax2 = fig.add_subplot(122)
data['history'].hist(ax = ax2)
ax2.set_title('history distribution')


# 4. Modify the column names of target variables and intervention variables
df_model = data.rename(columns = {'conversion': 'target'}) # target variable
df_model = df_model.rename(columns = {'offer': 'treatment'}) # Whether to intervene

df_model.treatment = df_model.treatment.map({'No Offer': 0, 'Buy One Get One': -1, 'Discount': 1}) # treatment reassignment

# Convert discrete variables to dummy variables
df_model = pd.get_dummies(df_model)
df_model.head()

# Look at the conversion rate under each intervention condition
df_model.groupby('treatment').agg({'target': 'mean'})



# 5. Define a method: look at the conversion-rate uplift within each feature segment
def get_every_group_up(data, dim, ab_tag):
    # Group by the input dimension dim and the AB-group field ab_tag, then aggregate conversions and user counts
    dim_data = data.groupby(by = [dim, ab_tag]).agg({'target': 'sum', 'treatment': 'count'})
    dim_data.columns = ['target', 'cnt'] # rename the treatment column to cnt (number of users)
    dim_data['pct'] = dim_data['target']/dim_data['cnt'] # conversion rate

    # Conversion rates of the experimental group and the control group for each segment
    dim_data_b = dim_data['pct'].xs(-1, level = 1)
    dim_data_a = dim_data['pct'].xs(0, level = 1)

    # Share of users in each segment
    dim_data_pct = data[dim].value_counts()/data[dim].count()

    dim_data_df = pd.concat([dim_data_pct, dim_data_a, dim_data_b], axis = 1)

    # Rename columns: segment share, control-group conversion rate, experimental-group conversion rate
    dim_data_df.columns = ['proportion', 'pct_a', 'pct_b']
    dim_data_df['uplift'] = dim_data_df['pct_b'] - dim_data_df['pct_a']

    return dim_data_df


# Check the conversion-rate uplift of the experimental group (Buy One Get One) vs. the control group (No Offer) across recency segments
uplift_data = df_model[df_model['treatment'].isin([-1, 0])].copy() # .copy() avoids SettingWithCopyWarning on the qcut assignment below

uplift_data['recency_bins'], updown = pd.qcut(uplift_data['recency'], q = 5, retbins = True, duplicates = 'drop')
diff_recency_bins_df = get_every_group_up(uplift_data, 'recency_bins', 'treatment')
diff_recency_bins_df.style.background_gradient()

3.2 Modeling: Dual Model

Shared helper

# Define a method to plot the model's ROC curve, threshold, etc.; convenient for checking the effect while tuning model parameters
def plot_model_result(preds, y_test):
    # ROC-curve-related metrics
    FPR, recall, thresholds = roc_curve(y_test, preds, pos_label = 1)
    area = roc_auc_score(y_test, preds)

    plt.figure()
    plt.plot(FPR, recall, color = 'r', label = 'roc curve (area = %0.4f)' % area)
    plt.plot([0, 1], [0, 1], color = 'black', linestyle = '--')
    # Mark the threshold that maximizes recall - FPR (Youden's index)
    maxindex = (recall - FPR).tolist().index(max(recall - FPR))
    plt.scatter(FPR[maxindex], recall[maxindex], c = 'black', s = 30)
    print('threshold', thresholds[maxindex])
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('Recall')
    plt.title('ROC Curve')
    plt.legend(loc = 'best')
    plt.show()

    ypred = preds.copy()
    ypred[preds > thresholds[maxindex]] = 1
    ypred[ypred != 1] = 0
    print('classification_report\n', classification_report(y_test, ypred, digits = 4))
#     print('confusion_matrix\n', confusion_matrix(y_test, ypred))
#     return thresholds[maxindex] # return the threshold

Modeling

# 1. Model data preparation; take Buy One Get One, i.e., treatment = -1, as the example
# Control group (errors = 'ignore' tolerates bin columns that were not created above)
uplift_data_a = uplift_data[uplift_data['treatment'] == 0].drop(columns = ['recency_bins', 'history_bins'], errors = 'ignore')

# Experimental group
uplift_data_b = uplift_data[uplift_data['treatment'] == -1].drop(columns = ['recency_bins', 'history_bins'], errors = 'ignore')

x_a = uplift_data_a.loc[:, uplift_data_a.columns.difference(['treatment', 'target'])]
y_a = uplift_data_a['target']


# Divide training set and test set
x_train_a, x_test_a, y_train_a, y_test_a = train_test_split(x_a, y_a, test_size = 0.4, random_state = 20000)

# Modeling Here we take RandomForest as an example
clf_rf_a = RandomForestClassifier(n_estimators = 100, random_state = 200, max_depth = 10, min_samples_leaf = 10)
clf_rf_a.fit(x_train_a, y_train_a)

# Predict and see the effect on the test set
y_pre_test_a = clf_rf_a.predict_proba(x_test_a)
plot_model_result(y_pre_test_a[:, 1], y_test_a)



# Experimental group modeling, analogous to the control group
# Dataset for the experimental group
x_b = uplift_data_b.loc[:, uplift_data_b.columns.difference(['treatment', 'target'])]
y_b = uplift_data_b['target']

x_train_b, x_test_b, y_train_b, y_test_b = train_test_split(x_b, y_b, test_size = 0.4, random_state = 20000)

clf_rf_b = RandomForestClassifier(n_estimators = 100, random_state = 200, max_depth = 10, min_samples_leaf = 10)
clf_rf_b.fit(x_train_b, y_train_b)

y_pre_test_b = clf_rf_b.predict_proba(x_test_b)
plot_model_result(y_pre_test_b[:, 1], y_test_b)

3.3 Uplift Histogram

# 1. On the test set, score both groups with both models
# For the control-group users x_test_a, use the experimental-group model clf_rf_b to get the conversion rate with intervention
pre_a_under_b_model = clf_rf_b.predict_proba(x_test_a)


# For the experimental-group users x_test_b, use the control-group model clf_rf_a to get the conversion rate without intervention
pre_b_under_a_model = clf_rf_a.predict_proba(x_test_b)


# 2. Attach the two predictions to the control-group and experimental-group datasets
# Control group: pre_a = conversion rate without intervention, pre_b = conversion rate with intervention
# (column 1 of predict_proba is P(Y = 1))
x_test_a['pre_a'] = y_pre_test_a[:, 1]
x_test_a['pre_b'] = pre_a_under_b_model[:, 1]
x_test_a.head()

test_a = pd.concat([x_test_a, y_test_a], axis = 1)
test_a.head()


# Experimental group: pre_b comes from its own model clf_rf_b, pre_a from the control-group model clf_rf_a
x_test_b['pre_b'] = y_pre_test_b[:, 1]
x_test_b['pre_a'] = pre_b_under_a_model[:, 1]
x_test_b.head()

test_b = pd.concat([x_test_b, y_test_b], axis = 1)
test_b.head()


# 3. Calculate the predicted uplift score
test_a['uplift_score'] = test_a['pre_b'] - test_a['pre_a']
test_b['uplift_score'] = test_b['pre_b'] - test_b['pre_a']

# 4. Merge the two groups' data and split into deciles together
test_ab_all = pd.concat([test_a, test_b])
test_ab_all = pd.concat([test_ab_all, data[['channel', 'zip_code', 'offer']]], axis = 1, join = 'inner')
test_ab_all['treatment'] = test_ab_all['offer'].map(lambda x: 0 if x == 'No Offer' else 1)
test_ab_all['uplift_bins'], updown = pd.qcut(test_ab_all['uplift_score'], q = 10, retbins = True, duplicates = 'drop', labels = np.arange(1, 11))
test_ab_all

# 5. Calculate the real conversion rate of the experimental group and the control group in each decile: c_pct = target sum / record count in the decile
test_ab_all_group = test_ab_all.groupby(by = ['uplift_bins', 'treatment']).agg({'history': 'count', 'target': 'sum', 'pre_b': 'mean', 'pre_a': 'mean', 'uplift_score': 'mean'})
test_ab_all_group['c_pct'] = test_ab_all_group['target']/test_ab_all_group['history']


test_ab_all_group.columns = ['user_num', 'c_user_num', 'pre_b_mean', 'pre_a_mean', 'uplift_score_mean', 'c_pct']
test_ab_all_group

# 6. Separate the experimental group and control group data from the above data
b_group = test_ab_all_group.loc[(slice(None), [1]),:]
a_group = test_ab_all_group.loc[(slice(None), [0]),:]

# multiindex processing method
a_group_ = a_group.xs(key = 0, level = 1)
b_group_ = b_group.xs(key = 1, level = 1)

# 7. Join the control-group conversion rate for each decile onto the experimental-group data to compute the real gain per decile
b_group_ = pd.concat([b_group_, a_group_['c_pct']], axis = 1)
b_group_.columns = ['user_num', 'c_user_num', 'pre_b_mean', 'pre_a_mean', 'uplift_score_mean', 'b_c_pct', 'a_c_pct']

# real_diff_pct is the real gain, the conversion rate of the experimental group after alignment - the conversion rate of the control group
b_group_['real_diff_pct'] = b_group_['b_c_pct'] - b_group_['a_c_pct']
b_group_.index = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1] # relabel so that decile 1 = the highest predicted uplift
b_group_

plt.bar(b_group_.index, b_group_['real_diff_pct'])

3.4 Gini Curve and AUUC

# 1. Split each group into uplift-score deciles separately
test_a['uplift_bins'], updown = pd.qcut(test_a['uplift_score'], q = 10, retbins = True, duplicates = 'drop', labels = np.arange(1, 11) )

test_b['uplift_bins'], updown = pd.qcut(test_b['uplift_score'], q = 10, retbins = True, duplicates = 'drop', labels = np.arange(1, 11) )


# 2. Aggregate by uplift decile: the number of users, real conversions, and conversion rate in each decile
test_a_group = test_a.groupby(by = ['uplift_bins']).agg({'history': 'count', 'target': 'sum'})
test_a_group.columns = ['user_num', 'c_user_num']

test_a_group['c_pct'] = test_a_group['c_user_num']/test_a_group['user_num']
test_a_group

test_b_group = test_b.groupby(by = ['uplift_bins']).agg({'history': 'count', 'target': 'sum'})
test_b_group.columns = ['user_num', 'c_user_num']

test_b_group['c_pct'] = test_b_group['c_user_num']/test_b_group['user_num']
test_b_group


# 3. Merge the experimental-group and control-group data and rename the columns
test_union_group = pd.concat([test_b_group, test_a_group], axis = 1)
test_union_group.columns = ['user_num_b', 'c_user_num_b', 'c_pct_b', 'user_num_a', 'c_user_num_a', 'c_pct_a']
test_union_group['uplift_score'] = test_union_group['c_pct_b'] - test_union_group['c_pct_a']
test_union_group

# 4. Sort so that the highest-uplift decile comes first; index i then means "top i deciles"
test_union_group = test_union_group.sort_index(ascending = False)
test_union_group.index = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test_union_group


# 5. Add the total number of users in the experimental group and control group
test_union_group['total_b'] = test_union_group['user_num_b'].sum()
test_union_group['total_a'] = test_union_group['user_num_a'].sum()

# 6. Calculate the cumulative number of converted users in the experimental group and the control group respectively
test_union_group['c_user_num_b_cums'] = test_union_group['c_user_num_b'].cumsum(axis = 0)
test_union_group['c_user_num_a_cums'] = test_union_group['c_user_num_a'].cumsum(axis = 0)

# 7. Calculate the cumulative conversion rate difference
test_union_group['q_b_i'] = test_union_group['c_user_num_b_cums']/test_union_group['total_b']

test_union_group['q_a_i'] = test_union_group['c_user_num_a_cums']/test_union_group['total_a']

test_union_group['q_i'] = test_union_group['q_b_i'] - test_union_group['q_a_i']


# 8. Import a package to draw the gini curve and compute AUUC
from scipy.integrate import simps # Simpson's rule: approximate the definite integral with quadratic curves instead of rectangles or trapezoids

x = np.r_[0, list(test_union_group.index)]
y = np.r_[0, list(test_union_group['q_i'])]

random_curve_area = simps([0, y[len(y) - 1]], [0, 10]) # area between the random curve and the x-axis
gini_curve_area = simps(y, x) # area between the gini curve and the x-axis
auuc = gini_curve_area - random_curve_area

fig, ax = plt.subplots()
ax.plot(x, y, c = 'green', label = 'gini curve')
ax.plot([0, 10], [0, y[len(y) - 1]], c = 'blue', ls = '--', label = 'random curve')
ax.text(x = 7.5, y = 0.02, s = 'auuc: %f' % auuc)
ax.legend()