Python user churn data mining: establishing logistic regression, XGBoost, random forest, decision tree, support vector machine, naive Bayes and K-means clustering user portraits…

Original link: http://tecdat.cn/?p=24346

1.1 Background:

In today's brand-marketing environment, where products are highly homogeneous, competition between enterprises is concentrated on the competition for customers (click "Read the original text" at the end of the article to get the complete data).

"Users are God" prompts many companies to compete for as many customers as possible at all costs. However, while acquiring new users at any price, companies often ignore, or have no time to address, the loss of existing customers. The result is a dilemma: on one hand new customers keep arriving, while on the other hand the customers won with so much effort are quietly slipping away. It is therefore crucial and urgent to analyze the churn of existing users, extract the important signals in the data, and help decision-makers take measures to reduce user churn.


1.2 Purpose:

Gain an in-depth understanding of user portraits and behavioral preferences, mine the key factors that drive user churn, and predict the conversion outcome of customer visits with algorithms, in order to improve product design and enhance user experience.


1.3 Data Description:

The data set covers one week of visit data from Ctrip users. To protect customer privacy, the data has been desensitized; it differs somewhat from the actual order volume, page views, conversion rate, etc., but this does not affect the solvability of the problem.

2 Read data

import pandas as pd

# Load the desensitized Ctrip data set (file name and separator assumed)
df = pd.read_csv("userlostprob.txt", sep="\t")

# Show all features
df.head()


3 Split data

from sklearn.model_selection import train_test_split

# Separate the features from the target variable label
X = df.drop(columns="label")
y = df["label"]

# Divide training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

3.1 Understanding data

You can see that there are many variables, so let's classify them first. After removing the target variable label, the fields of this data set fall into three categories: order-related indicators, customer-behavior-related indicators, and hotel-related indicators.
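For orientation, the grouping can be sketched with a handful of the field names that appear later in this article (non-exhaustive, for illustration only):

# Order-related indicators
order_cols = ["ordernum_oneyear", "ordercanceledprecent", "lasthtlordergap"]
# Customer-behavior-related indicators
behavior_cols = ["landhalfhours", "historyvisit_visit_detailpagenum", "starprefer"]
# Hotel-related indicators
hotel_cols = ["hotelcr", "hoteluv", "businessrate_pre2", "lowestprice"]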


4 Feature Engineering

# Use the training set for data exploration
train = pd.concat([X_train, y_train], axis=1)
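As a quick first exploration step, one might check how balanced the churn label is (a small addition of mine, assuming label is the 0/1 churn flag):

# Share of churned vs. retained visits in the training set
train["label"].value_counts(normalize=True)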


4.1 Data preprocessing


4.1.1 Delete unnecessary columns

# sampleid is just a row identifier with no predictive value, so drop it
X_train.pop("sampleid")
X_test.pop("sampleid")
train.pop("sampleid")


4.1.2 Data type conversion

String-type features must be converted to numeric types before they can be used in modeling. Subtracting the visit date d from the arrival date arrival yields "number of days booked in advance" as a new feature.

# Add columns
# Convert the two date variables from string to datetime type
X_train["arrival"] = pd.to_datetime(X_train["arrival"])
X_train["d"] = pd.to_datetime(X_train["d"])
X_test["arrival"] = pd.to_datetime(X_test["arrival"])
X_test["d"] = pd.to_datetime(X_test["d"])

# Generate the advance-booking-days column (derived variable)
X_train["day_advanced"] = (X_train["arrival"] - X_train["d"]).dt.days
X_test["day_advanced"] = (X_test["arrival"] - X_test["d"]).dt.days

# Delete the original date columns
X_train.drop(columns=["d", "arrival"], inplace=True)
X_test.drop(columns=["d", "arrival"], inplace=True)

4.1.3 Generate an indicator dummy variable for variables with missing values

# Columns in the training set that contain missing values
zsl = train.isnull().sum()[train.isnull().sum() != 0].index
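A minimal sketch of generating the indicator dummies (the `_isnull` naming is mine, not the original's):

# Add a 0/1 dummy column for every feature that has missing values
for col in zsl:
    train[col + "_isnull"] = train[col].isnull().astype(int)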

4.1.4 Fill in vacancies based on business experience

Filling strategies by field (counts are numbers of missing values in the data):

  • ordernum_oneyear (number of user orders within one year): fill with 0.

  • lasthtlordergap (time since the last order within one year): fill the missing values with the sentinel 600000. 242114 values are empty, in two cases: 1) new users who have never placed an order (88.42%, 214097); 2) old users who have not consumed for more than one year. Add a flag column marking new users who have never ordered and old users who have not ordered within a year.

  • ordercanncelednum (number of orders canceled within one year): fill with 0; ordercanceledprecent (order cancellation rate within one year): fill with 0.

  • price_sensitive (price sensitivity index): fill with the median; consuming_capacity (consumption power index): fill with the median. 226108 values are empty: 1) new users who have never placed an order (214097); 2) 12011 values remain unexplained for now.

  • uv_pre, cr_pre (uv / cr of the most-viewed hotel over the last 24 hours): fill with the median. 29397 values are empty: 1) the user did not log in to the APP that day (28633); 2) newly launched hotels with no uv/cr record yet (178 + 586 = 764); add a code column for these new hotels.

  • customereval_pre2 (average hotel customer rating over the last 24 hours): fill with 0; landhalfhours (login duration within 24 hours): fill with 0 (28633 empty: the user did not log in to the APP that day).

  • hotelcr, hoteluv: fill with the median (797 empty, including newly opened hotels (60) and users who did not log in to the APP (118)).

  • avgprice: fill part of the prices with 0 (users with no orders in the past year); cr: fill with 0.

# Features filled with 0 (behavior counts that are simply absent)
tkq = ["historyvisit_7ordernum", "historyvisit_visit_detailpagenum",
       "firstorder_bu", "historyvisit_totalordernum"]  # the last name is truncated in the original and assumed here
for i in tkq:
    X_train[i].fillna(0, inplace=True)
    X_test[i].fillna(0, inplace=True)

# Fill part with 0 and part with the median
# Attributes distorted by new users: price_sensitive, consuming_capacity
# For new users (new_user == 1) these indices are set to 0
n_l = ["price_sensitive", "consuming_capacity"]
for i in n_l:
    X_train.loc[X_train["new_user"] == 1, i] = 0
    X_test.loc[X_test["new_user"] == 1, i] = 0
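The remaining fills from the list in 4.1.4 can be sketched as follows (my sketch; column names are as given above, and 600000 is the sentinel from the rule above):

# Business-experience fills for the remaining columns
medians = {col: X_train[col].median()
           for col in ["consuming_capacity", "uv_pre", "cr_pre", "hotelcr", "hoteluv"]}
for df_ in (X_train, X_test):
    df_["ordernum_oneyear"].fillna(0, inplace=True)      # no orders within the year
    df_["lasthtlordergap"].fillna(600000, inplace=True)  # sentinel: no order within a year
    df_["ordercanceledprecent"].fillna(0, inplace=True)
    df_["customereval_pre2"].fillna(0, inplace=True)
    df_["landhalfhours"].fillna(0, inplace=True)         # user did not log in that day
    for col, m in medians.items():
        df_[col].fillna(m, inplace=True)                 # medians computed on the training set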

4.1.5 Outlier processing

Treat negative values in customer_value_profit and ctrip_profits as 0.
Replace negative values in delta_price1, delta_price2, and lowestprice with the median.

# Replace negative values with the training-set median
filter_two = ["delta_price1", "delta_price2", "lowestprice"]
for f in filter_two:
    a = X_train[f].median()
    X_train.loc[X_train[f] < 0, f] = a
    X_test.loc[X_test[f] < 0, f] = a
    train.loc[train[f] < 0, f] = a
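The zero-clipping of the two profit fields described above can be written the same way (a sketch; the original snippet only shows the median loop):

# Treat negative customer_value_profit / ctrip_profits as 0
for f in ["customer_value_profit", "ctrip_profits"]:
    X_train.loc[X_train[f] < 0, f] = 0
    X_test.loc[X_test[f] < 0, f] = 0
    train.loc[train[f] < 0, f] = 0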

4.1.6 Missing value filling

Fields that tend to be normally distributed are filled with the mean: businessrate_pre2, cancelrate_pre, businessrate_pre; fields with skewed distributions are filled with the median.

# Missing-value filler: mean for the roughly normal fields, median for the rest
mean_cols = ["businessrate_pre2", "cancelrate_pre", "businessrate_pre"]

def na_fill(df):
    for col in df.columns:
        if col in mean_cols:
            df[col] = df[col].fillna(X_train[col].mean())
        else:
            df[col] = df[col].fillna(X_train[col].median())
    return df

X_train = na_fill(X_train)
X_test = na_fill(X_test)

# Derived variable: annual transaction rate
X_train["onyear_dealrat"] = X_train["ordernum_oneyear"] / X_train["visitnum_oneyear"]
X_test["onyear_dealrat"] = X_test["ordernum_oneyear"] / X_test["visitnum_oneyear"]
X_all = pd.concat([X_train, X_test])

# Decision tree test
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=666)
dt.fit(X_train, y_train)
pre = dt.predict(X_test)
pre_prob = dt.predict_proba(X_test)[:, 1]
pre_prob


4.2 Data standardization

from sklearn.preprocessing import MinMaxScaler

# Scale all features to [0, 1]
scaler = MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

# Decision tree test
dt = DecisionTreeClassifier(random_state=666)

5 Feature filtering

5.1 Feature selection: drop 30% of the columns

from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.metrics import roc_curve, auc

# Keep the 70% highest-scoring columns (the selector and scoring function are assumed here)
sp = SelectPercentile(f_classif, percentile=70).fit(X_train, y_train)
X_train = X_train.iloc[:, sp.get_support()]
X_test = X_test.iloc[:, sp.get_support()]

# Decision tree test
dt = DecisionTreeClassifier(random_state=666)
dt.fit(X_train, y_train)
dt.score(X_test, y_test)
pre = dt.predict(X_test)
pre_prob = dt.predict_proba(X_test)[:, 1]

fpr, tpr, threshold = roc_curve(y_test, pre_prob)
auc(fpr, tpr)


5.2 Collinearity/Data Correlation

import matplotlib.pyplot as plt
import seaborn as sns

# Collinearity: correlation above 0.9 is serious; merge or delete such variables
d = X_train.corr()
d[d < 0.9] = 0  # zero out weaker correlations so only the highly correlated pairs are displayed
plt.figure(figsize=(15, 15), dpi=200)
sns.heatmap(d)
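If one wanted to act on the 0.9 rule automatically, a common follow-up (my sketch, not shown in the original) is to drop one variable from each highly correlated pair:

import numpy as np

# Keep only the upper triangle so each pair is inspected once
corr = X_train.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
X_train = X_train.drop(columns=to_drop)
X_test = X_test.drop(columns=to_drop)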


6 Modeling and model evaluation

6.1 Logistic regression

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

lr = LogisticRegression()                # Instantiate an LR model
lr.fit(X_train, y_train)                 # Train the model
y_prob = lr.predict_proba(X_test)[:, 1]  # Predict the probability of class 1
y_pred = lr.predict(X_test)              # The model's predictions on the test set
fpr_lr, tpr_lr, threshold_lr = metrics.roc_curve(y_test, y_prob)
auc_lr = metrics.auc(fpr_lr, tpr_lr)     # AUC score
score_lr = metrics.accuracy_score(y_test, y_pred)  # Model accuracy
print("Model accuracy: {0}, AUC score: {1}".format(score_lr, auc_lr))
print("=" * 30)


6.2 Naive Bayes

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()                        # Instantiate a Gaussian naive Bayes model
gnb.fit(X_train, y_train)                 # Train the model
y_prob = gnb.predict_proba(X_test)[:, 1]  # Predict the probability of class 1
y_pred = gnb.predict(X_test)              # The model's predictions on the test set
fpr_gnb, tpr_gnb, threshold_gnb = metrics.roc_curve(y_test, y_prob)  # True positive rate, false positive rate, thresholds
auc_gnb = metrics.auc(fpr_gnb, tpr_gnb)   # AUC score
score_gnb = metrics.accuracy_score(y_test, y_pred)  # Model accuracy


6.3 Support vector machine

from sklearn.svm import SVC

# The value of C is illegible in the original; sklearn's default of 1.0 is assumed
svc = SVC(kernel='rbf', C=1.0, max_iter=10, random_state=666).fit(X_train, y_train)
y_prob = svc.decision_function(X_test)    # Distance to the decision boundary, used in place of probabilities
y_pred = svc.predict(X_test)              # The model's predictions on the test set
fpr_svc, tpr_svc, threshold_svc = metrics.roc_curve(y_test, y_prob)  # True positive rate, false positive rate, thresholds
auc_svc = metrics.auc(fpr_svc, tpr_svc)   # AUC score
score_svc = metrics.accuracy_score(y_test, y_pred)  # Model accuracy


6.4 Decision tree

dtc = DecisionTreeClassifier(random_state=666)  # Instantiate a decision tree
dtc.fit(X_train, y_train)                 # Train the model
y_prob = dtc.predict_proba(X_test)[:, 1]  # Predict the probability of class 1
y_pred = dtc.predict(X_test)              # The model's predictions on the test set
fpr_dtc, tpr_dtc, threshold_dtc = metrics.roc_curve(y_test, y_prob)  # True positive rate, false positive rate, thresholds


6.5 Random Forest

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=666)  # Create a random forest
rfc.fit(X_train, y_train)                 # Train the random forest model
y_prob = rfc.predict_proba(X_test)[:, 1]  # Predict the probability of class 1
y_pred = rfc.predict(X_test)              # The model's predictions on the test set
fpr_rfc, tpr_rfc, threshold_rfc = metrics.roc_curve(y_test, y_prob)  # True positive rate, false positive rate, thresholds
auc_rfc = metrics.auc(fpr_rfc, tpr_rfc)   # AUC score
score_rfc = metrics.accuracy_score(y_test, y_pred)  # Model accuracy


6.6 XGBoost

import xgboost as xgb

# Build the training and test DMatrix objects
dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test)

# Set the xgboost modeling parameters
params = {'booster': 'gbtree', 'objective': 'binary:logistic', 'eval_metric': 'auc'}

# Train the model
watchlist = [(dtrain, 'train')]
bst = xgb.train(params, dtrain, num_boost_round=500, evals=watchlist)

# Output the predicted probability of the positive class
y_prob = bst.predict(dtest)

# Set the threshold to 0.5 to get the test-set predictions
y_pred = (y_prob >= 0.5) * 1

# Get the true positive rate, false positive rate, and threshold
fpr_xgb, tpr_xgb, threshold_xgb = metrics.roc_curve(y_test, y_prob)
auc_xgb = metrics.auc(fpr_xgb, tpr_xgb)  # AUC score
score_xgb = metrics.accuracy_score(y_test, y_pred)  # Model accuracy
print('Model accuracy: {0}, AUC score: {1}'.format(score_xgb, auc_xgb))


6.7 Model comparison

# Overlay the ROC curves computed in the sections above (variable names from those sections)
for fpr, tpr, name in [(fpr_lr, tpr_lr, 'LR'), (fpr_gnb, tpr_gnb, 'GNB'),
                       (fpr_svc, tpr_svc, 'SVM'), (fpr_dtc, tpr_dtc, 'DT'),
                       (fpr_rfc, tpr_rfc, 'RF'), (fpr_xgb, tpr_xgb, 'XGB')]:
    plt.plot(fpr, tpr, label=name)
plt.legend()
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.savefig('Model comparison diagram.jpg', dpi=400, bbox_inches='tight')
plt.show()



6.8 Important Features

# Feature importances from the random forest, sorted in descending order
fea = pd.Series(dict(zip(X_train.columns, rfc.feature_importances_)))
fea.sort_values(ascending=False)
fea_s = fea.sort_values(ascending=False).index
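A quick way to visualize them is a horizontal bar chart of the top features (a sketch of mine; the article's own figure is not reproduced here):

# Plot the ten most important features
fea.sort_values(ascending=False)[:10].plot(kind='barh', figsize=(8, 6))
plt.xlabel('feature importance')
plt.show()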


6.9 Analysis of churn reasons

  • When the values of cityuvs and cityorders are small, user churn is significantly higher than average, indicating that the Ctrip platform lacks hotel information for small cities; users turn to competitors with more complete small-city hotel coverage, which causes churn.

  • Churn among users who visit between 7:00 and 19:00 is higher than average: weekday push notifications should avoid these hours.

  • For hotels with a business-attribute index between 0.3 and 0.9, churn is above average and rises with the index, suggesting a gap between high-business-index hotels and user expectations (prices too high, or other reasons?); churn for hotels with a low business-attribute index is lower.

  • The shorter the time since the last order within the year, the more serious the churn. This period coincides with the negative news about Ctrip from May 2015 to January 2016; the company should strengthen its own management and build a good public image.

  • Users with a low consumption power index (10-40) churn more seriously. This group accounts for about 50% of users and deserves close attention.

  • Users with a price sensitivity index of 5-25 churn more than average; they care about hotel quality.

  • Among users with a higher conversion rate, more orders within the year, and more historical orders, a larger proportion did not visit the order-filling page within 24 hours, and churn is more serious. Users should be given a good post-order follow-up experience: invite them to fill in check-in feedback, then collect and act on their opinions.

  • The fewer the advance-booking days, the more serious the churn; likewise, the more orders canceled within a year, the more serious the churn.

6.10 Recommendations:


  • Consider capturing market share in third- and fourth-tier cities and in the budget-hotel segment.

  • Users are easily affected by negative corporate news. It is recommended that the company take on its social responsibilities, strengthen internal management, handle public-relations issues promptly, and build a good brand image.

  • Start pushing popular attractions and hotels 2-3 weeks before holidays.

  • Track the hotel experience after an order is placed, invite users to fill in their check-in feedback, and organize user opinions to drive improvements.

7 Customer Portraits

7.1 Modeling user classification

# User portrait features
user_feature = ["decisionhabit_user", "starprefer", "lastpvgap", "sid",
                "ordernum_oneyear",  # name garbled in the original; assumed
                "historyvisit_visit_detailpagenum",
                "onyear_dealrat"]
# Churn-impact features (the original list is truncated here)
fea_lis = ["cityuvs", "cityorders", "h", "businessrate_pre2"]

# Data standardization: the K-means method works better on normally distributed data
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

scaler = StandardScaler()
lot_attributes = pd.DataFrame(scaler.fit_transform(all_cte), columns=all_cte.columns)  # all_cte: the user-feature frame, as named in the original

# Modeling the classification
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(lot_attributes)           # Train the model
k_char = kmeans.cluster_centers_     # Get the center of each cluster
plt.figure(figsize=(5, 10))
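The number of clusters was fixed at two here; as a sanity check one could compare several cluster counts with the silhouette score (a sketch, not part of the original analysis):

from sklearn.metrics import silhouette_score

# Higher silhouette scores indicate better-separated clusters
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(lot_attributes)
    print(k, silhouette_score(lot_attributes, labels))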


7.2 Proportion of user types

types = ['High-value users', 'Potential users']
num = pd.Series(kmeans.labels_).value_counts()  # per-cluster user counts (assumed source)
fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(num, radius=1, labels=types, autopct='%.2f%%')
ax.pie([1], radius=0.6, colors='w')  # white inner circle turns the pie into a donut
plt.savefig('user portrait.jpg', dpi=400, bbox_inches='tight')


7.3 High-value user analysis

High-value users account for 19.02%. Their visit and booking frequency is high, their consumption level is high, their customer value is high, they pursue quality, their hotel star-rating requirements are high, and the group is mostly concentrated among old customers.
Suggestions:
Recommend more reputable, cost-effective business and chain hotel listings to attract these users;
Push messages at the small daytime traffic peaks on non-working days, such as 11:00 and 17:00;
Provide customers with more hotel information for their travel destinations;
Raise the cost of churn: a membership points system and membership discount cards.

7.4 Analysis of potential users

Proportion: 80.98% The frequency of visits and reservations is low, the consumption level is low, the hotel star requirements are not high, the customer groups are mostly concentrated among new customers, and customer value needs to be tapped. Suggestions:
Because most new users are potential customers, it is recommended to grasp the initial user experience (such as discounts for initial purchases, check-in activities, etc.), and regularly push affordable hotels to such users to cultivate user consumption inertia;
The content pushed should mostly be about big sales, big promotions, low prices, etc.;
Since this group of users accounts for a relatively large proportion, the factors of lost customers can be analyzed based on the loss situation of this group, and the market of this group can be developed, and further analysis of sinking can be carried out to open up new time periods.

About the author

Lijie Zhang has strong logical thinking, considers problems comprehensively, is proficient in data cleaning, data preprocessing, plotting and visual presentation, is familiar with machine-learning libraries such as sklearn and xgboost for data mining and modeling, and has mastered machine-learning algorithms including linear regression, logistic regression, principal component analysis, clustering, decision trees, random forests, xgboost, SVM, and neural networks.


This article is excerpted from "Python User Churn Data Mining: Establishing Logistic Regression, XGBoost, Random Forest, Decision Tree, Support Vector Machine, Naive Bayes and K-means Clustering User Portraits". Click "Read the original text" to get the full text.
