Game Player Behavior Data Analysis and Prediction Based on Python


1. Project background and analysis objectives

1. Requirements and application scenarios

With the continuous development of the game industry, more and more game companies need to analyze their operating data to optimize operation strategies and improve user retention and revenue. Operations analysis helps companies understand user behavior, revenue sources, market trends, and more, and guides tailored marketing and user management strategies. Application scenarios for game operations analysis include:

1. Game companies optimize games based on user data

By analyzing data such as game behaviors, retention rates, and payment habits of different users, game companies can accurately locate the needs and habits of different user groups, provide personalized game services for different users, optimize user experience, and increase retention rates and revenue.

2. Tailor-made marketing strategy

Game companies can formulate more precise marketing strategies by analyzing market trends and the competition. Based on characteristics such as a user's region, age, gender, and game preferences, they can develop corresponding game products and marketing plans and optimize them for user needs.

3. Monitor the operation of the game and adjust the strategy in time

Game companies can conduct real-time monitoring based on user data and game data, grasp the operating status of the game, adjust strategies in a timely manner, and improve the user experience and profitability of the game.

2. Analysis goals

Taking a game operation situation analysis project as an example, the analysis objectives include:

1. User behavior analysis

By analyzing user data, including active time, retention time, level distribution, and the proportion of paying users, user behavior analysis gives game companies a data-driven view of user preferences and habits. Based on it, game companies can understand user needs more accurately, optimize game services, provide a better user experience, and increase retention and payment rates.

2. Analysis of game revenue sources

By analyzing revenue-source data, game companies can understand how different revenue sources relate to one another, identify the most important source of income and why it leads, and analyze and improve payment behavior on each channel. Such data helps companies adjust and optimize the distribution strategy across revenue sources.

In short, the analysis of game operation can help game companies understand the operation of the game, formulate reasonable countermeasures for different problems, optimize operation strategies, and improve profitability.

2. Dataset source and description

All the data in this course report comes from the public dataset on Data Castle; the link is

The dataset contains more than 20,000 records and 110 features in total. To facilitate the analysis, 11 features are selected:

Behavior features:

user_id: player unique ID
avg_online_minutes: average online minutes
pvp_battle_count: number of PvP battles (against other players)
pvp_lanch_count: number of PvP battles the player initiated
pvp_win_count: number of PvP battles the player won
pve_battle_count: number of PvE battles (against the computer)
pve_lanch_count: number of PvE battles the player initiated
pve_win_count: number of PvE battles won

Payment features (joined on user_id):

user_id: player unique ID
pay_price: recharge amount
pay_count: number of recharges
prediction_pay_price: predicted recharge amount

3. Application of big data analysis technology

1. Data preprocessing code, annotations and running results

1. Import datasets and libraries

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('./data/game_player.csv', encoding='gbk')

2. Slice out the required features

# Slice out the required features and name the result data
data = df[
    [
        'user_id',             # player unique ID
        'avg_online_minutes',  # average online minutes
        'pvp_battle_count',    # number of PvP battles
        'pvp_lanch_count',     # number of PvP battles the player initiated
        'pvp_win_count',       # number of PvP battles the player won
        'pve_battle_count',    # number of PvE battles
        'pve_lanch_count',     # number of PvE battles the player initiated
        'pve_win_count'        # number of PvE battles won
    ]
]

data


3. Delete missing values and deduplicate

# Remove missing values
print('Shape of the dataset before removing missing rows:', data.shape)
data_1 = data.dropna(axis=0, how='any')
print('Shape of the dataset after removing missing rows:', data_1.shape)

# Deduplicate with drop_duplicates, which preserves the DataFrame structure
# data1 = data_1['user_id'].drop_duplicates()
data1 = data_1.drop_duplicates()
print('Total number of player IDs after deduplication with drop_duplicates:', len(data1))

The processed dataset is named data1.

4. Correlation matrix of three features

# Pearson correlation matrix of the PvP battle count, PvP initiation count, and PvP win count
corr_data1 = data[['pvp_battle_count', 'pvp_lanch_count', 'pvp_win_count']].corr(method='pearson')
print('Pearson correlation matrix of the three PvP features:\n', corr_data1)

5. Slice out the required payment features and name the result data2
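A minimal sketch of this step, using the payment columns from the dataset description:

# Slice out the payment features and name the result data2 (sketch)
data2 = df[
    [
        'user_id',              # player unique ID
        'pay_price',            # recharge amount
        'pay_count',            # number of recharges
        'prediction_pay_price'  # predicted recharge amount
    ]
]

data2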

6. Min-max (deviation) standardization

# Min-max (deviation) standardization
# Custom min-max scaling function
def min_max_scale(col):
    return (col - col.min()) / (col.max() - col.min())

# Min-max scale the average online time
time_min_max = min_max_scale(data1['avg_online_minutes'])
print('Online time data before standardization:\n', data1['avg_online_minutes'])
print('Online time data after min-max standardization:\n', time_min_max)

7. Inner join, outer join, and saving the preprocessed dataset

# Merge data1 and data2 with an outer join
print('Shape of the data frame after the outer join:',
      pd.concat([data1, data2], axis=1, join='outer').shape)

# Merge data1 and data2 with an inner join
print('Shape of the data frame after the inner join:',
      pd.concat([data1, data2], axis=1, join='inner').shape)

data3 = pd.merge(data1, data2, how='inner', on='user_id')
data3.to_csv('./data/Wu Shuoqiu 202006180058.csv', sep=';', index=False)

2. Data exploration and feature construction

1. Player activity analysis

(1) Calculate the average online time of all players
avg_time = data3.avg_online_minutes.mean()
avg_time

(2) Calculate the average online time of paying players
pay_avg_time = data3[data3.pay_price > 0].avg_online_minutes.mean()
pay_avg_time

# Use equal-width discretization to examine the distribution of recharge counts
pay_cut = pd.cut(data2['pay_count'], 40)
print('Discretized distribution of recharge counts:\n', pay_cut.value_counts())

(3) Draw the player average online time boxplot

Draw a box plot of the average online time of all players

plt.figure(figsize=(10, 10))
plt.boxplot(data3.avg_online_minutes)
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False
plt.title('Box plot of average online time of all players')
plt.show()

Draw a boxplot of the average online time of paid players

plt.figure(figsize=(10, 10))
plt.boxplot(data3[data3.pay_price > 0].avg_online_minutes)
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False
plt.title('Box plot of average online time of paying players')
plt.show()

Average online time of players who took part in at least one PvP battle

pvp_avg_time = data3[data3.pvp_battle_count > 0].avg_online_minutes.mean()

Evaluation

The average online time of all players is 9.6 minutes, while paying players average 135.8 minutes online, roughly 14 times the overall average. Paying players are clearly more active.

2. Player payment rate analysis

(1) Obtain the number of players whose payment count exceeds 0 (a sketch follows below)

(2) Draw a pie chart (the code appears in section 6)
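A minimal sketch of the count, assuming the data3 frame built above; the pie chart itself is drawn in section 6:

# Number of players with at least one recharge (sketch)
pay_num = (data3.pay_count > 0).sum()
print('Number of paying players:', pay_num)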

3. Player payment analysis and correlation exploration

(1) Define HY, total_pay, HY_AVG, HY_PAY_COUNT, PAY_AVG, and PAY_PRO, meaning the number of active players, total revenue, average revenue per active player, number of active paying players, average revenue per paying player, and payment rate, respectively. A sketch of the computation follows below.
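A minimal sketch of these metrics, assuming "active" means more than 10 average online minutes (as elsewhere in this report) and reusing pay_num from the previous section:

# Payment metrics (sketch)
active = data3[data3.avg_online_minutes > 10]
HY = len(active)                                  # number of active players
total_pay = data3.pay_price.sum()                 # total revenue
HY_AVG = total_pay / HY                           # average revenue per active player
HY_PAY_COUNT = len(active[active.pay_price > 0])  # number of active paying players
PAY_AVG = total_pay / pay_num                     # average revenue per paying player
PAY_PRO = pay_num / len(data3)                    # payment rate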

(2) The relationship between active players and recharge amount

The payment rate of this game is low, leaving room for improvement; targeted activities could be run to raise it.
The per-capita spending of paying players in this game is 32, which shows that paying users have strong spending power overall. Follow-up analysis of paying users can help ensure they keep paying.

4. Analysis of players’ gaming habits

(1)

Average PvP battle count of active users

HY_pvp_battle_coun = data3[data3.avg_online_minutes > 10].pvp_battle_count.mean()

Total PvP battle count of active users

HY_count_pvp = data3[data3.avg_online_minutes > 10].pvp_battle_count.sum()

Number of PvP battles initiated by active users

HY_count_lanch_pvp = data3[data3.avg_online_minutes > 10].pvp_lanch_count.sum()

Probability that active users initiate PvP

HY_rate_lanch_pvp = HY_count_lanch_pvp / HY_count_pvp

Total PvP win count of active users

HY_num_win_pvp = data3[data3.avg_online_minutes > 10].pvp_win_count.sum()

PvP win probability of active users

HY_rate_win_pvp = HY_num_win_pvp / HY_count_pvp

print(f'Average PvP battle count of active users: {HY_pvp_battle_coun}')
print(f'Probability of active users initiating PvP: {HY_rate_lanch_pvp}')
print(f'PvP win probability of active users: {HY_rate_win_pvp}')

(2)

Average PvE battle count of active users

HY_pve_battle_coun = data3[data3.avg_online_minutes > 10].pve_battle_count.mean()

Total PvE battle count of active users

HY_count_pve = data3[data3.avg_online_minutes > 10].pve_battle_count.sum()

Number of PvE battles initiated by active users

HY_count_lanch_pve = data3[data3.avg_online_minutes > 10].pve_lanch_count.sum()

Probability that active users initiate PvE

HY_rate_lanch_pve = HY_count_lanch_pve / HY_count_pve

Total PvE win count of active users

HY_num_win_pve = data3[data3.avg_online_minutes > 10].pve_win_count.sum()

PvE win probability of active users

HY_rate_win_pve = HY_num_win_pve / HY_count_pve

print(f'Average PvE battle count of active users: {HY_pve_battle_coun}')
print(f'Probability of active users initiating PvE: {HY_rate_lanch_pve}')
print(f'PvE win probability of active users: {HY_rate_win_pve}')

(3)

Average PvP battle count of active paying users

HY_PAY_COUNT_pvp_battle_coun = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pvp_battle_count.mean()

Total PvP battle count of active paying users

HY_PAY_COUNT_count_pvp = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pvp_battle_count.sum()

Number of PvP battles initiated by active paying users

HY_PAY_COUNT_count_lanch_pvp = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pvp_lanch_count.sum()

Probability that active paying users initiate PvP

HY_PAY_COUNT_rate_lanc_pvp = HY_PAY_COUNT_count_lanch_pvp / HY_PAY_COUNT_count_pvp

Total PvP win count of active paying users

HY_PAY_COUNT_num_win_pvp = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pvp_win_count.sum()

PvP win probability of active paying users

HY_PAY_COUNT_rate_win_pvp = HY_PAY_COUNT_num_win_pvp / HY_PAY_COUNT_count_pvp

print(f'Average PvP battle count of active paying users: {HY_PAY_COUNT_pvp_battle_coun}')
print(f'Probability of active paying users initiating PvP: {HY_PAY_COUNT_rate_lanc_pvp}')
print(f'PvP win probability of active paying users: {HY_PAY_COUNT_rate_win_pvp}')

(4)

Average PvE battle count of active paying users

HY_PAY_COUNT_pve_battle_coun = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pve_battle_count.mean()

Total PvE battle count of active paying users

HY_PAY_COUNT_count_pve = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pve_battle_count.sum()

Number of PvE battles initiated by active paying users

HY_PAY_COUNT_count_lanch_pve = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pve_lanch_count.sum()

Probability that active paying users initiate PvE

HY_PAY_COUNT_rate_lanc_pve = HY_PAY_COUNT_count_lanch_pve / HY_PAY_COUNT_count_pve

Total PvE win count of active paying users

HY_PAY_COUNT_num_win_pve = data3[(data3.avg_online_minutes > 10) & (data3.pay_price > 0)].pve_win_count.sum()

PvE win probability of active paying users

HY_PAY_COUNT_rate_win_pve = HY_PAY_COUNT_num_win_pve / HY_PAY_COUNT_count_pve

print(f'Average PvE battle count of active paying users: {HY_PAY_COUNT_pve_battle_coun}')
print(f'Probability of active paying users initiating PvE: {HY_PAY_COUNT_rate_lanc_pve}')
print(f'PvE win probability of active paying users: {HY_PAY_COUNT_rate_win_pve}')

Visualization (the grouped bar chart comparing these rates is drawn in section 6)

Comments

1) Active paying players average more PvP and PvE battles than active players as a whole, so active paying players are more willing to spend time in this game;
2) In PvP battles, the win rate of active paying players is far higher than that of active players overall, which suggests that the game's items let active paying players (APA) enjoy the fun of winning battles;

3. Source code, annotations, and running results of classification model construction and evaluation

1. First construct a feature correlation heatmap to understand the relationships among the features.

This part needs to build both a regression model and a classification model and compare them, so understanding the relationships among the features is particularly important. Before that, create a new binary feature that flags players whose online time is below half of the average online time of all players, and append it to data3 as the last column. A sketch follows below.
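A minimal sketch, assuming the new column is named 'feature' and seaborn is available for the heatmap:

# New binary label: 1 if a player's online time is below half the overall average (sketch)
data3['feature'] = (data3.avg_online_minutes < data3.avg_online_minutes.mean() / 2).astype(int)

# Correlation heatmap of all features
import seaborn as sns
plt.figure(figsize=(12, 10))
sns.heatmap(data3.corr(), annot=True, cmap='coolwarm')
plt.title('Feature correlation heatmap')
plt.show()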

2. Dataset division for model construction

Select the new feature created in the previous section for analysis.

This part splits the data into features and a label, splits the training and test sets, and standardizes the dataset so that the algorithms can use it directly.
A confusion-matrix helper is also defined here for evaluation.

Split the data and labels

data3_data = data3.iloc[:, :-1]
data3_target = data3.iloc[:, -1]
#Split training set and test set
from sklearn.model_selection import train_test_split
data3_data_train, data3_data_test, data3_target_train, data3_target_test = train_test_split(data3_data, data3_target, test_size=0.2, random_state=66)

Standardize the dataset

from sklearn.preprocessing import StandardScaler
stdScale = StandardScaler().fit(data3_data_train)
data3_trainScaler = stdScale.transform(data3_data_train)
data3_testScaler = stdScale.transform(data3_data_test)

# Confusion matrix and derived metrics
from sklearn.metrics import confusion_matrix

def test_pre(pred):
    hx = confusion_matrix(data3_target_test, pred)
    print('Confusion matrix:\n', hx)

    # Precision
    P = hx[1, 1] / (hx[0, 1] + hx[1, 1])
    print('Precision:', round(P, 3))

    # Recall
    R = hx[1, 1] / (hx[1, 0] + hx[1, 1])
    print('Recall:', round(R, 3))

    # F1 score
    F1 = 2 * P * R / (P + R)
    print('F1 score:', round(F1, 3))

Undersampling (a sketch follows below)
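A minimal pandas sketch, assuming the 'feature' column from the previous section is the class label; in practice the train/test split would be redone on the balanced frame:

# Randomly undersample the majority class to the size of the minority class (sketch)
majority = data3[data3['feature'] == 0]
minority = data3[data3['feature'] == 1]
majority_down = majority.sample(n=len(minority), random_state=66)
data3_balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=66)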

2. Build a classification model with the SVM algorithm, evaluate it, and draw the ROC curve

Use the SVM algorithm to predict on the dataset and display the first 20 predictions (a sketch of the model code follows below). The prediction yielded 2,125 correct and 69 incorrect results, an accuracy of about 97%.
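A minimal sketch with scikit-learn's SVC; the kernel choice is an assumption:

# Train an SVM classifier on the standardized data and evaluate it (sketch)
from sklearn.svm import SVC

svm = SVC(kernel='rbf', random_state=66).fit(data3_trainScaler, data3_target_train)
svm_pred = svm.predict(data3_testScaler)
print('First 20 predictions:', svm_pred[:20])
test_pre(svm_pred)  # confusion matrix, precision, recall, F1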

Evaluation

The per-class F1 scores are 0.98 and 0.94.

Draw the ROC curve
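A minimal sketch, assuming decision_function scores from the SVC above:

# ROC curve for the SVM model (sketch)
from sklearn.metrics import roc_curve, auc

scores = svm.decision_function(data3_testScaler)
fpr, tpr, _ = roc_curve(data3_target_test, scores)
plt.plot(fpr, tpr, label=f'AUC = {auc(fpr, tpr):.3f}')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve of the SVM model')
plt.legend()
plt.show()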

3. Construct and evaluate using Gaussian Naive Bayes

The procedure is consistent with the SVM approach; the only difference is the algorithm. A sketch follows below.
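A minimal sketch, reusing the split and the test_pre helper from above:

# Train a Gaussian naive Bayes classifier on the same split (sketch)
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit(data3_trainScaler, data3_target_train)
gnb_pred = gnb.predict(data3_testScaler)
test_pre(gnb_pred)  # confusion matrix, precision, recall, F1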

Model evaluation

Compared with SVM, this model's accuracy is lower, at 75%. According to the classification report, the per-class precision is 0.98 and 0.54 and the per-class F1 scores are 0.80 and 0.70, a clear gap from the SVM model; every figure is lower.

Draw the ROC curve

4. Source code, annotations and running results of regression model construction and evaluation

1. Divide the dataset

The feature of this experiment is the recharge count (pay_count).
This part splits the data and labels, splits the training and test sets, and standardizes the dataset so that the algorithm can use it directly.

2. Random forest regression model construction

Using the random forest regression algorithm; a sketch follows below.
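A minimal sketch with scikit-learn's RandomForestRegressor (n_estimators is an assumption); it also defines the y_pred used in the plot below:

# Train a random forest regressor and predict on the test set (sketch)
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=66)
rf.fit(data3_trainScaler, data3_target_train)
y_pred = rf.predict(data3_testScaler)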

Draw a visualization of the regression results

plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
fig = plt.figure(figsize=(12, 6))
plt.plot(range(data3_target_test.shape[0]), list(data3_target_test), color='blue')
plt.plot(range(data3_target_test.shape[0]), y_pred, color='red', linewidth=2.5, linestyle='-.')
plt.xlabel('sample index')
plt.ylabel('target value')
plt.legend(['real result', 'predicted result'])
plt.show()

Print and view the regression report

Import the evaluation metrics and inspect the forest model's scores; a sketch follows below.
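A minimal sketch of the report, using standard scikit-learn regression metrics:

# Regression evaluation metrics for the random forest model (sketch)
from sklearn.metrics import r2_score, explained_variance_score, mean_absolute_error

print('R squared:', round(r2_score(data3_target_test, y_pred), 3))
print('Explained variance:', round(explained_variance_score(data3_target_test, y_pred), 3))
print('Mean absolute error:', round(mean_absolute_error(data3_target_test, y_pred), 3))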

3. Support vector regression model construction

Model building (a sketch follows below)
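A minimal sketch with scikit-learn's SVR; the kernel choice is an assumption:

# Train a support vector regressor on the same split (sketch)
from sklearn.svm import SVR

svr = SVR(kernel='rbf')
svr.fit(data3_trainScaler, data3_target_train)
svr_pred = svr.predict(data3_testScaler)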

Results visualization

Print and view regression report

Evaluate and compare the two models

Compared with the support vector regression model, the random forest model has a higher R squared of 0.84, and its explained variance is also 0.84.

5. Comparison and explanation of various model analysis results

Evaluate and compare the two regression models

Compared with the support vector regression model, the random forest model has a higher R squared of 0.84, and its explained variance is also 0.84.

Comparative evaluation of the classification models

Compared with SVM, the Gaussian naive Bayes model's accuracy is lower, at 75%. According to the classification report, the per-class precision is 0.98 and 0.54 and the per-class F1 scores are 0.80 and 0.70, a clear gap from the SVM model; every figure is lower.

6. Application of data visualization technology

1. Source code, running results, and brief description of the first data visualization technique

Box plots clearly display five summary statistics: the minimum, the lower quartile, the median, the upper quartile, and the maximum, and they are concise and easy to read. Since this visualization analyzes average online time and calls for comparing several summary indicators at once, a box plot is a good fit.

Draw a boxplot of the average online time of paying players

plt.figure(figsize=(10, 10))
plt.boxplot(data3[data3.pay_price > 0].avg_online_minutes)
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False
plt.title('Box plot of average online time of paying players')
plt.show()

2. Source code, running results, and brief description of the second data visualization technique

The advantage of a pie chart is that it emphasizes proportions and is easy to read. Here it compares the numbers of paying and non-paying players; for such a simple breakdown, a pie chart shows the proportion and the gap between the two at a glance.

Draw the percentage pie chart

plt.figure(figsize=(8, 8))

# Draw the pie
patches, l_text, p_text = plt.pie([22877 - pay_num, pay_num],
                                  labels=['unpaid', 'paid'],
                                  labeldistance=0.3,
                                  colors=['#87CEFA', '#FFC0CB'],
                                  explode=[0.01, 0.05],
                                  autopct='%1.1f%%',
                                  pctdistance=1.15)

# Set the label font size
for t in l_text:
    t.set_size(20)

# Set the percentage font size
for t in p_text:
    t.set_size(20)

# Set the title
plt.title('The ratio of paid users to all users', size=25)
plt.show()

3. Source code, running results, and brief description of the third data visualization technique

The advantage of a bar chart is that numerical differences between categories are encoded as bar heights, and a single simple plot makes comparison efficient. It is therefore easy to place pairwise comparisons of each metric in one figure to show the gaps.

plt.figure(figsize=(15, 8))

# Active players vs. active paying players
plt.bar([0.75, 2.75, 4.75, 6.75],
        [HY_rate_lanch_pve, HY_rate_win_pve, HY_rate_lanch_pvp, HY_rate_win_pvp],
        width=0.5, alpha=0.5, label='active player')
plt.bar([1.25, 3.25, 5.25, 7.25],
        [HY_PAY_COUNT_rate_lanc_pve, HY_PAY_COUNT_rate_win_pve, HY_PAY_COUNT_rate_lanc_pvp, HY_PAY_COUNT_rate_win_pvp],
        width=0.5, color='red', alpha=0.5, label='active paying player')
plt.xticks([1, 3, 5, 7],
           ['Probability of initiating PvE', 'PvE win probability', 'Probability of initiating PvP', 'PvP win probability'])
plt.legend()
plt.show()

7. Course conclusion and experience

This course report summarizes and integrates the knowledge learned in class, most importantly data preprocessing, exploratory analysis, and the construction of regression and classification models, each of which I now understand more deeply.

Data preprocessing is an important step in any data analysis, machine learning, or deep learning project. Its purpose is to ensure a complete, standardized, clear, and consistent dataset so that subsequent analysis, model fitting, and prediction can be carried out well. Data cleaning is the first step of preprocessing and includes removing duplicates, handling missing values, and treating outliers. Feature processing is another important part: selecting, extracting, and transforming data features. After the cleaning step, the data must be organized and refined features extracted. Successful preprocessing matters before the large-scale analysis and machine learning phases because it improves accuracy and separates the signal that drives the final results from noise.
Exploratory analysis, performed before the preprocessing and modeling stages, drills down into the details of the dataset to gather as much information and as many potential leads as possible, so that the subsequent algorithm and modeling work can be planned efficiently and risk reduced.
Regression models are a common supervised learning task for predicting numerical variables; they not only improve the accuracy and precision of predictions but also play an important role in many practical problems. Understanding the different types of regression models and their evaluation criteria, and mastering optimization strategies, leads to better model selection, construction, and validation, and to better predictive results.
Classification models are likewise a common supervised learning task; they not only improve classification accuracy but also play an important role in many practical problems. It is important to understand the types of classification models and their evaluation metrics, and to master optimization techniques and methods.
As for the takeaways from this project, the most important one is attention to detail. For example, for deduplication during preprocessing, drop_duplicates must be used rather than converting to a list or the like, because only drop_duplicates preserves the DataFrame structure. The second concerns the classification part: before prediction, a correlation heatmap must be drawn to inspect the correlation between the features for an intuitive understanding.