Python data analysis and mining practice (household water heater user behavior analysis and event recognition)

1. Introduction

In the process of using household water heaters, residents will form different usage habits due to regional climate, different regions, and user age and gender differences. If home furnishing companies can deeply understand the usage habits of their products in different user groups, and develop functions that meet customer needs and habits, they will be able to open up new markets.

This article will build a bathing event recognition model based on the BP neural network algorithm, and then identify the bathing events of users in different regions, and then compare different customer groups according to the recognition results to provide the most suitable personalized products and improve the intelligent research and development of new products And formulate corresponding marketing strategies.

2. Data exploration and analysis

Explore and analyze the flow status of the water heater, among which the attributes of “whether there is water flow” and “water flow” can most intuitively reflect the water flow of the water heater, and explore and analyze these two attributes.

import pandas as pd
import matplotlib.pyplot as plt

inputfile ='./data/original_data.xls' #'./original_data.xls' # input data file
data = pd.read_excel(inputfile) # read data

# Check the distribution with or without water flow
# Data Extraction
lv_non = pd.value_counts(data['with or without water flow'])['no']
lv_move = pd.value_counts(data['with or without water'])['with']
# draw bar chart
 
fig = plt.figure(figsize = (6 ,5)) # set the canvas size
plt.rcParams['font.sans-serif'] = 'SimHei' # set Chinese display
plt.rcParams['axes.unicode_minus'] = False
plt.bar([0,1], height=[lv_non,lv_move], width=0.4, alpha=0.8, color='skyblue')
plt.xticks([index for index in range(2)], ['no','yes'])
plt.xlabel('water flow status')
plt.ylabel('record number')
plt.title('Number of records of different water flow states (3001)')
plt. show()
plt. close()

# View water flow distribution
water = data['water flow']
# Draw a box diagram of water flow distribution
fig = plt.figure(figsize = (5 ,8))
plt.boxplot(water,
            patch_artist=True,
            labels = ['water flow'], # set x-axis title
            boxprops = {'facecolor':'lightblue'}) # set fill color
plt.title('Water flow distribution box plot (3001)')
# Display the bottom line of the y-axis
plt. grid(axis='y')
plt. show()

Get the following bar and box plots

It can be seen that there are obviously more records in the state of no water flow than in the state of water flow. The box is close to 0, indicating that there are many records of no water flow, and the distribution of water flow is consistent with the distribution of water flow state.

3. Data preprocessing

3.1 Attribute reduction

Water heaters collect a lot of water consumption data, and the main object of analysis is water heater users, so redundant attributes will be deleted, for example: “water heater number”, “whether there is water flow”, “energy-saving mode”.

import pandas as pd
import numpy as np
data = pd.read_excel('./data/original_data.xls')
print('The initial data shape is:', data.shape)
# Delete the water heater number, whether there is water flow, energy-saving mode attributes
data.drop(labels=["water heater number","with or without water flow","energy-saving mode"],axis=1,inplace=True)
print('The shape of the data after deleting redundant attributes is:', data.shape)
data.to_csv('./data/water_heart.csv', index=False)

The result is as follows

3.2 Classification of water use events

Water heater users’ water consumption data is stored in the database, which records various water consumption events, including bathing, washing hands, brushing teeth, washing face, laundry, washing vegetables, etc. What continuous data should be divided in a large number of state records in this case When a complete water use event. The time division of water use is mainly divided into two steps, that is, to determine the time interval of a single water use, and to calculate the time of two adjacent records.

# read data
data = pd.read_csv('./data/water_heart.csv')
# Divide water usage events
threshold = pd.Timedelta('4 min') # The threshold is 4 minutes
data['time of occurrence'] = pd.to_datetime(data['time of occurrence'], format = '%Y%m%d%H%M%S') # convert time format
data = data[data['water flow'] > 0] # As long as the flow is greater than 0 records
sjKs = data['Occurrence time'].diff() > threshold # Adjacent time forward difference, compare whether it is greater than the threshold
sjKs.iloc[0] = True # Let the first time be the start event of the first water use event
sjJs = sjKs.iloc[1:] # The result of backward difference
sjJs = pd.concat([sjJs,pd.Series(True)]) # Let the last time be the end time of the last water event
# Create a data frame and define the sequence of water events
sj = pd.DataFrame(np.arange(1,sum(sjKs) + 1),columns = ["event number"])
sj["Event start number"] = data.index[sjKs == 1] + 1 # Define the start number of water use events
sj["event termination number"] = data.index[sjJs == 1] + 1 # Define the termination number of the water use event
print('When the threshold is 4 minutes, the number of events is:', sj.shape[0])
sj.to_csv('./data/sj.csv', index = False)

The result is as follows

3.3 Determine the duration threshold of a single water use event

For the data of a water heater user, the water use events are divided according to different thresholds to obtain the corresponding number of events, and the number of time is obtained by threshold changes and divisions. Count the number of water use events under each threshold, and then find the optimal threshold through threshold optimization.

# Determine the duration threshold of a single water use event
n = 4 # use the average slope of the next four points
threshold = pd.Timedelta(minutes=5) # expert threshold
data['time of occurrence'] = pd.to_datetime(data['time of occurrence'], format='%Y%m%d%H%M%S')
data = data[data['water flow'] > 0] # As long as the flow is greater than 0 records
# Custom function: input the time threshold for dividing time, and get the number of divided events
def event_num(ts):
    d = data['Occurrence time'].diff() > ts # Make a difference between adjacent times, and compare whether it is greater than the threshold
    return d.sum() + 1 # This directly returns the number of events
dt = [pd.Timedelta(minutes=i) for i in np.arange(1, 9, 0.25)]
h = pd.DataFrame(dt, columns=['threshold']) # convert the data frame and define the threshold column
h['number of events'] = h['threshold'].apply(event_num) # Calculate the number of events corresponding to each threshold
h['slope'] = h['number of events'].diff()/0.25 # Calculate the slope corresponding to every two adjacent points
h['slope index']= h['slope'].abs().rolling(4).mean() # Take the average of the absolute value of n slopes forward as the slope index
ts = h['threshold'][h['slope indicator'].idxmin() - n]
# Use idxmin to return the index of the minimum value, because rolling_mean() calculates the absolute value average of the first n slopes
# So the result is translated (-n)
if ts > threshold:
    ts = pd.Timedelta(minutes=4)
print('The threshold value of the calculated single water use time is:',ts)

The result is as follows

3.4 Attribute construction

1) Construct water use duration and frequency attributes

The duration of water use for different water use events is one of the basic attributes. The attributes related to the duration of water use can only distinguish some water use events, and the water use pauses and frequencies of different water use events are also different. Build water usage duration and water usage frequency attributes.

data = pd.read_csv('./data/water_heart.csv', encoding='utf-8') # read water heater usage data records
sj = pd.read_csv('./data/sj.csv') # read water usage event records
# convert time format
data["occurrence time"] = pd.to_datetime(data["occurrence time"], format="%Y%m%d%H%M%S")

# Structural features: total water usage time
timeDel = pd. Timedelta("0.5 sec")
sj["event start time"] = data.iloc[sj["event start number"]-1,0].values- timeDel
sj["event end time"] = data.iloc[sj["event end number"]-1,0].values + timeDel
sj['bathing time'] = [i.hour for i in sj["event start time"]]
sj["total water consumption time"] = np.int64(sj["event end time"] - sj["event start time"])/1000000000 + 1

# Construct water pause event
# Construct features "pause start time", "pause end time"
# Pause start time means from water flow to no water flow, pause end time means from no water flow to water flow
for i in range(len(data)-1):
    if (data.loc[i,"water flow"] != 0) & amp; (data.loc[i + 1,"water flow"] == 0) :
        data.loc[i + 1,"Pause start time"] = data.loc[i + 1, "Occurrence time"] - timeDel
    if (data.loc[i,"water flow"] == 0) & amp; (data.loc[i + 1,"water flow"] != 0) :
        data.loc[i,"pause end time"] = data.loc[i, "occurrence time"] + timeDel
        
# Extract the line numbers corresponding to the start time and end time of the pause, and put them in the data frame Stop
indStopStart = data.index[data["stop start time"].notnull()] + 1
indStopEnd = data.index[data["stop time"].notnull()] + 1
Stop = pd.DataFrame(data={"Stop start number":indStopStart[:-1],
                            "Stop stop number":indStopEnd[1:]})
# Calculate the pause duration and put it in the data frame stop, pause duration = pause end time - pause end time
Stop["Stop duration"] = np.int64(data.loc[indStopEnd[1:]-1,"Stop end time"].values-
                     data.loc[indStopStart[:-1]-1,"stop start time"].values)/1000000000
# Match each pause with an event, the start time of the pause must be greater than the start time of the event,
# And the end time of the pause is less than the end time of the event
for i in range(len(sj)):
    Stop.loc[(Stop["Stop start number"] > sj.loc[i,"Event start number"]) & amp;
           (Stop["stop number"] < sj.loc[i,"event stop number"]),"stop attribution event"]=i + 1
             
# Delete events with a pause count of 0
Stop = Stop[Stop["Stop Attributed Event"].notnull()]

# Structural features The total duration of pauses in water events, the number of pauses, the average duration of pauses,
# Duration of water use, water usage/total duration
stopAgg = Stop.groupby("Stop attribution event").agg({"Stop duration":sum,"Stop start number":len})
sj.loc[stopAgg.index - 1,"total stop duration"] = stopAgg.loc[:,"stop duration"].values
sj.loc[stopAgg.index-1,"stop times"] = stopAgg.loc[:,"stop start number"].values
sj.fillna(0,inplace=True) # use 0 interpolation for missing values
stopNo0 = sj["stop times"] != 0 # Determine whether there is a stop in the water event
sj.loc[stopNo0,"average stop duration"] = sj.loc[stopNo0,"total stop duration"]/sj.loc[stopNo0,"stop times"]
sj.fillna(0,inplace=True) # use 0 interpolation for missing values
sj["Water use time"] = sj["Total water use time"] - sj["Total stop time"] #Define the characteristic water use time
sj["Water usage/total duration"] = sj["Water duration"] / sj["Total usage duration"] #Define features Water usage/Total duration
print('After the construction of water use time and frequency features of water use events, the characteristics of the data are:\\
', sj.columns)
print('The first 5 rows and 5 columns of the data after the completion of the construction of water use time and frequency features are:\\
',sj.iloc[:5,:5])

The result is as follows

2) Construct water consumption and fluctuation attributes

In addition to the duration, pause, and frequency of water use, water consumption is also an important attribute to identify whether the event is a bathing event. At the same time, fluctuations in water use are also the key to distinguish different water use events. Construct water consumption and water fluctuation attributes on the basis of water use duration and frequency attributes.

data["water flow"] = data["water flow"] / 60 # The original unit is L/min, now converted to L/sec
sj["total water consumption"] = 0 # Assign an initial value of 0 to the total water consumption
for i in range(len(sj)):
    Start = sj.loc[i,"event start number"]-1
    End = sj.loc[i,"event termination number"]-1
    if Start != End:
        for j in range(Start,End):
            if data.loc[j,"water flow"] != 0:
                sj.loc[i,"total water consumption"] = (data.loc[j + 1,"occurrence time"] -
                                    data.loc[j,"Occurrence time"]).seconds* \
                                    data.loc[j,"water flow"] + sj.loc[i,"total water consumption"]
        sj.loc[i,"total water consumption"] = sj.loc[i,"total water consumption"] + data.loc[End,"water flow"] * 2
    else:
        sj.loc[i,"total water consumption"] = data.loc[Start,"water flow"] * 2
        
sj["Average Water Flow"] = sj["Total Water Consumption"] / sj["Water Time Length"] #Definition Features Average Water Flow
# Structural features: water flow fluctuations
# Fluctuation of water flow=∑((((value of single water flow-average water flow)^2)*duration)/duration of water use
sj["water flow fluctuation"] = 0 # assign an initial value of 0 to the water flow fluctuation
for i in range(len(sj)):
    Start = sj.loc[i,"event start number"] - 1
    End = sj.loc[i,"event end number"] - 1
    for j in range(Start,End + 1):
        if data.loc[j,"water flow"] != 0:
            slbd = (data.loc[j,"water flow"] - sj.loc[i,"average water flow"])**2
            slsj = (data.loc[j + 1,"time of occurrence"] - data.loc[j,"time of occurrence"]).seconds
            sj.loc[i,"water flow fluctuation"] = slbd * slsj + sj.loc[i,"water flow fluctuation"]
    sj.loc[i,"Water flow fluctuation"] = sj.loc[i,"Water flow fluctuation"] / sj.loc[i,"Water duration"]

# Structural features: pause duration fluctuates
# Pause duration fluctuation=∑(((single pause duration-average pause duration)^2)*duration)/total pause duration
sj["pause duration fluctuation"] = 0 # assign an initial value of 0 to the pause duration fluctuation
for i in range(len(sj)):
    if sj.loc[i,"Number of Pauses"] > 1: # When the number of pauses is 0 or 1, the fluctuation value of pause duration is 0, so it is excluded
        for j in Stop.loc[Stop["Stop attribution event"] == (i + 1),"Stop duration"].values:
            sj.loc[i,"pause duration fluctuation"] = ((j - sj.loc[i,"average pause duration"])**2) * j + \
                                     sj.loc[i,"pause duration fluctuation"]
        sj.loc[i,"pause duration fluctuation"] = sj.loc[i,"pause duration fluctuation"] / sj.loc[i,"total pause duration"]

print('The characteristics of the data after the construction of water consumption and fluctuation characteristics are:\\
', sj.columns)
print('The features of the first 5 rows and 5 columns of the data after the construction of water consumption and fluctuation features are:\\
',sj.iloc[:5,:5])

The result is as follows

3.5 Screening Candidate Bathing Events

The identification of a bathing event is based on the identification of a water use event, that is, identifying which water use events are bathing events from the divided water use events. Candidate bathing events are screened based on the constructed water duration and water consumption attributes.

sj_bool = (sj['Water use time'] >100) & amp; (sj['Total water use time'] > 120) & amp; (sj['Total water use'] > 5)
sj_final = sj.loc[sj_bool,:]
sj_final.to_excel('./data/sj_final.xlsx',index=False)
print('The shape of the data before filtering out the candidate bathing event is:', sj.shape)
print('The shape of the data after filtering out candidate bathing events is:', sj_final.shape)

The result is as follows

Before screening, the total number of water use events was 172, and after screening, there were 75 water use events remaining.

4. Model building

Based on the modeling sample data, a BP neural network model is established to identify bathing events. The construction of BP neural network model needs to pay attention to the magnitude difference between the attributes of the data itself, so it needs to be standardized to eliminate the magnitude difference. In order to facilitate the subsequent application of the model, the model can be saved with the joblib.dump function.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.externals import joblib

# read data
Xtrain = pd.read_excel('./data/sj_final.xlsx')
ytrain = pd.read_excel('./data/water_heater_log.xlsx')
test = pd.read_excel('./data/test_data.xlsx')
# Separate the training set from the test set.
x_train, x_test, y_train, y_test = Xtrain.iloc[:,5:],test.iloc[:,4:-1],\
                                   ytrain.iloc[:,-1],test.iloc[:,-1]
# normalize
stdScaler = StandardScaler(). fit(x_train)
x_stdtrain = stdScaler. transform(x_train)
x_stdtest = stdScaler. transform(x_test)
# Modeling
bpnn = MLPClassifier(hidden_layer_sizes = (17,10), max_iter = 200, solver = 'lbfgs', random_state=50)
bpnn.fit(x_stdtrain, y_train)
# save the model
joblib.dump(bpnn,'./data/water_heater_nnet.m')
print('The model built is:\\
',bpnn)

The result is as follows

When training the BP neural network, the parameters of the neural network were optimized, and it was found that the training effect of the neural network with two hidden layers was better, and the training effect was better when the hidden nodes of the two hidden layers were 17 and 10 respectively. good.

5. Model checking

Combined with the relevant knowledge of model evaluation, it is more objective and accurate to use precision rate (precision), recall rate (recall) and F1 value to measure the effect of model evaluation. At the same time, combined with the ROC curve, the effect of the model can be evaluated more intuitively.

# model evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.externals import joblib
import matplotlib.pyplot as plt

bpnn = joblib.load('./data/water_heater_nnet.m') # load the model
y_pred = bpnn.predict(x_stdtest) # return the prediction result
print('neural network prediction result evaluation report:\\
', classification_report(y_test,y_pred))
# draw roc curve
plt.rcParams['font.sans-serif'] = 'SimHei' # display Chinese
plt.rcParams['axes.unicode_minus'] = False # display negative sign
fpr, tpr, thresholds = roc_curve(y_pred,y_test) # Find TPR and FPR
plt.figure(figsize=(6,4)) # create canvas
plt.plot(fpr,tpr) # draw the curve
plt.title('User water event identification ROC curve (3001)') # title
plt.xlabel('FPR') # x-axis label
plt.ylabel('TPR') # y-axis label
plt.savefig('./data/User water event recognition ROC curve.png') # save the picture
plt.show() # display graphics

The result is as follows

It can be obtained that the precision rate (precision) in bathing event recognition is very high, as high as 90%, and the recall rate (recall) also reaches more than 70%. Therefore, it can be determined that the model created this time is effective and effective, and can be used in the identification of actual bathing events.