1. Introduction
In the process of using household water heaters, residents will form different usage habits due to regional climate, different regions, and user age and gender differences. If home furnishing companies can deeply understand the usage habits of their products in different user groups, and develop functions that meet customer needs and habits, they will be able to open up new markets.
This article will build a bathing event recognition model based on the BP neural network algorithm, and then identify the bathing events of users in different regions, and then compare different customer groups according to the recognition results to provide the most suitable personalized products and improve the intelligent research and development of new products And formulate corresponding marketing strategies.
2. Data exploration and analysis
Explore and analyze the flow status of the water heater, among which the attributes of “whether there is water flow” and “water flow” can most intuitively reflect the water flow of the water heater, and explore and analyze these two attributes.
import pandas as pd import matplotlib.pyplot as plt inputfile ='./data/original_data.xls' #'./original_data.xls' # input data file data = pd.read_excel(inputfile) # read data # Check the distribution with or without water flow # Data Extraction lv_non = pd.value_counts(data['with or without water flow'])['no'] lv_move = pd.value_counts(data['with or without water'])['with'] # draw bar chart fig = plt.figure(figsize = (6 ,5)) # set the canvas size plt.rcParams['font.sans-serif'] = 'SimHei' # set Chinese display plt.rcParams['axes.unicode_minus'] = False plt.bar([0,1], height=[lv_non,lv_move], width=0.4, alpha=0.8, color='skyblue') plt.xticks([index for index in range(2)], ['no','yes']) plt.xlabel('water flow status') plt.ylabel('record number') plt.title('Number of records of different water flow states (3001)') plt. show() plt. close() # View water flow distribution water = data['water flow'] # Draw a box diagram of water flow distribution fig = plt.figure(figsize = (5 ,8)) plt.boxplot(water, patch_artist=True, labels = ['water flow'], # set x-axis title boxprops = {'facecolor':'lightblue'}) # set fill color plt.title('Water flow distribution box plot (3001)') # Display the bottom line of the y-axis plt. grid(axis='y') plt. show()
Get the following bar and box plots
It can be seen that there are obviously more records in the state of no water flow than in the state of water flow. The box is close to 0, indicating that there are many records of no water flow, and the distribution of water flow is consistent with the distribution of water flow state.
3. Data preprocessing
3.1 Attribute reduction
Water heaters collect a lot of water consumption data, and the main object of analysis is water heater users, so redundant attributes will be deleted, for example: “water heater number”, “whether there is water flow”, “energy-saving mode”.
import pandas as pd import numpy as np data = pd.read_excel('./data/original_data.xls') print('The initial data shape is:', data.shape) # Delete the water heater number, whether there is water flow, energy-saving mode attributes data.drop(labels=["water heater number","with or without water flow","energy-saving mode"],axis=1,inplace=True) print('The shape of the data after deleting redundant attributes is:', data.shape) data.to_csv('./data/water_heart.csv', index=False)
The result is as follows
3.2 Classification of water use events
Water heater users’ water consumption data is stored in the database, which records various water consumption events, including bathing, washing hands, brushing teeth, washing face, laundry, washing vegetables, etc. What continuous data should be divided in a large number of state records in this case When a complete water use event. The time division of water use is mainly divided into two steps, that is, to determine the time interval of a single water use, and to calculate the time of two adjacent records.
# read data data = pd.read_csv('./data/water_heart.csv') # Divide water usage events threshold = pd.Timedelta('4 min') # The threshold is 4 minutes data['time of occurrence'] = pd.to_datetime(data['time of occurrence'], format = '%Y%m%d%H%M%S') # convert time format data = data[data['water flow'] > 0] # As long as the flow is greater than 0 records sjKs = data['Occurrence time'].diff() > threshold # Adjacent time forward difference, compare whether it is greater than the threshold sjKs.iloc[0] = True # Let the first time be the start event of the first water use event sjJs = sjKs.iloc[1:] # The result of backward difference sjJs = pd.concat([sjJs,pd.Series(True)]) # Let the last time be the end time of the last water event # Create a data frame and define the sequence of water events sj = pd.DataFrame(np.arange(1,sum(sjKs) + 1),columns = ["event number"]) sj["Event start number"] = data.index[sjKs == 1] + 1 # Define the start number of water use events sj["event termination number"] = data.index[sjJs == 1] + 1 # Define the termination number of the water use event print('When the threshold is 4 minutes, the number of events is:', sj.shape[0]) sj.to_csv('./data/sj.csv', index = False)
The result is as follows
3.3 Determine the duration threshold of a single water use event
For the data of a water heater user, the water use events are divided according to different thresholds to obtain the corresponding number of events, and the number of time is obtained by threshold changes and divisions. Count the number of water use events under each threshold, and then find the optimal threshold through threshold optimization.
# Determine the duration threshold of a single water use event n = 4 # use the average slope of the next four points threshold = pd.Timedelta(minutes=5) # expert threshold data['time of occurrence'] = pd.to_datetime(data['time of occurrence'], format='%Y%m%d%H%M%S') data = data[data['water flow'] > 0] # As long as the flow is greater than 0 records # Custom function: input the time threshold for dividing time, and get the number of divided events def event_num(ts): d = data['Occurrence time'].diff() > ts # Make a difference between adjacent times, and compare whether it is greater than the threshold return d.sum() + 1 # This directly returns the number of events dt = [pd.Timedelta(minutes=i) for i in np.arange(1, 9, 0.25)] h = pd.DataFrame(dt, columns=['threshold']) # convert the data frame and define the threshold column h['number of events'] = h['threshold'].apply(event_num) # Calculate the number of events corresponding to each threshold h['slope'] = h['number of events'].diff()/0.25 # Calculate the slope corresponding to every two adjacent points h['slope index']= h['slope'].abs().rolling(4).mean() # Take the average of the absolute value of n slopes forward as the slope index ts = h['threshold'][h['slope indicator'].idxmin() - n] # Use idxmin to return the index of the minimum value, because rolling_mean() calculates the absolute value average of the first n slopes # So the result is translated (-n) if ts > threshold: ts = pd.Timedelta(minutes=4) print('The threshold value of the calculated single water use time is:',ts)
The result is as follows
3.4 Attribute construction
1) Construct water use duration and frequency attributes
The duration of water use for different water use events is one of the basic attributes. The attributes related to the duration of water use can only distinguish some water use events, and the water use pauses and frequencies of different water use events are also different. Build water usage duration and water usage frequency attributes.
data = pd.read_csv('./data/water_heart.csv', encoding='utf-8') # read water heater usage data records sj = pd.read_csv('./data/sj.csv') # read water usage event records # convert time format data["occurrence time"] = pd.to_datetime(data["occurrence time"], format="%Y%m%d%H%M%S") # Structural features: total water usage time timeDel = pd. Timedelta("0.5 sec") sj["event start time"] = data.iloc[sj["event start number"]-1,0].values- timeDel sj["event end time"] = data.iloc[sj["event end number"]-1,0].values + timeDel sj['bathing time'] = [i.hour for i in sj["event start time"]] sj["total water consumption time"] = np.int64(sj["event end time"] - sj["event start time"])/1000000000 + 1 # Construct water pause event # Construct features "pause start time", "pause end time" # Pause start time means from water flow to no water flow, pause end time means from no water flow to water flow for i in range(len(data)-1): if (data.loc[i,"water flow"] != 0) & amp; (data.loc[i + 1,"water flow"] == 0) : data.loc[i + 1,"Pause start time"] = data.loc[i + 1, "Occurrence time"] - timeDel if (data.loc[i,"water flow"] == 0) & amp; (data.loc[i + 1,"water flow"] != 0) : data.loc[i,"pause end time"] = data.loc[i, "occurrence time"] + timeDel # Extract the line numbers corresponding to the start time and end time of the pause, and put them in the data frame Stop indStopStart = data.index[data["stop start time"].notnull()] + 1 indStopEnd = data.index[data["stop time"].notnull()] + 1 Stop = pd.DataFrame(data={"Stop start number":indStopStart[:-1], "Stop stop number":indStopEnd[1:]}) # Calculate the pause duration and put it in the data frame stop, pause duration = pause end time - pause end time Stop["Stop duration"] = np.int64(data.loc[indStopEnd[1:]-1,"Stop end time"].values- data.loc[indStopStart[:-1]-1,"stop start time"].values)/1000000000 # Match each pause with an event, the start time of the pause must be greater than the start time of the event, # And the end time of the pause is less than the end time of the event for i in range(len(sj)): Stop.loc[(Stop["Stop start number"] > sj.loc[i,"Event start number"]) & amp; (Stop["stop number"] < sj.loc[i,"event stop number"]),"stop attribution event"]=i + 1 # Delete events with a pause count of 0 Stop = Stop[Stop["Stop Attributed Event"].notnull()] # Structural features The total duration of pauses in water events, the number of pauses, the average duration of pauses, # Duration of water use, water usage/total duration stopAgg = Stop.groupby("Stop attribution event").agg({"Stop duration":sum,"Stop start number":len}) sj.loc[stopAgg.index - 1,"total stop duration"] = stopAgg.loc[:,"stop duration"].values sj.loc[stopAgg.index-1,"stop times"] = stopAgg.loc[:,"stop start number"].values sj.fillna(0,inplace=True) # use 0 interpolation for missing values stopNo0 = sj["stop times"] != 0 # Determine whether there is a stop in the water event sj.loc[stopNo0,"average stop duration"] = sj.loc[stopNo0,"total stop duration"]/sj.loc[stopNo0,"stop times"] sj.fillna(0,inplace=True) # use 0 interpolation for missing values sj["Water use time"] = sj["Total water use time"] - sj["Total stop time"] #Define the characteristic water use time sj["Water usage/total duration"] = sj["Water duration"] / sj["Total usage duration"] #Define features Water usage/Total duration print('After the construction of water use time and frequency features of water use events, the characteristics of the data are:\\ ', sj.columns) print('The first 5 rows and 5 columns of the data after the completion of the construction of water use time and frequency features are:\\ ',sj.iloc[:5,:5])
The result is as follows
2) Construct water consumption and fluctuation attributes
In addition to the duration, pause, and frequency of water use, water consumption is also an important attribute to identify whether the event is a bathing event. At the same time, fluctuations in water use are also the key to distinguish different water use events. Construct water consumption and water fluctuation attributes on the basis of water use duration and frequency attributes.
data["water flow"] = data["water flow"] / 60 # The original unit is L/min, now converted to L/sec sj["total water consumption"] = 0 # Assign an initial value of 0 to the total water consumption for i in range(len(sj)): Start = sj.loc[i,"event start number"]-1 End = sj.loc[i,"event termination number"]-1 if Start != End: for j in range(Start,End): if data.loc[j,"water flow"] != 0: sj.loc[i,"total water consumption"] = (data.loc[j + 1,"occurrence time"] - data.loc[j,"Occurrence time"]).seconds* \ data.loc[j,"water flow"] + sj.loc[i,"total water consumption"] sj.loc[i,"total water consumption"] = sj.loc[i,"total water consumption"] + data.loc[End,"water flow"] * 2 else: sj.loc[i,"total water consumption"] = data.loc[Start,"water flow"] * 2 sj["Average Water Flow"] = sj["Total Water Consumption"] / sj["Water Time Length"] #Definition Features Average Water Flow # Structural features: water flow fluctuations # Fluctuation of water flow=∑((((value of single water flow-average water flow)^2)*duration)/duration of water use sj["water flow fluctuation"] = 0 # assign an initial value of 0 to the water flow fluctuation for i in range(len(sj)): Start = sj.loc[i,"event start number"] - 1 End = sj.loc[i,"event end number"] - 1 for j in range(Start,End + 1): if data.loc[j,"water flow"] != 0: slbd = (data.loc[j,"water flow"] - sj.loc[i,"average water flow"])**2 slsj = (data.loc[j + 1,"time of occurrence"] - data.loc[j,"time of occurrence"]).seconds sj.loc[i,"water flow fluctuation"] = slbd * slsj + sj.loc[i,"water flow fluctuation"] sj.loc[i,"Water flow fluctuation"] = sj.loc[i,"Water flow fluctuation"] / sj.loc[i,"Water duration"] # Structural features: pause duration fluctuates # Pause duration fluctuation=∑(((single pause duration-average pause duration)^2)*duration)/total pause duration sj["pause duration fluctuation"] = 0 # assign an initial value of 0 to the pause duration fluctuation for i in range(len(sj)): if sj.loc[i,"Number of Pauses"] > 1: # When the number of pauses is 0 or 1, the fluctuation value of pause duration is 0, so it is excluded for j in Stop.loc[Stop["Stop attribution event"] == (i + 1),"Stop duration"].values: sj.loc[i,"pause duration fluctuation"] = ((j - sj.loc[i,"average pause duration"])**2) * j + \ sj.loc[i,"pause duration fluctuation"] sj.loc[i,"pause duration fluctuation"] = sj.loc[i,"pause duration fluctuation"] / sj.loc[i,"total pause duration"] print('The characteristics of the data after the construction of water consumption and fluctuation characteristics are:\\ ', sj.columns) print('The features of the first 5 rows and 5 columns of the data after the construction of water consumption and fluctuation features are:\\ ',sj.iloc[:5,:5])
The result is as follows
3.5 Screening Candidate Bathing Events
The identification of a bathing event is based on the identification of a water use event, that is, identifying which water use events are bathing events from the divided water use events. Candidate bathing events are screened based on the constructed water duration and water consumption attributes.
sj_bool = (sj['Water use time'] >100) & amp; (sj['Total water use time'] > 120) & amp; (sj['Total water use'] > 5) sj_final = sj.loc[sj_bool,:] sj_final.to_excel('./data/sj_final.xlsx',index=False) print('The shape of the data before filtering out the candidate bathing event is:', sj.shape) print('The shape of the data after filtering out candidate bathing events is:', sj_final.shape)
The result is as follows
Before screening, the total number of water use events was 172, and after screening, there were 75 water use events remaining.
4. Model building
Based on the modeling sample data, a BP neural network model is established to identify bathing events. The construction of BP neural network model needs to pay attention to the magnitude difference between the attributes of the data itself, so it needs to be standardized to eliminate the magnitude difference. In order to facilitate the subsequent application of the model, the model can be saved with the joblib.dump function.
import pandas as pd from sklearn.preprocessing import StandardScaler from sklearn.neural_network import MLPClassifier from sklearn.externals import joblib # read data Xtrain = pd.read_excel('./data/sj_final.xlsx') ytrain = pd.read_excel('./data/water_heater_log.xlsx') test = pd.read_excel('./data/test_data.xlsx') # Separate the training set from the test set. x_train, x_test, y_train, y_test = Xtrain.iloc[:,5:],test.iloc[:,4:-1],\ ytrain.iloc[:,-1],test.iloc[:,-1] # normalize stdScaler = StandardScaler(). fit(x_train) x_stdtrain = stdScaler. transform(x_train) x_stdtest = stdScaler. transform(x_test) # Modeling bpnn = MLPClassifier(hidden_layer_sizes = (17,10), max_iter = 200, solver = 'lbfgs', random_state=50) bpnn.fit(x_stdtrain, y_train) # save the model joblib.dump(bpnn,'./data/water_heater_nnet.m') print('The model built is:\\ ',bpnn)
The result is as follows
When training the BP neural network, the parameters of the neural network were optimized, and it was found that the training effect of the neural network with two hidden layers was better, and the training effect was better when the hidden nodes of the two hidden layers were 17 and 10 respectively. good.
5. Model checking
Combined with the relevant knowledge of model evaluation, it is more objective and accurate to use precision rate (precision), recall rate (recall) and F1 value to measure the effect of model evaluation. At the same time, combined with the ROC curve, the effect of the model can be evaluated more intuitively.
# model evaluation from sklearn.metrics import classification_report from sklearn.metrics import roc_curve from sklearn.externals import joblib import matplotlib.pyplot as plt bpnn = joblib.load('./data/water_heater_nnet.m') # load the model y_pred = bpnn.predict(x_stdtest) # return the prediction result print('neural network prediction result evaluation report:\\ ', classification_report(y_test,y_pred)) # draw roc curve plt.rcParams['font.sans-serif'] = 'SimHei' # display Chinese plt.rcParams['axes.unicode_minus'] = False # display negative sign fpr, tpr, thresholds = roc_curve(y_pred,y_test) # Find TPR and FPR plt.figure(figsize=(6,4)) # create canvas plt.plot(fpr,tpr) # draw the curve plt.title('User water event identification ROC curve (3001)') # title plt.xlabel('FPR') # x-axis label plt.ylabel('TPR') # y-axis label plt.savefig('./data/User water event recognition ROC curve.png') # save the picture plt.show() # display graphics
The result is as follows
It can be obtained that the precision rate (precision) in bathing event recognition is very high, as high as 90%, and the recall rate (recall) also reaches more than 70%. Therefore, it can be determined that the model created this time is effective and effective, and can be used in the identification of actual bathing events.