Data mining and time series forecasting on the individual household electric power consumption dataset

I received a task today: given a data set, perform data mining analysis on it and build a model to forecast the data over a future period.

The official description of the data is introduced here. The data set contains a total of 9 fields, as follows:

1. Date: the date, in the format dd/mm/yyyy

2. Time: the time, in the format hh:mm:ss

3. Global_active_power: household global minute-averaged active power (in kilowatts)

4. Global_reactive_power: household global minute-averaged reactive power (in kilowatts)

5. Voltage: minute-averaged voltage (in volts)

6. Global_intensity: household global minute-averaged current intensity (in amperes)

7. Sub_metering_1: energy sub-metering No. 1 (in watt-hours of active energy). It corresponds to the kitchen and mainly covers a dishwasher, an oven and a microwave (the hot plate is not electric, but gas-powered).

8. Sub_metering_2: energy sub-metering No. 2 (in watt-hours of active energy). It corresponds to the laundry room, which contains a washing machine, a tumble dryer, a refrigerator and a light.

9. Sub_metering_3: energy sub-metering No. 3 (in watt-hours of active energy). It corresponds to an electric water heater and an air conditioner.

If you need the dataset, you can download it yourself here. A screenshot of the data details is shown below:

The first step is to load and read the data set, which can be done directly with Pandas:

import pandas as pd

def loadData(data="household_power_consumption.txt"):
    """
    Load the dataset and fill missing numeric values with the column mean.
    """
    # missing values are encoded as '?' in the raw file
    df = pd.read_csv(data, sep=";", na_values="?", low_memory=False)
    print(df.head(10))
    names = df.columns.tolist()
    # skip the first two columns (Date, Time); fill the numeric columns with their mean
    for one_name in names[2:]:
        df[one_name].fillna(df[one_name].mean(), inplace=True)
    data_list = df.values.tolist()
    return data_list

While loading the local data set, each column is filled based on its mean value. Other filling methods can be used here as well, such as pandas' built-in mode, median, specified-value or mean filling. In fact, when I was working on the environmental protection brain project, we used more fine-grained processing for time series data, mainly three methods: sliding-window filling, moving weighted filling, and Kalman-filter filling. A comparison of the different filling algorithms is shown below:

Considering time constraints, I mainly use the more common sliding-window filling algorithm. A schematic diagram of the filling principle is shown below:

This method fills missing values in time series data in a more fine-grained way, instead of simply plugging in the mean or median directly.

The code implementation is as follows:
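As a rough illustration, a sliding-window fill for a single factor could look like the following minimal sketch (the function name dataProcessing and the window parameter are assumptions, not the original code):

import numpy as np

def dataProcessing(factor, window=5):
    """
    Sliding-window filling for a single factor (1-D sequence).
    Each missing value is replaced by the mean of the valid
    observations inside a window centred on it.
    """
    values = np.asarray(factor, dtype=float)
    for i in np.where(np.isnan(values))[0]:
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        neighbors = values[lo:hi]
        neighbors = neighbors[~np.isnan(neighbors)]
        # fall back to the global mean if the whole window is missing
        values[i] = neighbors.mean() if len(neighbors) else np.nanmean(values)
    return values.tolist()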

dataProcessing performs the filling computation for each factor; the filled factor data is then transposed to obtain the new, complete data set.

Next, let's do a simple visualization of the original data set, as shown below:

Excluding the first two columns, which are time columns, there are 7 data columns in total. Visualizing all of them over the full range is not very useful: with more than 2 million samples the image is extremely dense and the overall trend is hard to see. The data is therefore thinned out and the curves are drawn at hourly granularity, as shown below:
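A minimal sketch of the thinning step with pandas resample, assuming df is the DataFrame read in loadData (before it is converted to a list):

import pandas as pd
import matplotlib.pyplot as plt

# build a datetime index from the Date and Time columns
df["datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"],
                                format="%d/%m/%Y %H:%M:%S")
df = df.set_index("datetime").drop(columns=["Date", "Time"])

hourly = df.resample("H").mean()   # hourly granularity
daily = df.resample("D").mean()    # daily granularity (used further below)
hourly.plot(subplots=True, figsize=(12, 10))
plt.show()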

It can be seen that the data is still fairly dense at this granularity and needs to be thinned further.

Following the same idea, we can draw the curves at daily granularity, as shown below:

At daily granularity the data is clearly visible. Judging from the trends of the different factors, the periodicity of the data is quite obvious.

Next, I want to mine and analyze the different factors, compute the correlations between variables, and draw a heat map. If you run into problems implementing this, please refer to the articles I wrote earlier:

“Visual analysis of correlation heat map based on seaborn”

“Python draws favorite heat map based on seaborn, list of different color systems”

“Three major correlation coefficients in python practical statistics, and draw a heat map of correlation analysis”

The code implementations and example results there are very detailed; I believe they can help you implement this part.

A correlation heat map is essentially a visualization of the correlation values between variables, so the core is computing those correlations. The measures I often use are the Pearson coefficient, the Spearman coefficient and the Kendall coefficient; of course other measures can be used as well, and you can choose according to your own preference. The core implementation is as follows:
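A minimal sketch of the three coefficients plus a simple weighted combination, assuming daily is the daily-granularity DataFrame built above (the equal weights are an assumption):

import seaborn as sns
import matplotlib.pyplot as plt

methods = ["pearson", "spearman", "kendall"]
for method in methods:
    corr = daily.corr(method=method)
    plt.figure(figsize=(8, 6))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
    plt.title(f"{method.capitalize()} correlation")
    plt.tight_layout()
    plt.show()

# simple weighted combination: equal-weight average of the three matrices
weighted = sum(daily.corr(method=m) for m in methods) / len(methods)
sns.heatmap(weighted, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Weighted average correlation")
plt.show()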

There are four options in total: the three individual coefficients and a simple weighted combination. Next, let's look at the results of the visual analysis based on the daily-granularity data:

【Pearson Coefficient】

【Spearman Coefficient】

【Kendall Coefficient】

【Weighted average method】

It can be seen that the results obtained by different calculation methods are slightly different, but the overall trend is the same.

Global_active_power and Global_intensity are highly correlated.

Global_active_power is highly correlated with Sub_metering_1, Sub_metering_2, and Sub_metering_3.

That concludes the simple numerical analysis; other relationships can also be read from the heat map, so I won't describe them here.

Next, let's analyze different time periods and different kinds of dates (working days, non-working days, holidays) and try to understand the characteristics of electricity consumption at different time granularities. The code implementation is as follows:
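A minimal sketch of the hour-of-day aggregation, assuming df has the datetime index built in the resampling sketch above:

import matplotlib.pyplot as plt

# average each factor over the hour of day (00-23)
hourly_profile = df.groupby(df.index.hour).mean()
hourly_profile.plot(subplots=True, figsize=(12, 10))
plt.show()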

Let’s first look at the differences in data presented at the hourly granularity:

The voltage is relatively stable, with almost no difference across hours.

The difference in Global_intensity is fairly obvious overall; it is low in the middle of the night and in the early morning.

Global_active_power and Global_reactive_power show similar overall trends.

The overall trends of Sub_metering_1, Sub_metering_2, and Sub_metering_3 are basically the same.

Next, let's look at the overall differences between day and night. The code implementation is similar to the hourly granularity, so I won't go into details here.

It is more intuitive to draw this on a chart. One point to note is that different people may divide day and night differently; my settings are:

day_list = ["08","09","10","11","12","13","14","15","16","17","18"]
night_list = ["00","01","02","03","04","05","06","07","19","20","21","22","23"]

This could be refined further; for example, the day and night periods differ across seasons. Due to time constraints, I did not go into more detail here.

Finally, we want to explore the differences in power consumption across time periods on working days, rest days, and holidays. The overall implementation is the same as above, so here we just look at the results. I divided the dates into three categories: working days, non-working days and holidays, as follows:
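The date-type split itself can be done with a few lines; here is a minimal sketch, where the holiday list is a hypothetical placeholder that should be replaced with a real holiday calendar:

import pandas as pd

# assumption: example dates only, replace with the actual holiday calendar
holidays = {"2008-12-25", "2009-01-01"}

day_type = pd.Series("working", index=df.index)
day_type[df.index.dayofweek >= 5] = "non-working"
day_type[df.index.strftime("%Y-%m-%d").isin(holidays)] = "holiday"

# average consumption per date type and hour of day
profile = df.groupby([day_type, df.index.hour]).mean()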

The difference in electricity consumption between working days and non-working days is quite obvious. Since non-working days and holidays partly overlap, their data trends are also fairly similar. Finally, for a more intuitive presentation, they are also drawn together, as follows:

At this point the data processing and EDA are basically done; the next main task is to build a model to forecast future electricity consumption. Time series forecasting tasks come in two main types: univariate forecasting and multivariate sequence forecasting. Different methods suit different scenarios and development goals. Combined with the earlier heat map analysis, a multivariate sequence forecasting model is more appropriate here.

In the previous steps, the original data set was parsed, processed and stored in the feature.json file, which can be used directly here. While loading the data set, it is thinned out again, otherwise training the model would be extremely time-consuming. The data is then normalized to remove the influence of different scales, which speeds up convergence of subsequent model iterations and also helps improve accuracy. The implementation of this part is as follows:
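A minimal sketch of the loading and normalization step, assuming feature.json stores a mapping from factor name to its (already thinned) list of values; the file layout is an assumption:

import json
import numpy as np
from sklearn.preprocessing import MinMaxScaler

with open("feature.json") as f:
    feature = json.load(f)

# stack the factors into a (samples, factors) matrix
data = np.array([feature[name] for name in sorted(feature)]).T
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)   # scale each factor to [0, 1]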

Next, you can build the training data set. Time series data is inherently sequential, and the common approach is to build the samples with a sliding window. A schematic diagram of the principle is shown below:

You can also freely set the step size and interval to adjust the shape of the resulting data set.
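A minimal sketch of the window construction for a single series; the parameter names window, step and horizon are assumptions:

import numpy as np

def create_dataset(series, window=24, step=1, horizon=1):
    """
    Build supervised samples from a sequence with a sliding window.
    window:  number of past steps used as input
    step:    stride between consecutive windows
    horizon: how far ahead the target lies
    """
    X, y = [], []
    for i in range(0, len(series) - window - horizon + 1, step):
        X.append(series[i:i + window])
        y.append(series[i + window + horizon - 1])
    return np.array(X), np.array(y)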

After the data set is built, the model can be developed. For regression tasks such as time series forecasting, the most basic models are statistical ones such as ARIMA, but in practice they are now applied in very few scenarios, so the baseline models for this type of task are usually machine learning models such as SVR, RFR and XGBR. Different models can be implemented in a simple, unified way with the sklearn module. Here I take XGBoost as an example to look at the actual implementation.

import xgboost as xgb

# the hyperparameters (max_depth, learning_rate, ...) come from the search
# space defined elsewhere; pruning_callback is presumably an Optuna pruning
# callback. The deprecated `silent` flag is dropped in favour of verbosity.
model = xgb.XGBRegressor(
    colsample_bytree=colsample_bytree,
    booster=booster,
    max_depth=max_depth,
    learning_rate=learning_rate,
    n_estimators=n_estimators,
    verbosity=0,
    objective=objective,
    gamma=gamma,
    min_child_weight=min_child_weight,
    subsample=subsample,
    reg_alpha=reg_alpha,
    reg_lambda=reg_lambda,
    tree_method=tree_method,
    callbacks=[pruning_callback],
)

Next, take Global_active_power as an example to see the actual prediction effect:

The overall comparison is as follows:

With XGBoost's built-in feature importance tooling, it is also very convenient to analyze and visualize feature importance, as follows:
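A minimal sketch, assuming model is the trained regressor from above:

import xgboost as xgb
import matplotlib.pyplot as plt

xgb.plot_importance(model)   # built-in importance plot
plt.show()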

Next, look at the results for the reactive-power factor Global_reactive_power:

Finally, let's look at the overall effect across all factors. Different parameters and different train/test split ratios will give different results, and you can tune the parameters according to your actual configuration. Here we look directly at the results:

The model parameters can be configured from experience and then tuned with parameter-optimization methods such as grid search or random search. Here is a simple code example:

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from xgboost import XGBRegressor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
xgb_model = XGBRegressor()

# Set the parameter space for the search
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.001],
    'n_estimators': [100, 200, 300]
}

# Use GridSearchCV for grid search
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
# Print the best parameter combination and the corresponding R2 score
print("Best parameters: ", grid_search.best_params_)
print("Best R2 score: ", grid_search.best_score_)

# Use RandomizedSearchCV for random search
random_search = RandomizedSearchCV(estimator=xgb_model, param_distributions=param_grid,
                                   cv=5, n_iter=10, random_state=42)
random_search.fit(X_train, y_train)
# Print the best parameter combination and the corresponding R2 score
print("Best parameters: ", random_search.best_params_)
print("Best R2 score: ", random_search.best_score_)

You can modify the data set according to your actual situation, and you can also add more parameters to the search space as you wish.

Of course, there are also some excellent hyperparameter optimization modules in the open source community. The following are some common ones for machine learning models:
GridSearchCV: The GridSearchCV class in the Scikit-learn library provides a grid search based method for hyperparameter optimization. It finds the best combination of hyperparameters by exhaustively searching all possible combinations in a given parameter space and evaluating the performance of each combination.
RandomizedSearchCV: The RandomizedSearchCV class in the Scikit-learn library provides hyperparameter optimization based on random search. It is similar to GridSearchCV, but instead of exhaustive search it randomly samples a subset of combinations from the given parameter space to reduce computational overhead.
Bayesian Optimization: Bayesian optimization is a hyperparameter optimization method based on probabilistic models and Bayesian inference. It leverages prior and posterior information to select the next hyperparameter combination to evaluate, allowing more efficient exploration of the parameter space. Commonly used Bayesian optimization libraries include scikit-optimize (skopt), GPyOpt, etc.
Optuna: Optuna is an open source framework for hyperparameter optimization that supports multiple optimization algorithms, such as the TPE (Tree-structured Parzen Estimator) algorithm and the CMA-ES algorithm. It provides a concise API and visualization tools, making the hyperparameter optimization process more convenient and transparent.
Hyperopt: Hyperopt is another library for hyperparameter optimization that employs a tree-based Parzen Estimator (TPE) to search high-dimensional parameter spaces. It also supports parallel and distributed computing to speed up the optimization process.
Talos: Talos is a hyperparameter optimization library for Keras models. It can automatically adjust parameters such as learning rate, batch size, optimizer type, etc., and provides a series of evaluation indicators to evaluate different hyperparameter combinations. 

Here I mined the optimal parameters with one of these modules. The advantage is that it is very intelligent; the disadvantage is that you must first read the documentation and understand the demos before developing your own part, and after that comes a long computation process:
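The search can be set up roughly as follows; this is a minimal Optuna sketch, and the search ranges, trial count and cross-validation setup are assumptions:

import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = xgb.XGBRegressor(**params, verbosity=0)
    # average R2 over a 3-fold cross-validation
    return cross_val_score(model, X_train, y_train, cv=3, scoring="r2").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)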

After the computation is complete, you can store the optimal parameters and use them directly in the code, or you can skip this step, because the whole tuning process is recorded and the logs can be visualized, as shown below:

This is not the focus here, so I will not expand on it.

If you want to use other models such as random forests or support vector machines, the processing logic is exactly the same; you can abstract the model instantiation into a separate step so that the overall logic does not need to change.

In addition to machine learning models, deep learning models are also commonly used in time series forecasting tasks. With multivariate sequence forecasting as the benchmark, the choice of models is quite wide: LSTM, RNN, GRU and CNN models can all be used, as well as combined models such as CNN-LSTM and CNN-GRU. Due to time constraints I did not run experiments on each of them one by one; here the models are built mainly with Keras. After initializing a model, its sequential structure is fairly clear. Take a brief look at the parameter structure diagrams of the models, as shown below:

【CNN-GRU】

【GRU】

Although the data has been thinned at the source, the computation required by the deep learning models is still relatively heavy. Considering other ongoing projects, this is only a demo; let's take a brief look at the training:

Only 20 epochs are set in this experimental run; you are free to build experiments of different scales:
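As a rough illustration, a 20-epoch GRU baseline in Keras could look like the following minimal sketch; window, n_features, X_train and y_train are assumed to come from the data set construction above, and predicting a single target factor per model is an assumption:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense

model = Sequential([
    GRU(64, input_shape=(window, n_features)),   # multivariate input window
    Dense(32, activation="relu"),
    Dense(1),                                    # next value of the target factor
])
model.compile(optimizer="adam", loss="mse")
history = model.fit(X_train, y_train, epochs=20, batch_size=64,
                    validation_split=0.1)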

Let's look directly at the results:

【Model loss curve】

【Result Comparison Visualization Curve】

Forecast for the week ahead:

During the comparative experiments, I found that relatively weak models often “drift” in mid-to-long-term prediction. Due to time constraints I did not continue to add data or increase the number of epochs; if you are interested, you can follow this line of thinking and verify it yourself.

To make model evaluation and visualization easier, I implemented a dedicated helper method, which is also available in my previous articles. Here is a quick look:
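A minimal sketch of such an evaluation helper; the function name and the chosen metrics are assumptions:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return common regression metrics for a set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    return {"MAE": mae, "RMSE": rmse, "R2": r2}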

Loss visualization:

Many of the related implementations, such as the models, tools and visualizations, have been covered in my previous articles, so I won't expand on them one by one here; this has already taken several hours.

Finally, to mine the differences in electricity consumption characteristics across different periods, we use the K-Means algorithm. K-Means is a commonly used unsupervised learning algorithm that divides the data set into K different clusters; each cluster has similar characteristics, and data points within a cluster are close to each other. The detailed steps of the K-Means algorithm are:
1. Select K initial centroids: randomly select K data points from the data set as the initial centroids.
2. Assign data points to the nearest centroid: for each data point, calculate its distance to each centroid and assign it to the cluster of the nearest centroid.
3. Update centroid locations: for each cluster, calculate the mean of all data points in that cluster and use it as the new centroid.
4. Repeat steps 2 and 3 until the centroids no longer change or the predefined iteration limit is reached.
5. Final clustering result: each data point is assigned to a cluster, giving the final partition.
The goal of the K-Means algorithm is to minimize the sum of squared errors (SSE) between the data points and their cluster centroid. By iteratively optimizing the centroid locations, the algorithm tries to find the cluster partition that minimizes the SSE.
Features and precautions of the K-Means algorithm:
Selection of K value: K is provided to the algorithm as an input parameter, and an appropriate value needs to be selected according to actual problems and experiences. Different K values may lead to different clustering results.
Selection of initial centroids: The selection of initial centroids can affect the final clustering results, so care should be taken to use random seeds or run the algorithm multiple times to avoid local optimal solutions.
Data preprocessing: Before applying the K-Means algorithm, it is usually necessary to standardize or normalize the data to ensure that the individual features have similar importance.
Although the K-Means algorithm performs well in many scenarios, there are some limitations:
Sensitive to initial centroid position: The choice of initial centroid may affect the final result, and the algorithm may fall into a local optimal solution.
Difficulty dealing with non-spherical clusters: The K-Means algorithm assumes that the clusters are convex and have the same variance, so it may be less effective for non-spherical clusters of different sizes.
A value of K needs to be specified: The choice of the value of K is often subjective, and larger values of K may lead to overfitting.
Here it is necessary to determine the appropriate number of cluster centers K, and the commonly used method is the elbow method.

The elbow method is a commonly used technique for choosing the optimal number of clusters in K-Means. It evaluates the clustering result under different K values based on the SSE (sum of squared errors). The steps for determining the optimal number of clusters with the elbow method are:
Select the K value within a given range: First, select an appropriate range of K values, for example, starting from 2 to the preset maximum number of clusters.
Calculate the SSE corresponding to each K value: For each K value, run the K-Means algorithm on the data set, and calculate the SSE under the K value.
Draw the relationship between SSE and K value: draw the SSE corresponding to each K value as a line graph or a curve graph.
Look for the “elbow point”: Look at the SSE vs. K value graph and look for an obvious inflection point or “elbow point” after which further increases in K value will result in a smaller reduction in SSE.
Determine the optimal number of clusters: select the K value corresponding to the elbow point as the optimal number of clusters.
It should be noted that the elbow method does not always clearly indicate the optimal number of clusters, especially when the data set does not have obvious elbow points. In this case, choosing an appropriate number of clusters can be considered in combination with other evaluation metrics, domain knowledge, and practical problems.

Here we apply the elbow method to the daily-granularity data and draw the corresponding curve, as follows:
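A minimal sketch of how the curve can be produced with sklearn, assuming daily_scaled is the normalized daily-granularity feature matrix (the name and the K range are assumptions):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

sse = []
k_range = range(2, 11)
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(daily_scaled)
    sse.append(km.inertia_)   # inertia_ is the SSE for this K

plt.plot(list(k_range), sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE")
plt.title("Elbow method")
plt.show()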

Combined with the analysis of the result graph, the optimal number of cluster centers is considered to be 4.

Next, set the number of cluster centers to 4 and run the clustering. Using K-Means is very simple; you can directly use the built-in module of sklearn, as follows:
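A minimal sketch of the clustering step, again assuming daily_scaled is the normalized daily-granularity matrix:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(daily_scaled)   # cluster label for each day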

The clustered result graphs for the 7 factors are drawn below:

The voltage remains almost unchanged throughout, while the differences in Global_active_power, Global_reactive_power, Global_intensity and Sub_metering_3 are quite obvious.

I ended up writing for quite a while without noticing; consider this a record of the overall process.
