Power System Load Forecasting with Python Data Mining

Article directory

  • Preface
  • 1. Case background
  • 2. Analysis goals
  • 3. Analysis process
  • 4. Data preparation
    • 4.1 Data exploration
    • 4.2 Missing value processing
  • 5. Attribute construction
    • 5.1 Device data
    • 5.2 Cycle data
  • 6. Model training
  • 7. Performance measurement
  • 8. Recommended reading and fan benefits

Preface

Based on the collected power data, this case mines the current, voltage and power of each electrical device, analyzes each device's actual power consumption, and thereby provides a reference for power companies when formulating power energy strategies. For more details, please refer to the book “Python Data Mining: Introduction, Advanced and Practical Case Analysis”.

1. Case background

In order to better monitor the energy consumption of electrical equipment, power sub-metering technology was born. Electric power sub-metering is of great significance for power companies to accurately predict power loads, scientifically formulate power grid dispatch plans, and improve the stability and reliability of power systems. For users, electricity sub-metering can help users understand the usage of electrical equipment, improve users’ awareness of energy conservation, and promote scientific and rational use of electricity.

2. Analysis goals

Given this background and the business need for non-intrusive load detection and decomposition of power data, the goals of this case are as follows.

  • Analyze the operating attributes of each electrical device.
  • Build a device identification attribute library.
  • Use the K-nearest-neighbors (KNN) model to “decompose” the independent power consumption of each electrical device from the whole-line data.

3. Analysis process

The detailed analysis process is shown in the figure below: from the data sources, through data preparation and attribute construction, to model training and the final performance measurement.

4. Data preparation

4.1 Data exploration

The power data mining in this case does not involve the operation record data, so mainly the device data, cycle data and harmonic data are obtained. Since there are many data tables, each with many attributes, the data needs to be explored and analyzed after it is obtained. During exploration, the data of each device's different attributes was visualized according to the characteristics of the raw data; some of the results are shown in Figures 1 to 3 below.

(Figure 1 Reactive power and total reactive power)

(Figure 2 Current trace)

(Figure 3 Voltage trace)

As can be seen from the visualization results, the current, voltage and power properties vary between different devices.

Visualize the data attributes as shown in code listing 1.

import pandas as pd
import matplotlib.pyplot as plt
import os

 

# list all device workbooks in attachment 1
filename = os.listdir('../data/Attachment 1')
n_filename = len(filename)

def fun(a):
    # device names corresponding to the workbooks in the directory
    save_name = ['YD1', 'YD10', 'YD11', 'YD2', 'YD3', 'YD4',
                 'YD5', 'YD6', 'YD7', 'YD8', 'YD9']

    plt.rcParams['font.sans-serif'] = ['SimHei']   # render CJK text in figures
    plt.rcParams['axes.unicode_minus'] = False     # render minus signs correctly

    for i in range(a):
        # read the three sheets of each device workbook
        Sb = pd.read_excel('../data/Attachment 1/' + filename[i], 'Device data', index_col=None)
        Xb = pd.read_excel('../data/Attachment 1/' + filename[i], 'Harmonic data', index_col=None)
        Zb = pd.read_excel('../data/Attachment 1/' + filename[i], 'Cycle data', index_col=None)

        plt.plot(Sb['IC'])
        plt.title(save_name[i] + '-IC')
        plt.ylabel('Current (0.001A)')
        plt.show()
        
        plt.plot(Sb['UC'])
        plt.title(save_name[i] + '-UC')
        plt.ylabel('Voltage (0.1V)')
        plt.show()
        
        plt.plot(Sb[['PC', 'P']])
        plt.title(save_name[i] + '-P')
        plt.ylabel('Active power (0.0001kW)')
        plt.show()
        
        plt.plot(Sb[['QC', 'Q']])
        plt.title(save_name[i] + '-Q')
        plt.ylabel('Reactive power (0.0001kVar)')
        plt.show()

       
        plt.plot(Sb[['PFC', 'PF']])
        plt.title(save_name[i] + '-PF')
        plt.ylabel('Power factor (%)')
        plt.show()
        
        plt.plot(Xb.loc[:, 'UC02':].T)
        plt.title(save_name[i] + '-harmonic voltage')
        plt.show()

        plt.plot(Zb.loc[:, 'IC001':].T)
        plt.title(save_name[i] + '-cycle data')
        plt.show()

fun(n_filename)


4.2 Missing value processing

Data exploration revealed missing values in some time attributes, and these need to be processed. Since the length of the missing time span differs from record to record, different treatments are applied: in each device's data, records with a large missing time span are deleted, while small gaps are interpolated with the previous value (forward fill).
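As an illustration of this strategy, here is a minimal sketch (not the book's code; the column names time and IC and the one-second sampling rate are assumptions) that exposes the gaps with a per-second reindex and forward-fills them:

import pandas as pd

# toy series sampled once per second, with a short gap
df = pd.DataFrame({'time': pd.to_datetime(['2018-01-27 17:11:01',
                                           '2018-01-27 17:11:02',
                                           '2018-01-27 17:11:05']),
                   'IC': [33, 34, 35]})

# reindex onto a complete one-second grid, exposing the missing rows
full = df.set_index('time').reindex(
    pd.date_range(df['time'].min(), df['time'].max(), freq='S'))

# small gaps: fill with the previous observation (forward fill);
# in the actual processing, larger gaps are deleted instead
full['IC'] = full['IC'].ffill()
print(full)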

Before processing missing values, the device data table, cycle data table, harmonic data table and operation record table of every device in the training data, as well as the device data table, cycle data table and harmonic data table of every device in the test data, must all be extracted into independent data files. Some of the generated files are shown in Figure 4.

(Figure 4 Partial results of extracting data files)

Extract data files as shown in code listing 2.

import glob
import pandas as pd
import math

 

def file_transform(xls):
    print('A total of %s xlsx files found' % len(glob.glob(xls)))
    print('Processing......')

    for file in glob.glob(xls):
        # read every sheet of the workbook (sheet_name=None returns a dict)
        combine1 = pd.read_excel(file, index_col=0, sheet_name=None)
        for key in combine1:
            # file[8:-5] strips the leading '../data/' and the '.xlsx' suffix,
            # so each sheet lands in a mirrored path under ../tmp/
            combine1[key].to_csv('../tmp/' + file[8: -5] + key + '.csv', encoding='utf-8')
    print('Processing completed')

 

xls_list = ['../data/Attachment 1/*.xlsx', '../data/Attachment 2/*.xlsx']
file_transform(xls_list[0])
file_transform(xls_list[1])

After the data files have been extracted, missing values in them are processed; some of the files generated after processing are shown in Figure 5. Missing value processing is shown in code listing 3.

(Figure 5 Partial results after missing value processing)

def missing_data(evi):
    print('A total of %s CSV files found' % len(glob.glob(evi)))

    for j in glob.glob(evi):
        fr = pd.read_csv(j, header=0, encoding='gbk')
        fr['time'] = pd.to_datetime(fr['time'])
        # helper frame holding the complete one-second time grid
        helper = pd.DataFrame({'time': pd.date_range(fr['time'].min(), fr['time'].max(), freq='S')})

        fr = pd.merge(fr, helper, on='time', how='outer').sort_values('time')
        fr = fr.reset_index(drop=True)
        frame = pd.DataFrame()

        # drop rows inside long gaps (current and next value both missing),
        # keep the remaining rows for forward filling
        for g in range(0, len(list(fr['time'])) - 1):
            if math.isnan(fr.iloc[:, 1][g + 1]) and math.isnan(fr.iloc[:, 1][g]):
                continue
            else:
                scop = pd.Series(fr.loc[g])
                frame = pd.concat([frame, scop], axis=1)

        # the kept rows were collected as columns; transpose back to records
        frame = pd.DataFrame(frame.values.T, index=frame.columns, columns=frame.index)
        frames = frame.ffill()   # fill small gaps with the previous value
        frames.to_csv(j[:-4] + '1.csv', index=False, encoding='utf-8')

    print('Processing completed')

 

evi_list = ['../tmp/Attachment 1/*data.csv', '../tmp/Attachment 2/*data.csv']
missing_data(evi_list[0])
missing_data(evi_list[1])

5. Attribute construction

Although the attributes were initially processed during data preparation, too many of them were introduced and they carry overlapping information. To retain the important attributes and build an accurate, simple model, the original attributes need to be further screened and constructed.

5.1 Device data

During data exploration it was found that the reactive power, total reactive power, active power, total active power, power factor and total power factor differ greatly between devices and are highly discriminative. Therefore, in this case these six quantities are selected as the device-data attributes for building the discriminant attribute library.

After missing-value processing, each device's data went from one table to several, so tables of the same type need to be merged into a single table, e.g. the device data tables of all devices are merged into one table. Moreover, because forward filling repeats the previous value, identical records are produced, and these duplicates need to be removed. The data table generated after processing is shown in Table 1.

Merge and remove duplicate device data as shown in code listing 4:

import glob
import pandas as pd
import os

def combined_equipment(csv_name):
    print('A total of %s CSV files found' % len(glob.glob(csv_name)))
    print('Processing......')

    for i in glob.glob(csv_name):
        # append each file byte-for-byte into one combined CSV
        fr = open(i, 'rb').read()
        file_path = os.path.split(i)
        with open(file_path[0] + '/device_combine.csv', 'ab') as f:
            f.write(fr)

    print('Merger completed!')

    # the byte-level merge keeps every file's header row; reading the result
    # back and dropping duplicates removes both the repeated header rows and
    # the duplicate records introduced by forward filling
    df = pd.read_csv(file_path[0] + '/device_combine.csv', header=None, encoding='utf-8')
    datalist = df.drop_duplicates()
    datalist.to_csv(file_path[0] + '/device_combine.csv', index=False, header=0)

    print('Duplication removal completed')

csv_list = ['../tmp/Attachment 1/*Device data1.csv', '../tmp/Attachment 2/*Device data1.csv']

combined_equipment(csv_list[0])
combined_equipment(csv_list[1])
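The byte-level merge above leans on drop_duplicates to clean up the repeated header rows afterwards. For comparison, an equivalent merge done entirely in pandas (a sketch under the same assumed file layout, not the book's code) keeps a single header and deduplicates in one pass:

import glob
import pandas as pd

def combined_equipment_pd(csv_name):
    # read each per-device CSV (one header per file) and stack the frames
    frames = [pd.read_csv(f, encoding='utf-8') for f in glob.glob(csv_name)]
    return pd.concat(frames, ignore_index=True).drop_duplicates()

# usage mirrors code listing 4 (the path is an assumption):
# combined = combined_equipment_pd('../tmp/Attachment 1/*Device data1.csv')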

5.2 Cycle data

During data exploration it was found that the current in the cycle data fluctuates strongly over time, and the line plots drawn from the cycle-data current of different devices differ in obvious ways. Therefore, in this case the wave peak and wave trough are selected as the cycle-data attributes for building the discriminant attribute library.

Since the original cycle data contains no current peak or trough attributes, these attributes must be constructed. The generated data table is shown in Table 2.
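The construction below uses K-Means clustering: clustering each sampled waveform's current values into two groups makes the two cluster centers natural estimates of the trough and peak levels, arguably more robust to single-sample noise than a raw min/max. A minimal single-waveform sketch (synthetic data, not the book's code):

import numpy as np
from sklearn.cluster import KMeans

# one synthetic current waveform oscillating between a low and a high level
wave = np.where(np.sin(np.linspace(0, 12 * np.pi, 128)) > 0, 520.0, 30.0)
wave = wave + np.random.default_rng(0).normal(0, 5, 128)   # measurement noise

# two clusters: the sorted cluster centers estimate the trough and the peak
model = KMeans(n_clusters=2, random_state=10, n_init=10).fit(wave.reshape(-1, 1))
trough, peak = sorted(model.cluster_centers_.ravel())
print('trough: %.1f, peak: %.1f' % (trough, peak))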

The code for constructing the cycle-data attributes is shown in code listing 5:

import glob
import pandas as pd
from sklearn.cluster import KMeans
import os

 

def cycle(cycle_file):
    for file in glob.glob(cycle_file):
        cycle_YD = pd.read_csv(file, header=0, encoding='utf-8')
        cycle_YD1 = cycle_YD.iloc[:, 0:128]
        models = []

        # cluster each sampled waveform into two groups; the two cluster
        # centers estimate the trough and the peak of that waveform
        for types in range(0, len(cycle_YD1)):
            model = KMeans(n_clusters=2, random_state=10)
            model.fit(pd.DataFrame(cycle_YD1.iloc[types, 1:]))
            models.append(model)

        # collect the sorted cluster centers of every waveform
        mean = pd.DataFrame()
        for model in models:
            r = pd.DataFrame(model.cluster_centers_)
            r = r.sort_values(axis=0, ascending=True, by=[0])
            mean = pd.concat([mean, r.reset_index(drop=True)], axis=1)

        mean = pd.DataFrame(mean.values.T, index=mean.columns, columns=mean.index)
        mean.columns = ['Trough', 'Peak']
        mean.index = list(cycle_YD['time'])
        mean.to_csv(file[:-9] + 'trough peak.csv', index=False, encoding='gbk')

cycle_file = ['../tmp/Attachment 1/*Cycle data1.csv', '../tmp/Attachment 2/*Cycle data1.csv']
cycle(cycle_file[0])
cycle(cycle_file[1])

 

def merge_cycle(cycles_file):
    means = pd.DataFrame()

    # stack the per-device trough/peak tables into one
    for files in glob.glob(cycles_file):
        mean0 = pd.read_csv(files, header=0, encoding='gbk')
        means = pd.concat([means, mean0])

    file_path = os.path.split(glob.glob(cycles_file)[0])
    means.to_csv(file_path[0] + '/zuhe.csv', index=False, encoding='gbk')   # 'zuhe' = combined table

    print('Merger completed')

 
cycles_file = ['../tmp/Attachment 1/*trough peak.csv', '../tmp/Attachment 2/*trough peak.csv']

merge_cycle(cycles_file[0])
merge_cycle(cycles_file[1])

6. Model training

When identifying the device type, the K-nearest-neighbors model is selected: the model is trained on the attribute library built from the constructed attributes, and the trained model is then used to identify the devices in the test data (Attachment 2). Building the discriminant model and identifying device types is shown in code listing 6.

import glob
import pandas as pd
from sklearn import neighbors
import pickle
import os

def model(test_files, test_devices):
    # assemble the training set: trough/peak attributes joined with the
    # merged device data from attachment 1
    zuhe = pd.read_csv('../tmp/Attachment 1/zuhe.csv', header=0, encoding='gbk')
    device_combine = pd.read_csv('../tmp/Attachment 1/device_combine.csv', header=0, encoding='gbk')
    train = pd.concat([zuhe, device_combine], axis=1)
    train.index = train['time'].tolist()
    train = train.drop(['PC', 'QC', 'PFC', 'time'], axis=1)
    train.to_csv('../tmp/' + 'train.csv', index=False, encoding='gbk')

    for test_file, test_device in zip(test_files, test_devices):
        # assemble each test set the same way
        test_bofeng = pd.read_csv(test_file, header=0, encoding='gbk')
        test_devi = pd.read_csv(test_device, header=0, encoding='gbk')
        test = pd.concat([test_bofeng, test_devi], axis=1)
        test.index = test['time'].tolist()
        test = test.drop(['PC', 'QC', 'PFC', 'time'], axis=1)

        # K nearest neighbors classifier on all attributes except the label
        clf = neighbors.KNeighborsClassifier(n_neighbors=6, algorithm='auto')
        clf.fit(train.drop(['label'], axis=1), train['label'])

        predicted = clf.predict(test.drop(['label'], axis=1))
        predicted = pd.DataFrame(predicted)
        file_path = os.path.split(test_file)[1]

        test.to_csv('../tmp/' + file_path[:3] + 'test.csv', encoding='gbk')
        predicted.to_csv('../tmp/' + file_path[:3] + 'predicted.csv', index=False, encoding='gbk')

        # persist the trained model ('wb' overwrites any stale model file)
        with open('../tmp/' + file_path[:3] + 'model.pkl', 'wb') as pickle_file:
            pickle.dump(clf, pickle_file)
        print(clf)


model(glob.glob('../tmp/Attachment 2/*trough peak.csv'),
      glob.glob('../tmp/Attachment 2/*Device data1.csv'))

7. Performance measurement

Based on the device identification results from code listing 6, the model is evaluated; the results are shown below. The confusion matrix is shown in Figure 7 and the ROC curve in Figure 8.

Model classification accuracy: 0.7951219512195122
Model evaluation report:
               precision    recall  f1-score   support

         0.0       1.00      0.84      0.92        64
        21.0       0.00      0.00      0.00         0
        61.0       0.00      0.00      0.00         0
        91.0       0.78      0.84      0.81        77
        92.0       0.00      0.00      0.00         5
        93.0       0.76      0.75      0.75        59
       111.0       0.00      0.00      0.00         0

    accuracy                           0.80       205
   macro avg       0.36      0.35      0.35       205
weighted avg       0.82      0.80      0.81       205

Calculate auc: 0.8682926829268293

Classes that appear only in the predictions (21.0, 61.0 and 111.0, each with support 0) still receive report rows with zero precision and recall, which is what drags the macro average down. The confusion matrix is shown below (Figure 7):

The ROC curve is shown below (Figure 8):

Model evaluation is shown in code listing 7:

import glob
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.preprocessing import label_binarize
import os
import pickle

def model_evaluation(model_file, test_csv, predicted_csv):
    for clf, test, predicted in zip(model_file, test_csv, predicted_csv):
        # load the persisted KNN model
        with open(clf, 'rb') as pickle_file:
            clf = pickle.load(pickle_file)

        test = pd.read_csv(test, header=0, encoding='gbk')
        predicted = pd.read_csv(predicted, header=0, encoding='gbk')
        test.columns = ['time', 'Trough', 'Peak', 'IC', 'UC', 'P', 'Q', 'PF', 'label']
        print('Model classification accuracy:',
              clf.score(test.drop(['label', 'time'], axis=1), test['label']))
        print('Model evaluation report:\n',
              metrics.classification_report(test['label'], predicted))

        confusion_matrix0 = metrics.confusion_matrix(test['label'], predicted)
        confusion_matrix = pd.DataFrame(confusion_matrix0)
        class_names = list(set(test['label']))

        # plot the confusion matrix as a heatmap, class names on both axes
        tick_marks = range(len(class_names))
        sns.heatmap(confusion_matrix, annot=True, cmap='YlGnBu', fmt='g')
        plt.xticks(tick_marks, class_names)
        plt.yticks(tick_marks, class_names)
        plt.tight_layout()
        plt.tight_layout()

        plt.title('Confusion Matrix')
        plt.ylabel('True label')
        plt.xlabel('Predicted label')
        plt.show()

        # micro-averaged ROC: binarize the multiclass labels and flatten
        y_binarize = label_binarize(test['label'], classes=class_names)
        predicted = label_binarize(predicted, classes=class_names)

        fpr, tpr, thresholds = metrics.roc_curve(y_binarize.ravel(), predicted.ravel())
        auc = metrics.auc(fpr, tpr)
        print('Calculate auc:', auc)
        plt.figure(figsize=(8, 4))

        lw=2

        plt.plot(fpr, tpr, label='area = %0.2f' % auc)
        plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
        plt.fill_between(fpr, tpr, alpha=0.2, color='b')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('1-specificity')
        plt.ylabel('sensitivity')
        plt.title('ROC Curve')
        plt.legend(loc='lower right')
        plt.show()

model_evaluation(glob.glob('../tmp/*model.pkl'),
                 glob.glob('../tmp/*test.csv'),
                 glob.glob('../tmp/*predicted.csv'))

According to the analysis goals, the real-time power consumption needs to be calculated; it is the product of the power (instantaneous current times voltage) and the elapsed time. The formula is as follows:

W = P × t / 3600

where W is the real-time power consumption, in units of 0.001 kWh (i.e. 1 Wh); P is the power, in W; and t is the elapsed time, in seconds (here the one-second sampling interval).
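For example, a device drawing P = 1200 W for a single t = 1 s sample consumes W = 1200 × 1 / 3600 ≈ 0.33, i.e. about 0.33 Wh for that second (the figures are purely illustrative).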

After the calculation, the real-time power consumption obtained is shown in Table 3.

Calculate real-time power consumption as shown in code listing 8.

import glob
import pandas as pd
import os
from itertools import groupby

def cw(test_csv, predicted_csv, test_devices):
    for test, predicted, test_device in zip(test_csv, predicted_csv, test_devices):
        test = pd.read_csv(test, header=0, encoding='gbk')
        test.columns = ['time', 'Trough', 'Peak', 'IC', 'UC', 'P', 'Q', 'PF', 'label']
        test['time'] = pd.to_datetime(test['time'])
        test.index = test['time']

        predicteds = pd.read_csv(predicted, header=0, encoding='gbk')
        predicteds.columns = ['label']
        indexes = []

        class_names = list(set(test['label']))

        # collect the row indices predicted as each class
        for j in class_names:
            index = list(predicteds.index[predicteds['label'] == j])
            indexes.append(index)

        dif_indexes = []
        time_indexes = []
        info_lists = pd.DataFrame()

        for y, z in zip(indexes, class_names):
            dif_index = []
            # group consecutive indices: enumerate(y) pairs each index with
            # its position, so index - position is constant within a run
            fun = lambda x: x[1] - x[0]

            for k, g in groupby(enumerate(y), fun):
                dif_list = [j for i, j in g]
                # keep the first index of each consecutive run
                if len(dif_list) > 1:
                    scop = min(dif_list)
                else:
                    scop = dif_list[0]
                dif_index.append(scop)

            # map the kept row indices back to timestamps
            time_index = list(test.iloc[dif_index, :].index)
            time_indexes.append(time_index)
            info_list = pd.DataFrame({'time': time_index, 'model_device status': [z] * len(time_index)})

            dif_indexes.append(dif_index)
            info_lists = pd.concat([info_lists, info_list])

        test_devi = pd.read_csv(test_device, header=0, encoding='gbk')
        test_devi['time'] = pd.to_datetime(test_devi['time'])
        # P * t / 3600 with t = 1 s; the factor 100 is the unit conversion
        # used in the original code
        test_devi['Real-time power consumption'] = test_devi['P'] * 100 / 3600
        info_lists = info_lists.merge(test_devi[['time', 'Real-time power consumption']],
                                      how='inner', left_on='time', right_on='time')

        info_lists = info_lists.sort_values(by=['time'], ascending=True)
        info_lists = info_lists.drop(['time'], axis=1)
        file_path = os.path.split(test_device)[1]
        info_lists.to_csv('../tmp/' + file_path[:3] + 'status table.csv', index=False, encoding='gbk')

        print(info_lists)


cw(glob.glob('../tmp/*test.csv'),
   glob.glob('../tmp/*predicted.csv'),
   glob.glob('../tmp/Attachment 2/*Device data1.csv'))

8. Recommended reading and fan benefits

What I recommend to you today is a Python data mining book: “Python Data Mining: Introduction, Advanced and Practical Case Analysis”.

  • JD official purchase link: https://item.jd.com/13814157.html

“Python Data Mining: Introduction, Advanced and Practical Case Analysis” is a data mining book driven by real project cases. It can help readers with no Python programming or data mining background quickly master Python data mining technologies, processes and methods. In style it differs from traditional “theory plus practice” introductory books: it is based on well-known events in the data mining field, the “Teddy Cup” Data Mining Challenge (held 10 times) and the “Teddy Cup” Data Analysis Skills Competition (held 5 times, with more than 100,000 teachers and students from over 1,500 colleges and universities participating), from which 11 classic competition questions were selected. Integrating Python programming knowledge, data mining knowledge and industry knowledge, it lets readers quickly master, through practice, data mining methods in seven major industries: e-commerce, education, transportation, media, electric power, tourism and manufacturing.

This book is suitable not only for self-study by readers with no prior background but also for classroom teaching. To help readers master the content more efficiently, the book provides the following 10 additional resources:

  1. Modeling platform: Provides a one-stop big data mining modeling platform that requires no configuration and includes a large number of case projects. You can learn while practicing and say goodbye to talking on paper.
  2. Video explanation: Provide no less than 600 minutes of teaching videos related to Python programming and data mining. Learn while watching and gain experience quickly.
  3. Selected Exercises: Carefully select no less than 60 data mining exercises and provide detailed answers. Learn and practice while checking your knowledge blind spots.
  4. Author Q&A: If you have any questions during the learning process, you can use the “Tree Hole” applet to take pictures of paper books and send them to the author with one click. You can learn while asking and get twice the result with half the effort.
  5. Data files: Provides supporting data files for each case, combined with engineering practice, ready to use out of the box, enhancing practicality.
  6. Program code: Provides electronic files of the code in the book and installation packages of related tools. The code can be imported into the platform and run, and the learning effect is immediate.
  7. Teaching courseware: Provides matching PPT courseware. Teachers who use this book as a teaching material can apply to save time in preparing lessons.
  8. Model Service: Provides no less than 10 data mining models. The models provide a complete case implementation process to help improve data mining practice capabilities.
  9. Teaching platform: Teddy Technology provides a one-stop data-based teaching platform for the additional resources provided in this book, with detailed operation guides. You can learn and practice while reading, saving time.
  10. Employment recommendation: Provides a large number of employment recommendation opportunities through cooperation with 1,500+ companies, including well-known companies such as Huawei, JD.com, and Midea.

By studying this book, readers can understand the principles of data mining, quickly master the relevant operations of big data technology, and lay a good technical foundation for subsequent data analysis, data mining, and deep learning practices and competitions.

  • Three books will be given away this time.
  • Activity time: until 2023-10-31.
  • How to participate: follow the blogger, then like, favorite and comment below this article.
  • Two copies will be raffled among all fans; the third goes to a reader who has purchased the column. Column purchasers can contact me via private message; first come, first served, only one copy available.