Non-intrusive load detection and decomposition: a new perspective on power data mining

Electric power data mining

  • Overview
  • Case background
  • Analysis goals
  • Analysis process
  • Data preparation
    • Data exploration
    • Missing value handling
  • Attribute construction
    • Device data
    • Cycle data
  • Model training
  • Performance metrics
  • Recommended reading

Overview

Abstract: Based on the collected power data, this case explores the current, voltage, and power of each electrical device, analyzes each device's actual power consumption, and thereby provides power companies with a reference for formulating energy strategies. For more details, please refer to the book “Python Data Mining: Introduction, Advanced and Practical Case Analysis”.

Case background

To better monitor the energy consumption of electrical equipment, power sub-metering technology was developed. For power companies, sub-metering is important for accurately forecasting power loads, scientifically formulating grid dispatch plans, and improving the stability and reliability of the power system. For users, sub-metering helps them understand how their electrical equipment is used, raises their awareness of energy conservation, and promotes scientific and rational electricity use.

Analysis goals

Based on the background and business requirements of non-intrusive load detection and decomposition in power data mining, the goals of this case are as follows:

  • Analyze the operating attributes of each electrical device.

  • Build a device identification attribute library.

  • Use a K-nearest-neighbors (KNN) model to “decompose” the independent power consumption data of each electrical device from the whole-line data (a minimal sketch of the idea follows this list).
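
As a preview of the third goal, the following sketch shows the basic idea on made-up toy values: a K-nearest-neighbors classifier is trained on labelled per-sample electrical features and then assigns each sample of the whole-line data to the nearest device profile. The column names and numbers here are illustrative placeholders only; the full case implementation appears later in the Model training section.

# A minimal sketch of the decomposition idea (not the case's full pipeline).
# Column names ('P', 'Q', 'PF', 'label') and values are toy placeholders.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

train = pd.DataFrame({
    'P':  [52, 55, 980, 1010, 300, 310],   # active power
    'Q':  [5, 6, 120, 130, 40, 42],        # reactive power
    'PF': [99, 98, 85, 84, 90, 91],        # power factor
    'label': ['YD1', 'YD1', 'YD2', 'YD2', 'YD3', 'YD3'],
})

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(train[['P', 'Q', 'PF']], train['label'])

# Each whole-line sample is attributed to the device whose profile it is closest to.
mains = pd.DataFrame({'P': [54, 995], 'Q': [5, 125], 'PF': [99, 85]})
print(clf.predict(mains))  # e.g. ['YD1' 'YD2']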

Analysis process

Data preparation

Data exploration

The power data mining analysis in this case does not involve the operation record data, so the data obtained here are mainly the device data, cycle data, and harmonic data. Because there are many data tables, each with many attributes, the data need to be explored and analyzed after they are obtained. During data exploration, the values of the different attributes of each device were visualized according to the characteristics of the raw data; some of the results are shown in Figures 1 to 3.

Figure 1. Reactive power and total reactive power

Figure 2. Current trace

Figure 3. Voltage trace

Data visualization code example:

import pandas as pd
import matplotlib.pyplot as plt
import os

filename = os.listdir('../data/Attachment 1')  # Get the names of all files in the folder
n_filename = len(filename)

# Read the data of each device and draw the trace diagram of each attribute
def fun(a):
    save_name = ['YD1', 'YD10', 'YD11', 'YD2', 'YD3', 'YD4',
                 'YD5', 'YD6', 'YD7', 'YD8', 'YD9']
    plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels normally
    plt.rcParams['axes.unicode_minus'] = False  # Display minus signs normally
    for i in range(a):
        Sb = pd.read_excel('../data/Attachment 1/' + filename[i], 'Device Data', index_col=None)
        Xb = pd.read_excel('../data/Attachment 1/' + filename[i], 'Harmonic Data', index_col=None)
        Zb = pd.read_excel('../data/Attachment 1/' + filename[i], 'Cycle Data', index_col=None)
        # Current trace
        plt.plot(Sb['IC'])
        plt.title(save_name[i] + '-IC')
        plt.ylabel('Current (0.001A)')
        plt.show()
        # Voltage trace
        plt.plot(Sb['UC'])
        plt.title(save_name[i] + '-UC')
        plt.ylabel('Voltage (0.1V)')
        plt.show()
        # Active power and total active power
        plt.plot(Sb[['PC', 'P']])
        plt.title(save_name[i] + '-P')
        plt.ylabel('Active power (0.0001kW)')
        plt.show()
        # Reactive power and total reactive power
        plt.plot(Sb[['QC', 'Q']])
        plt.title(save_name[i] + '-Q')
        plt.ylabel('Reactive power (0.0001kVar)')
        plt.show()
        # Power factor and total power factor
        plt.plot(Sb[['PFC', 'PF']])
        plt.title(save_name[i] + '-PF')
        plt.ylabel('Power factor (%)')
        plt.show()
        # Harmonic voltage
        plt.plot(Xb.loc[:, 'UC02':].T)
        plt.title(save_name[i] + '-harmonic voltage')
        plt.show()
        # Cycle data
        plt.plot(Zb.loc[:, 'IC001':].T)
        plt.title(save_name[i] + '-cycle data')
        plt.show()

fun(n_filename)

Missing value handling

Data exploration revealed that the “time” attribute contains missing values in some of the data, and these must be handled. Because the length of the missing time period differs from record to record, different treatments are applied: within each device's data, records belonging to longer missing time periods are deleted, while shorter missing periods are filled with the previous value.
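
To make this rule concrete, the following is a minimal sketch on a toy per-second series; the 10-second threshold, the column name IC, and the timestamps are assumptions made only for illustration, and the case's own file-by-file implementation appears after the file extraction step below.

import pandas as pd

# Toy per-second current series with a short gap (3 missing seconds) and a long gap (30 missing seconds).
t = pd.to_datetime(['2018-01-27 00:00:00', '2018-01-27 00:00:01',
                    '2018-01-27 00:00:05', '2018-01-27 00:00:36'])
ic = pd.Series([1154, 1157, 1160, 1163], index=t, name='IC')

gap_limit = pd.Timedelta(seconds=10)  # assumed threshold separating "short" from "long" gaps
# Split the series wherever the jump to the next sample exceeds the threshold, then
# expand each piece to a full one-second grid and forward-fill inside the piece.
breaks = ic.index.to_series().diff() > gap_limit
pieces = []
for _, piece in ic.groupby(breaks.cumsum()):
    grid = pd.date_range(piece.index.min(), piece.index.max(), freq='S')
    pieces.append(piece.reindex(grid).ffill())
cleaned = pd.concat(pieces)  # the long gap is simply left out (deleted), short gaps are filled
print(cleaned)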

Before handling missing values, the device data table, cycle data table, harmonic data table, and operation record table of every device in the training data, as well as the device data table, cycle data table, and harmonic data table of every device in the test data, need to be extracted into independent data files. Some of the generated files are shown in Figure 4.

Extract data file code example:

# Convert xlsx files into CSV files
import glob
import pandas as pd
import math

def file_transform(xls):
    print('A total of %s xlsx files found' % len(glob.glob(xls)))
    print('Processing......')
    for file in glob.glob(xls):  # Loop over the xlsx files in the folder
        combine1 = pd.read_excel(file, index_col=0, sheet_name=None)
        for key in combine1:
            # file[8:-5] strips the '../data/' prefix and the '.xlsx' suffix
            combine1[key].to_csv('../tmp/' + file[8:-5] + key + '.csv', encoding='utf-8')
    print('Processing completed')

xls_list = ['../data/Attachment 1/*.xlsx', '../data/Attachment 2/*.xlsx']
file_transform(xls_list[0])  # Process training data
file_transform(xls_list[1])  # Process test data

After the data files have been extracted, missing values in the extracted files are handled. Some of the files generated after processing are shown in Figure 5.


Missing value handling code example:

# Delete the data of longer missing time periods in each data file and fill shorter gaps with the previous value
def missing_data(evi):
    print('A total of %s CSV files found' % len(glob.glob(evi)))
    for j in glob.glob(evi):
        fr = pd.read_csv(j, header=0, encoding='gbk')
        fr['time'] = pd.to_datetime(fr['time'])
        helper = pd.DataFrame({'time': pd.date_range(fr['time'].min(), fr['time'].max(), freq='S')})
        fr = pd.merge(fr, helper, on='time', how='outer').sort_values('time')
        fr = fr.reset_index(drop=True)

        frame = pd.DataFrame()
        for g in range(0, len(list(fr['time'])) - 1):
            # Rows inside a longer gap (this second and the next are both empty) are dropped
            if math.isnan(fr.iloc[:, 1][g + 1]) and math.isnan(fr.iloc[:, 1][g]):
                continue
            else:
                scop = pd.Series(fr.loc[g])
                frame = pd.concat([frame, scop], axis=1)
        frame = pd.DataFrame(frame.values.T, index=frame.columns, columns=frame.index)
        frames = frame.fillna(method='ffill')  # Shorter gaps take the previous value
        frames.to_csv(j[:-4] + '1.csv', index=False, encoding='utf-8')
    print('Processing completed')

evi_list = ['../tmp/Attachment 1/*Data.csv', '../tmp/Attachment 2/*Data.csv']
missing_data(evi_list[0])  # Process training data
missing_data(evi_list[1])  # Process test data

Attribute construction

Although the attributes were initially processed during data preparation, too many attributes were introduced and they carry overlapping information. To retain the important attributes and build an accurate yet simple model, the original attributes need to be further screened and new attributes constructed.

Device data

Data exploration showed that the reactive power, total reactive power, active power, total active power, power factor, and total power factor differ greatly between devices and are therefore highly discriminative. This case selects these six quantities as the device-data attributes for building the discriminant attribute library (a small illustrative sketch of this selection follows).
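
As an illustration of this selection (the case folds it into the model training code later on), the sketch below keeps only the six chosen attributes from a device-data table; the column names follow those used in the exploration code above, and the values are made up.

import pandas as pd

# Toy device-data rows; PC/P, QC/Q, PFC/PF follow the column names used in the exploration code.
device = pd.DataFrame({
    'IC': [120, 125], 'UC': [2203, 2205],  # current and voltage: not selected
    'PC': [260, 262], 'P': [265, 268],     # active power and total active power
    'QC': [30, 31],   'Q': [32, 33],       # reactive power and total reactive power
    'PFC': [98, 97],  'PF': [97, 96],      # power factor and total power factor
})
discriminant_attributes = ['QC', 'Q', 'PC', 'P', 'PFC', 'PF']
print(device[discriminant_attributes])     # attributes kept for the discriminant attribute library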

After missing values have been handled, each device's data has gone from a single table to several tables, so tables of the same type need to be merged into one, for example merging the device data tables of all devices into a single table. In addition, because filling with the previous value produces identical records, duplicate records also need to be handled. The data table generated after processing is shown in Table 1.

Code example for merging and deduplicating device data:

import glob
import pandas as pd
import os

# Merge the data of the 11 devices and handle duplicate records produced by the merge
def combined_equipment(csv_name):
    # Merge
    print('A total of %s CSV files found' % len(glob.glob(csv_name)))
    print('Processing......')
    for i in glob.glob(csv_name):  # Loop over the CSV files in the folder
        fr = open(i, 'rb').read()
        file_path = os.path.split(i)
        with open(file_path[0] + '/device_combine.csv', 'ab') as f:
            f.write(fr)
    print('Merger completed!')
    # Remove duplicates
    df = pd.read_csv(file_path[0] + '/device_combine.csv', header=None, encoding='utf-8')
    datalist = df.drop_duplicates()
    datalist.to_csv(file_path[0] + '/device_combine.csv', index=False, header=0)
    print('Duplicate removal completed')

csv_list = ['../tmp/Attachment 1/*Device Data1.csv', '../tmp/Attachment 2/*Device Data1.csv']
combined_equipment(csv_list[0])  # Process training data
combined_equipment(csv_list[1])  # Process test data

Cycle data

Data exploration showed that the current in the cycle data fluctuates strongly over time, and the fluctuation patterns in the current line graphs of different devices differ markedly. This case therefore selects the current peak and trough as the cycle-data attributes for the discriminant attribute library.

Because the peak and trough attributes do not exist in the original cycle data, they have to be constructed: in the code below, each cycle's current samples are clustered into two groups with K-Means, and the lower and higher cluster centers are taken as the trough and the peak, respectively. The data table generated by this construction is shown in Table 2.


Code example for constructing cycle-data attributes:

# Obtain the current peaks and troughs in the cycle data as attribute parameters
import glob
import pandas as pd
from sklearn.cluster import KMeans
import os

def cycle(cycle_file):
    for file in glob.glob(cycle_file):
        cycle_YD = pd.read_csv(file, header=0, encoding='utf-8')
        cycle_YD1 = cycle_YD.iloc[:, 0:128]
        models = []
        for types in range(0, len(cycle_YD1)):
            model = KMeans(n_clusters=2, random_state=10)
            model.fit(pd.DataFrame(cycle_YD1.iloc[types, 1:]))  # All columns except time
            models.append(model)

        # Take the two cluster centers of each cycle as its trough and peak
        mean = pd.DataFrame()
        for model in models:
            r = pd.DataFrame(model.cluster_centers_)  # Cluster centers
            r = r.sort_values(axis=0, ascending=True, by=[0])
            mean = pd.concat([mean, r.reset_index(drop=True)], axis=1)
        mean = pd.DataFrame(mean.values.T, index=mean.columns, columns=mean.index)
        mean.columns = ['Trough', 'Peak']
        mean.index = list(cycle_YD['time'])
        mean.to_csv(file[:-9] + 'Trough Peak.csv', index=False, encoding='gbk')

cycle_file = ['../tmp/Attachment 1/*Cycle Data1.csv', '../tmp/Attachment 2/*Cycle Data1.csv']
cycle(cycle_file[0])  # Process training data
cycle(cycle_file[1])  # Process test data

# Merge the trough and peak files of the cycles
def merge_cycle(cycles_file):
    means = pd.DataFrame()
    for files in glob.glob(cycles_file):
        mean0 = pd.read_csv(files, header=0, encoding='gbk')
        means = pd.concat([means, mean0])
    file_path = os.path.split(glob.glob(cycles_file)[0])
    means.to_csv(file_path[0] + '/zuhe.csv', index=False, encoding='gbk')
    print('Merger completed')

cycles_file = ['../tmp/Attachment 1/*Trough Peak.csv', '../tmp/Attachment 2/*Trough Peak.csv']
merge_cycle(cycles_file[0])  # Training data
merge_cycle(cycles_file[1])  # Test data

Model training

For device type identification, the K-nearest-neighbors model is used: the model is trained on the attribute library constructed above and then applied to identify the unknown devices in the test data. The code for building the discriminant model and identifying the device type is shown below.

Code example for building the discriminant model and identifying the device type:

import glob
import pandas as pd
from sklearn import neighbors
import pickle
import os

# Model training
def model(test_files, test_devices):
    # Training set
    zuhe = pd.read_csv('../tmp/Attachment 1/zuhe.csv', header=0, encoding='gbk')
    device_combine = pd.read_csv('../tmp/Attachment 1/device_combine.csv', header=0, encoding='gbk')
    train = pd.concat([zuhe, device_combine], axis=1)
    train.index = train['time'].tolist()  # Set the "time" column as the index
    train = train.drop(['PC', 'QC', 'PFC', 'time'], axis=1)
    train.to_csv('../tmp/' + 'train.csv', index=False, encoding='gbk')
    # Test set
    for test_file, test_device in zip(test_files, test_devices):
        test_bofeng = pd.read_csv(test_file, header=0, encoding='gbk')
        test_devi = pd.read_csv(test_device, header=0, encoding='gbk')
        test = pd.concat([test_bofeng, test_devi], axis=1)
        test.index = test['time'].tolist()  # Set the "time" column as the index
        test = test.drop(['PC', 'QC', 'PFC', 'time'], axis=1)

        # K nearest neighbors
        clf = neighbors.KNeighborsClassifier(n_neighbors=6, algorithm='auto')
        clf.fit(train.drop(['label'], axis=1), train['label'])
        predicted = clf.predict(test.drop(['label'], axis=1))
        predicted = pd.DataFrame(predicted)
        file_path = os.path.split(test_file)[1]
        test.to_csv('../tmp/' + file_path[:3] + 'test.csv', encoding='gbk')
        predicted.to_csv('../tmp/' + file_path[:3] + 'predicted.csv', index=False, encoding='gbk')
        with open('../tmp/' + file_path[:3] + 'model.pkl', 'ab') as pickle_file:
            pickle.dump(clf, pickle_file)
        print(clf)

model(glob.glob('../tmp/Attachment 2/*Trough Peak.csv'),
      glob.glob('../tmp/Attachment 2/*Device Data1.csv'))

Performance metrics

Based on the device identification results obtained above, the model is evaluated. The results are as follows.

The confusion matrix is shown in Figure 7:

The ROC curve is shown in Figure 8:

Model evaluation code example:

import glob
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.preprocessing import label_binarize
import os
import pickle

# Model evaluation
def model_evaluation(model_file, test_csv, predicted_csv):
    for clf, test, predicted in zip(model_file, test_csv, predicted_csv):
        with open(clf, 'rb') as pickle_file:
            clf = pickle.load(pickle_file)
        test = pd.read_csv(test, header=0, encoding='gbk')
        predicted = pd.read_csv(predicted, header=0, encoding='gbk')
        test.columns = ['time', 'Trough', 'Peak', 'IC', 'UC', 'P', 'Q', 'PF', 'label']
        print('Model classification accuracy:', clf.score(test.drop(['label', 'time'], axis=1), test['label']))
        print('Model evaluation report:\n', metrics.classification_report(test['label'], predicted))

        confusion_matrix0 = metrics.confusion_matrix(test['label'], predicted)
        confusion_matrix = pd.DataFrame(confusion_matrix0)
        class_names = list(set(test['label']))

        tick_marks = range(len(class_names))
        sns.heatmap(confusion_matrix, annot=True, cmap='YlGnBu', fmt='g')
        plt.xticks(tick_marks, class_names)
        plt.yticks(tick_marks, class_names)
        plt.tight_layout()
        plt.title('Confusion Matrix')
        plt.ylabel('True label')
        plt.xlabel('Predicted label')
        plt.show()

        # Micro-averaged ROC curve over the binarized labels
        y_binarize = label_binarize(test['label'], classes=class_names)
        predicted = label_binarize(predicted, classes=class_names)
        fpr, tpr, thresholds = metrics.roc_curve(y_binarize.ravel(), predicted.ravel())
        auc = metrics.auc(fpr, tpr)
        print('AUC:', auc)
        # Plot
        plt.figure(figsize=(8, 4))
        lw = 2
        plt.plot(fpr, tpr, label='area = %0.2f' % auc)
        plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
        plt.fill_between(fpr, tpr, alpha=0.2, color='b')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('1-specificity')
        plt.ylabel('Sensitivity')
        plt.title('ROC Curve')
        plt.legend(loc='lower right')
        plt.show()

model_evaluation(glob.glob('../tmp/*model.pkl'),
                 glob.glob('../tmp/*test.csv'),
                 glob.glob('../tmp/*predicted.csv'))

According to the analysis goals, the real-time power consumption also needs to be calculated. Real-time power consumption is the product of instantaneous power (voltage multiplied by current) and time. The formula is as follows:

W = P × t / 3600

where W is the real-time power consumption in units of 0.001 kWh (i.e. Wh), P is the power in W, and t is the duration in seconds; for example, a device drawing 100 W for one second consumes 100 × 1 / 3600 ≈ 0.028 Wh. The real-time power consumption obtained from this calculation is shown in Table 3.


Code example for calculating real-time power consumption:

# Calculate real-time power consumption and output the status table
from itertools import groupby  # For grouping consecutive index numbers

def cw(test_csv, predicted_csv, test_devices):
    for test, predicted, test_device in zip(test_csv, predicted_csv, test_devices):
        # Split the predicted timetable
        test = pd.read_csv(test, header=0, encoding='gbk')
        test.columns = ['time', 'Trough', 'Peak', 'IC', 'UC', 'P', 'Q', 'PF', 'label']
        test['time'] = pd.to_datetime(test['time'])
        test.index = test['time']
        predicteds = pd.read_csv(predicted, header=0, encoding='gbk')
        predicteds.columns = ['label']
        indexes = []
        class_names = list(set(test['label']))
        for j in class_names:
            index = list(predicteds.index[predicteds['label'] == j])
            indexes.append(index)

        # Get the first serial number and time point of each run of consecutive predictions
        dif_indexes = []
        time_indexes = []
        info_lists = pd.DataFrame()
        for y, z in zip(indexes, class_names):
            dif_index = []
            fun = lambda x: x[1] - x[0]
            for k, g in groupby(enumerate(y), fun):
                dif_list = [j for i, j in g]  # List of consecutive numbers
                if len(dif_list) > 1:
                    scop = min(dif_list)  # Take the first of the consecutive numbers
                else:
                    scop = dif_list[0]
                dif_index.append(scop)
            time_index = list(test.iloc[dif_index, :].index)
            time_indexes.append(time_index)
            info_list = pd.DataFrame({'time': time_index, 'device status': [z] * len(time_index)})
            dif_indexes.append(dif_index)
            info_lists = pd.concat([info_lists, info_list])
        # Calculate real-time power consumption and save the status table
        test_devi = pd.read_csv(test_device, header=0, encoding='gbk')
        test_devi['time'] = pd.to_datetime(test_devi['time'])
        test_devi['Real-time power consumption'] = test_devi['P'] * 100 / 3600
        info_lists = info_lists.merge(test_devi[['time', 'Real-time power consumption']],
                                      how='inner', left_on='time', right_on='time')
        info_lists = info_lists.sort_values(by=['time'], ascending=True)
        info_lists = info_lists.drop(['time'], axis=1)
        file_path = os.path.split(test_device)[1]
        info_lists.to_csv('../tmp/' + file_path[:3] + 'status table.csv', index=False, encoding='gbk')
        print(info_lists)

cw(glob.glob('../tmp/*test.csv'),
   glob.glob('../tmp/*predicted.csv'),
   glob.glob('../tmp/Attachment 2/*Device Data1.csv'))

Recommended reading


Genuine link: https://item.jd.com/13814157.html

“Python Data Mining: Introduction, Advanced and Practical Case Analysis” is a data mining book driven by real project cases. It helps readers with no Python programming or data mining background quickly master the processes and methods of Python data mining. Its writing style differs from traditional “theory plus practice” introductory books: it is built around well-known events in the data mining field, the “Teddy Cup” Data Mining Challenge (held for 10 years) and the “Teddy Cup” Data Analysis Skills Competition (held 5 times), in which more than 100,000 teachers and students from more than 1,500 colleges and universities have participated. Eleven classic competition problems were selected, integrating Python programming knowledge, data mining knowledge, and industry knowledge, so that readers can master, through practice, data mining methods for seven major industries: e-commerce, education, transportation, media, electric power, tourism, and manufacturing.

This book is suitable both for self-study by readers starting from zero and for classroom teaching. To help readers master its content more efficiently, the book provides the following 10 additional resources:
(1) Modeling platform: a one-stop big data mining modeling platform that requires no configuration and contains a large number of case projects, so you can practice while you learn instead of just reading about it.
(2) Video explanations: no less than 600 minutes of teaching videos on Python programming and data mining, so you can learn while watching and gain experience quickly.
(3) Selected exercises: no less than 60 carefully selected data mining exercises with detailed answers, so you can check for knowledge blind spots while practicing.
(4) Author Q&A: if questions arise while studying, the “Tree Hole” applet lets you photograph the printed book and send the question to the author with one click, so you can learn while asking.
(5) Data files: supporting data files for each case, aligned with engineering practice and ready to use out of the box, enhancing practicality.
(6) Program code: electronic files of the code in the book and installation packages for the related tools; the code can be imported into the platform and run, so the learning effect is immediate.
(7) Teaching courseware: supporting PPT courseware; teachers who adopt the book as a textbook can apply for it to save lesson-preparation time.
(8) Model services: no less than 10 data mining models, each with a complete case implementation process, to help improve practical data mining skills.
(9) Teaching platform: Teddy Technology provides a one-stop data-oriented teaching platform for the book's additional resources, with detailed operation guides, so you can practice while reading and save time.
(10) Employment recommendations: a large number of job referral opportunities through cooperation with more than 1,500 companies, including well-known companies such as Huawei, JD.com, and Midea.

By studying this book, readers can understand the principles of data mining, quickly master the relevant operations of big data technology, and lay a good technical foundation for subsequent data analysis, data mining, and deep learning practices and competitions.