Flush Supermind Quantitative Trading Financial Analysis Modeling — Data Mining Special Topic: Classification and Prediction

Data mining refers to the process of using algorithms to search for information hidden in large amounts of data. This article mainly shows you how to use the KNN algorithm for data classification and data prediction.

Part 6: Data Mining Topic: Classification and Prediction


Basic Concept

Data classification means gathering together information that has the same content or nature, or that needs to be managed in a unified way, and separating out information that is different and needs to be managed separately; the relationships among the resulting collections are then determined, forming an organized classification system.

Let’s take the simplest example. We divide K-lines into three categories: “rising” (an increase of more than 1%), “falling” (a decrease of more than 1%), and “oscillating” (a rise or fall of no more than 1%). Take the K-lines of the CSI 300 Index for the past 250 trading days and classify them:

In [2]:

# Fetch the last 250 daily K-lines of the CSI 300 Index up to 2018-01-25 (supermind API)
p = get_price('000300.SH', None, '20180125', '1d', ['quote_rate'], True, None, 250, is_panel=1)
n1 = len(p[p['quote_rate'] > 1])    # rising: gain of more than 1%
n2 = len(p[p['quote_rate'] < -1])   # falling: loss of more than 1%
n3 = len(p) - n1 - n2               # oscillating: everything else
print('Rising K-line: {}, falling K-line: {}, oscillating K-line: {}'.format(n1, n2, n3))
Rising K-line: 18, falling K-line: 12, oscillating K-line: 220

Data prediction means using the model derived from data classification to predict unknown variables, typically unknown future values.

Suppose we use historical data to compute, for each trading day, the average proportion of rising stocks and the average volume ratio of all stocks. These two indicators can be used to define whether the market rose, fell, or oscillated that day. The characteristics are as follows:

Market                                         Average proportion of rising stocks    Average volume ratio of all stocks
Rise (CSI 300 rose more than 1%)               60%                                    1.2
Fall (CSI 300 fell more than 1%)               40%                                    0.8
Oscillation (CSI 300 moved no more than 1%)    50%                                    1.0

The detailed data distribution is as follows:

In [4]:

import numpy as np
import matplotlib.pyplot as plt

# Enlarge the default figure size
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 15
fig_size[1] = 10

# Class 1 (rise): high proportion of rising stocks, high volume ratio
x1 = np.random.uniform(0.5, 0.7, 200)
y1 = np.random.uniform(1, 1.4, 200)
plt.scatter(x1, y1, c='b', marker='s', s=50, alpha=0.8)

# Class 2 (fall): low proportion of rising stocks, low volume ratio
x2 = np.random.uniform(0.3, 0.5, 200)
y2 = np.random.uniform(0.6, 1, 200)
plt.scatter(x2, y2, c='r', marker='^', s=50, alpha=0.8)

# Class 3 (oscillation): intermediate values on both axes
x3 = np.random.uniform(0.4, 0.6, 200)
y3 = np.random.uniform(0.8, 1.2, 200)
plt.scatter(x3, y3, c='g', s=50, alpha=0.8)

Out[4]:

<matplotlib.collections.PathCollection at 0x7fc0d7a4d908>

Suppose that on the trading day after we complete the classification model, you are given only the values of the two indicators: the proportion of rising stocks is 55%, and the average volume ratio of all stocks is 1. Did the CSI 300 Index rise, fall, or oscillate that day?

KNN algorithm

To solve the classification and prediction problems above, we need a data mining algorithm. This article mainly introduces the KNN (k-nearest neighbors) algorithm, a simple and commonly used algorithm for data classification and prediction.

KNN algorithm core function: neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1)

The n_neighbors parameter is the number of neighboring samples consulted when classifying; the default is 5. The remaining parameters can be left at their defaults; you can study their principles on your own after class.
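For example, a classifier that consults the 10 nearest neighbors instead of the default 5 can be created as follows (a minimal sketch; clf_demo is our own illustrative name):

In [ ]:

from sklearn import neighbors

# consult the 10 nearest neighbors; all other parameters keep their defaults
clf_demo = neighbors.KNeighborsClassifier(n_neighbors=10)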

Before we begin, we first import the modules needed for the KNN algorithm:

In [20]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import neighbors

We now return to the detailed data distribution from earlier and regenerate it:

In [61]:

fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 15
fig_size[1] = 10

# Class 1 (rise)
x1 = np.random.uniform(0.5, 0.7, 200)
y1 = np.random.uniform(1, 1.4, 200)
plt.scatter(x1, y1, c='b', marker='s', s=50, alpha=0.8)

# Class 2 (fall)
x2 = np.random.uniform(0.3, 0.5, 200)
y2 = np.random.uniform(0.6, 1, 200)
plt.scatter(x2, y2, c='r', marker='^', s=50, alpha=0.8)

# Class 3 (oscillation)
x3 = np.random.uniform(0.4, 0.6, 200)
y3 = np.random.uniform(0.8, 1.2, 200)
plt.scatter(x3, y3, c='g', s=50, alpha=0.8)

Out[61]:

<matplotlib.collections.PathCollection at 0x7fc0700275c0>

After creating a KNeighborsClassifier instance, you need to give it data to learn to classify. This is done with the fit() function: neighbors.KNeighborsClassifier.fit(X, y)

X is a list or array of samples. Each sample can be a tuple, a list, or a one-dimensional array, but note that all samples must have the same length. X can also be understood as a matrix in which each row holds the feature data of one sample.

y is a list or array of the same length as X, where each element is the classification label of the corresponding sample in X.

After fit() is executed on the training data, the KNeighborsClassifier builds a kd_tree or ball_tree from it, depending on the algorithm parameter passed to the constructor.
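As a minimal sketch of the fit() interface (the sample values and names here are made up for illustration):

In [ ]:

from sklearn import neighbors

# three samples with two features each, and their class labels
X = [(0.60, 1.2), (0.40, 0.8), (0.50, 1.0)]
y = [1, 2, 3]

clf_mini = neighbors.KNeighborsClassifier(n_neighbors=1)
clf_mini.fit(X, y)   # builds the internal search structure from the training data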

In what follows, class 1 is rise, class 2 is fall, and class 3 is oscillation.

In [62]:

# Combine the three classes in order: class 1, class 2, class 3
x_val = np.concatenate((x1, x2, x3))
y_val = np.concatenate((y1, y2, y3))

# Scale each axis by its range so the two axes are comparable
x_diff = max(x_val) - min(x_val)
y_diff = max(y_val) - min(y_val)
x_normalized = x_val / x_diff
y_normalized = y_val / y_diff
xy_normalized = list(zip(x_normalized, y_normalized))

# Labels: samples 0-199 are class 1, 200-399 are class 2, 400-599 are class 3
labels = [1]*200 + [2]*200 + [3]*200

clf = neighbors.KNeighborsClassifier(30)   # vote among the 30 nearest neighbors
clf.fit(xy_normalized, labels)

Out[62]:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=30, p=2,
           weights='uniform')

Note: During classification we standardized the data, mainly so that the value ranges of the X-axis and Y-axis are comparable. For example, the proportion of rising stocks is always at most 1, while the volume ratio can even exceed 2; left unscaled, the two axes would have inconsistent spans and the distance calculation would be dominated by one of them, so we standardize.

Specific standardization formula: scaled value = raw value / (maximum − minimum). (This scales each axis by its range; unlike full min–max normalization it does not subtract the minimum, but it is enough to make the axes comparable.)
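A minimal sketch of this scaling on a toy array (the values are made up for illustration):

In [ ]:

import numpy as np

v = np.array([0.30, 0.55, 0.70])
v_scaled = v / (v.max() - v.min())   # scaled value = raw value / (max - min)
print(v_scaled)                      # the scaled values span exactly 1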

At this point we have completed data classification.

Let’s try data prediction!

1. Obtain neighbor data

We try to retrieve the 10 training points nearest to the data point (0.55, 1); note that the query point must be scaled in the same way as the training data.

In [63]:

# kneighbors(query, n_neighbors, return_distance=False) returns neighbor indices
nearests = clf.kneighbors([(0.55/x_diff, 1/y_diff)], 10, False)
nearests

Out[63]:

array([[565, 551, 557, 506, 174, 474, 503, 58, 401, 561]])

The result is the indices of the 10 nearest training samples: [565, 551, 557, 506, 174, 474, 503, 58, 401, 561]. Given how the training set was assembled, indices 0–199 belong to class 1, 200–399 to class 2, and 400–599 to class 3.
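A small sketch mapping those indices back to class labels, reusing the labels list defined above:

In [ ]:

# map neighbor indices back to their class labels
neighbor_labels = np.array(labels)[nearests[0]]
neighbor_labels   # for the indices shown above: eight 3s (oscillation) and two 1s (rise)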

2. Data prediction:

Continuing with the earlier question: the proportion of rising stocks is 55% and the average volume ratio of all stocks is 1. Did the CSI 300 Index rise, fall, or oscillate that day?

In [64]:

prediction = clf.predict([(0.55/x_diff, 1/y_diff)])
prediction

Out[64]:

array([3])

The result shows that the CSI 300 Index fell into the third category that day: oscillation.
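For readability, the numeric class can be mapped to a name (a small sketch; the class_names dictionary is our own addition):

In [ ]:

class_names = {1: 'rise', 2: 'fall', 3: 'oscillation'}
print(class_names[prediction[0]])   # -> oscillation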

3. Data prediction probability

In [65]:

prediction_proba = clf.predict_proba([(0.55/x_diff, 1/y_diff)])
prediction_proba

Out[65]:

array([[ 0.33333333, 0. , 0.66666667]])

The result shows that the probabilities of (0.55, 1) belonging to the first and third categories are 33.33% and 66.67% respectively.

Let’s also try the prediction probability for the point (0.9, 1.2):

In [66]:

prediction_proba = clf.predict_proba([(0.9/x_diff, 1.2/y_diff)])
prediction_proba

Out[66]:

array([[ 1., 0., 0.]])

The result shows that the probability that (0.9,1.2) belongs to the first category is 100%.
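With weights='uniform', predict_proba is simply the fraction of the k nearest neighbors carrying each label. A sketch verifying this by hand for the first query point, reusing the objects defined above:

In [ ]:

# count the votes among the 30 nearest neighbors and normalize
idx = clf.kneighbors([(0.55/x_diff, 1/y_diff)], 30, False)[0]
votes = np.bincount(np.array(labels)[idx], minlength=4)[1:]   # counts for classes 1, 2, 3
votes / votes.sum()   # should match the predict_proba output above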

4. Scoring the KNN model's learning accuracy

We now build a second group of samples from the same distributions as the training data: the original group is used for KNN learning and the new group for testing.

In [67]:

# Build the test group: fresh samples from the same three distributions
x1_test = np.random.uniform(0.5, 0.7, 100)
y1_test = np.random.uniform(1, 1.4, 100)

x2_test = np.random.uniform(0.3, 0.5, 100)
y2_test = np.random.uniform(0.6, 1, 100)

x3_test = np.random.uniform(0.4, 0.6, 100)
y3_test = np.random.uniform(0.8, 1.2, 100)

# Scale the test data with the same x_diff and y_diff used for the training data
x_test = np.concatenate((x1_test, x2_test, x3_test)) / x_diff
y_test = np.concatenate((y1_test, y2_test, y3_test)) / y_diff
xy_test_normalized = list(zip(x_test, y_test))

labels_test = [1]*100 + [2]*100 + [3]*100

In [68]:

#Test group scoring
score = clf.score(xy_test_normalized, labels_test)
score

Out[68]:

0.85333333333333339

In [69]:

#Modify the n_neighbors parameter and continue to observe
clf1 = neighbors.KNeighborsClassifier(1)
clf1.fit(xy_normalized, labels)
clf1.score(xy_test_normalized, labels_test)

Out[69]:

0.81333333333333335

The results show that the model's accuracy on the test set is about 85%. Reducing n_neighbors from 30 to 1 only lowers the score from about 85% to about 81%, so accuracy is not very sensitive to this parameter and the model is relatively stable.
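To probe this stability more systematically, one can sweep several values of n_neighbors (a sketch reusing the training and test sets defined above; exact scores vary from run to run because the data are random):

In [ ]:

# score the classifier on the test set for several neighbor counts
for k in (1, 5, 10, 30, 50):
    clf_k = neighbors.KNeighborsClassifier(n_neighbors=k)
    clf_k.fit(xy_normalized, labels)
    print(k, clf_k.score(xy_test_normalized, labels_test))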

For full details of the strategy above, visit the supermind quantitative trading official website: Financial Analysis Modeling — Data Mining Special Topic: Classification and Prediction