Flush Supermind quantitative trading financial analysis modeling – data processing: removing extreme values, standardization, neutralization


Part 5: Data processing topics: removing extreme values, standardization, and neutralization

Introduction: General data preprocessing commonly involves three operations: removing extreme values, standardization, and neutralization. This article walks through each of these common data processing operations.

1. Removing extreme values

When analyzing the year-on-year growth rate of listed companies' quarterly net profit, we are often disturbed by a handful of extreme data points. For example, Jiangxi Changyun's year-on-year net profit growth in the third quarter of 2017 was as high as 32836.04%, while for most companies the corresponding figure is less than 1% of that value. Removing extreme values is therefore critical: it eliminates interfering data points and improves the reliability of conclusions drawn from the data.

The general method of removing extreme values is to determine upper and lower limits for the indicator, then clip any data point beyond those limits to the limit itself. There are three common standards for setting the limits: the MAD method, the 3σ method, and the percentile method.
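All three methods share the same final clipping step, which numpy provides directly via np.clip. A minimal sketch with made-up limits (the values here are placeholders, not a recommended setting):

import numpy as np
import pandas as pd

# A toy factor series with one obvious outlier
s = pd.Series([0.8, 1.1, 0.9, 1.0, 25.0])

# Clip everything outside [lower, upper] to the limit itself;
# how the limits are chosen is what distinguishes MAD, 3σ,
# and the percentile method described below
lower, upper = 0.5, 2.0
print(np.clip(s, lower, upper))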

Taking the PE values of the CSI 300 (Shanghai-Shenzhen 300) constituent stocks as the raw data, we will illustrate the MAD, 3σ, and percentile methods.

In [1]:

import numpy as np
import pandas as pd
import math
from statsmodels import regression
import statsmodels.api as sm
import matplotlib.pyplot as plt

date='20180125'
stock=get_index_stocks('000300.SH',date)
q = query(
    valuation.symbol,
    valuation.pe_ttm,
    valuation.current_market_cap
).filter(valuation.symbol.in_(stock))
data = get_fundamentals(q,date=date)
data.columns = ['symbol','pe_ratio','market_cap']
data = data.set_index(data.symbol.values)
del data['symbol']
data['1/PE'] = 1/data['pe_ratio']
data.head()

Out[1]:

           pe_ratio    market_cap      1/PE
000001.SZ     10.59  2.402355e+11  0.094429
000002.SZ     18.61  3.903084e+11  0.053735
000008.SZ     54.53  1.661121e+10  0.018339
000060.SZ     27.51  2.613508e+10  0.036350
000063.SZ   -115.27  1.237605e+11 -0.008675
Use a plotting function to display the distribution of 1/PE:

In [2]:

fig = plt.figure(figsize = (20, 8))
ax = data['1/PE'].plot.kde(label = 'Original_PE')
ax.legend()

Out[2]:

<matplotlib.legend.Legend at 0x7f777a239940>

1. MAD method:

MAD, the median absolute deviation method, detects outliers by measuring how far each factor value lies from the median of all values, which makes it robust to extreme observations. The processing logic is:

Step 1: Find the median Xmedian of all factor values.
Step 2: Compute each value's absolute deviation from the median, |Xi − Xmedian|.
Step 3: Take the median of those absolute deviations; this is the MAD.
Step 4: Choose a parameter n, giving the reasonable range [Xmedian − n·MAD, Xmedian + n·MAD]; any value above the upper limit is set to Xmedian + n·MAD, and any value below the lower limit is set to Xmedian − n·MAD.

MAD method code implementation:

In [4]:

def filter_extreme_MAD(series, n):  # MAD: clip at median ± n*MAD
    median = series.quantile(0.5)                          # Xmedian
    new_median = ((series - median).abs()).quantile(0.50)  # MAD of the series
    max_range = median + n*new_median
    min_range = median - n*new_median
    return np.clip(series, min_range, max_range)

The result of MAD processing on the original data:

In [5]:

fig = plt.figure(figsize = (20, 8))
ax = data['1/PE'].plot.kde(label = 'Original_PE')
ax = filter_extreme_MAD(data['1/PE'],5).plot.kde(label = 'MAD')
ax.legend()

Out[5]:

<matplotlib.legend.Legend at 0x7f77219a5828>

2. 3σ method

The 3σ method, also called the standard deviation method, uses the standard deviation σ to measure the dispersion of the factor around its mean Xmean; a value's distance from the mean is judged against the bounds Xmean ± nσ.

The processing logic is similar to that of the MAD method:
Step 1: Calculate the mean Xmean and standard deviation σ of the factor values.
Step 2: Choose the parameter n (n = 3 is used here).
Step 3: Define the reasonable range [Xmean − nσ, Xmean + nσ]; any value above the upper limit is set to Xmean + nσ, and any value below the lower limit is set to Xmean − nσ.

3σ code implementation:

In [6]:

def filter_extreme_3sigma(series, n=3):  # clip at mean ± n*std
    mean = series.mean()
    std = series.std()
    max_range = mean + n*std
    min_range = mean - n*std
    return np.clip(series, min_range, max_range)

The result of 3σ processing on the original data:

In [7]:

fig = plt.figure(figsize = (20, 8))
ax = data['1/PE'].plot.kde(label = 'Original_PE')
ax = filter_extreme_3sigma(data['1/PE']).plot.kde(label = '3sigma')
ax.legend()

Out[7]:

<matplotlib.legend.Legend at 0x7f7721a12ac8>

3. Percentile method:

Sort the factor values in ascending order and clip values whose rank percentile lies above an upper threshold or below a lower threshold, in the same manner as the MAD and 3σ methods. A common choice is 2.5% and 97.5%; the implementation below uses 10% and 90% as its defaults.

Percentile method code implementation:

In [8]:

def filter_extreme_percentile(series, low=0.10, high=0.90):  # percentile method
    series = series.sort_values()
    q = series.quantile([low, high])  # factor values at the two percentiles
    return np.clip(series, q.iloc[0], q.iloc[1])

fig = plt.figure(figsize = (20, 8))
ax = data['1/PE'].plot.kde(label = 'Original_PE')
ax = filter_extreme_percentile(data['1/PE']).plot.kde(label = 'Percentile')
ax.legend()

Out[8]:

<matplotlib.legend.Legend at 0x7f770c024ba8>

2. Standardization

Continuing with deeper stock data analysis, suppose we want to buy stocks with both a higher year-on-year quarterly net profit growth rate and a higher dividend yield. If we simply add the two indicators and pick the stocks with the largest sums, we face a serious problem: the two indicators live on different scales. A dividend yield of 10% is very hard for a listed company to achieve, and dividend yields are generally below 5%, whereas net profit growth rates far exceed 5%. Simple addition and selection would therefore drown out the dividend-yield indicator.

In this case, data standardization can solve the problem.
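To see why, here is a toy illustration with made-up numbers for three hypothetical stocks, using the z-score standardization introduced next: adding the raw indicators lets the growth rate dominate, while adding the z-scored columns weights both indicators comparably.

import pandas as pd

# Made-up data: dividend yields of a few percent vs. growth rates of tens of percent
df = pd.DataFrame({'div_yield': [0.04, 0.01, 0.03],
                   'profit_growth': [0.20, 0.90, 0.10]},
                  index=['A', 'B', 'C'])

raw_score = df['div_yield'] + df['profit_growth']  # dominated by profit_growth
z = (df - df.mean()) / df.std()                    # z-score each column
fair_score = z['div_yield'] + z['profit_growth']   # both indicators weigh in comparably
print(raw_score, fair_score, sep='\n\n')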

Standardization has a range of meanings in statistics; the z-score method is generally used. The processed data become dimensionless, allowing different indicators to be compared with and regressed against each other.

Introduction to z-score method

Standardized value = (original value – mean of all values in a single indicator) / standard deviation of all values in a single indicator

Z-score method code implementation:

In [9]:

def standardize_series(series):  # z-score: subtract the mean, divide by the std
    std = series.std()
    mean = series.mean()
    return (series - mean)/std

fig = plt.figure(figsize = (20, 8))
new = filter_extreme_3sigma(data['1/PE'])
ax = standardize_series(new).plot.kde(label = 'standard_1')
ax.legend()

Out[9]:

<matplotlib.legend.Legend at 0x7f770c02de48>

An alternative is to standardize the factor's rank instead of its raw value, which makes the result robust to the shape of the original distribution:

In [10]:

standard_2 = standardize_series(new.rank())
standard_2.head()

Out[10]:

000001.SZ    1.458269
000002.SZ    0.847295
000008.SZ   -1.135490
000060.SZ    0.028820
000063.SZ   -1.677298
Name: 1/PE, dtype: float64

3. Neutralization

Continuing the in-depth analysis, you may notice that bank stocks have especially low P/E ratios while Internet stocks have especially high ones. If you use the P/E indicator directly, then no matter how you standardize or remove extreme values, you will only ever select bank stocks and never Internet stocks. How, then, can the cheapest stocks within the Internet industry and the most expensive stocks within the banking industry be reflected at their true relative value?

In this case, you need to perform data neutralization.

The concept of neutralization: when using a factor, we want to eliminate the influence of other factors on it, so that the selected stocks are more diversified. Standardization is used when indicators of different magnitudes must be compared with one another or the data need to be brought onto a common scale; the purpose of neutralization, by contrast, is to remove biases and unwanted exposures from a factor.

Specific method: according to most research reports, the main approach to neutralization is to use regression to obtain a version of the factor that is linearly independent of the risk factors. That is, fit a linear regression of the factor on the risk factors and take the residual as the new, neutralized factor. After this treatment, the in-sample correlation between the neutralized factor and the risk factors is exactly zero.
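A minimal, self-contained sketch of this idea on synthetic data (the variable names and numbers are illustrative, not the platform API): regress a contaminated factor on a "risk" factor and check that the residual is uncorrelated with it.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
size_factor = rng.normal(size=500)                   # risk factor, e.g. log market cap
raw_factor = 0.7*size_factor + rng.normal(size=500)  # raw factor contaminated by size

# Fit OLS of the raw factor on the risk factor and keep the residual
X = sm.add_constant(size_factor)
neutral_factor = sm.OLS(raw_factor, X).fit().resid

# In-sample correlation with the risk factor is zero up to floating point
print(np.corrcoef(neutral_factor, size_factor)[0, 1])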

Implementing neutralization in Python: we neutralize the result of applying 3σ processing to the original data.

Data collection

In [11]:

import numpy as np
import pandas as pd
import math
from statsmodels import regression
import statsmodels.api as sm
import matplotlib.pyplot as plt

date='20180125'
stock=get_index_stocks('000300.SH',date)
q = query(
    valuation.symbol,
    valuation.pe_ttm,
    valuation.current_market_cap
).filter(valuation.symbol.in_(stock))
data = get_fundamentals(q,date=date)
data.columns = ['symbol','pe_ratio','market_cap']
data = data.set_index(data.symbol.values)
del data['symbol']
data['1/PE'] = 1/data['pe_ratio']
data.head()

Out[11]:

           pe_ratio    market_cap      1/PE
000001.SZ     10.59  2.402355e+11  0.094429
000002.SZ     18.61  3.903084e+11  0.053735
000008.SZ     54.53  1.661121e+10  0.018339
000060.SZ     27.51  2.613508e+10  0.036350
000063.SZ   -115.27  1.237605e+11 -0.008675

3σ processing function

In [13]:

def filter_extreme_3sigma(series, n=3):  # clip at mean ± n*std
    mean = series.mean()
    std = series.std()
    max_range = mean + n*std
    min_range = mean - n*std
    return np.clip(series, min_range, max_range)

Industry code list

In [14]:

SHENWAN_INDUSTRY_MAP = {
        'S11' :'Agriculture, forestry, animal husbandry and fishery',
        'S21' :'Mining',
        'S22' :'Chemical Industry',
        'S23' :'Steel',
        'S24' :'Nonferrous metals',
        'S27' :'Electronic',
        'S28' :'Car',
        'S33' :'Household appliances',
        'S34' :'Food and Beverage',
        'S35' :'Textile and clothing',
        'S36' :'Light manufacturing',
        'S37' :'Medical Biology',
        'S41' :'Public Utilities',
        'S42' :'Transportation',
        'S43' :'Real Estate',
        'S45' :'Commercial Trade',
        'S46' :'Leisure Service',
        'S48' :'Bank',
        'S49' :'Non-bank finance',
        'S51' :'Comprehensive',
        'S61' :'Building Materials',
        'S62' :'Architectural Decoration',
        'S63' :'Electrical equipment',
        'S64' :'Mechanical equipment',
        'S65' :'National Defense Industry',
        'S71' :'Computer',
        'S72' :'Media',
        'S73' :'Communication'
        }

Industry dummy-variable matrix

In [15]:

def get_industry_exposure(order_book_ids):
    # Rows are Shenwan level-1 industry codes, columns are stocks
    df = pd.DataFrame(index=SHENWAN_INDUSTRY_MAP.keys(), columns=order_book_ids)
    for stk in order_book_ids:
        try:
            # get_symbol_industry (platform API) returns the stock's industry code
            df.loc[get_symbol_industry(stk).s_industryid1, stk] = 1
        except:
            continue
    return df.fillna(0)  # assign remaining NaN to 0
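Outside the platform, the same kind of exposure matrix can be built from any symbol-to-industry mapping with pd.get_dummies; a minimal sketch with a hypothetical mapping (the symbols and codes below are examples, not platform output):

import pandas as pd

# Hypothetical symbol -> Shenwan industry-code mapping
industry_of = {'000001.SZ': 'S48', '000002.SZ': 'S43', '600519.SH': 'S34'}

# One-hot encode, then transpose so rows are industries and columns are
# stocks, matching get_industry_exposure's orientation
exposure = pd.get_dummies(pd.Series(industry_of)).T.astype(int)
print(exposure)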

Neutralization function

In [16]:

# Pass in a single factor series, optionally together with total market cap
def neutralization(factor, mkt_cap=False, industry=True):
    y = factor
    if isinstance(mkt_cap, pd.Series):
        LnMktCap = mkt_cap.apply(lambda x: math.log(x))  # use log market cap
        if industry:  # industry + market cap
            dummy_industry = get_industry_exposure(factor.index)
            x = pd.concat([LnMktCap, dummy_industry.T], axis=1)
        else:  # market cap only
            x = LnMktCap
    elif industry:  # industry only
        dummy_industry = get_industry_exposure(factor.index)
        x = dummy_industry.T
    result = sm.OLS(y.astype(float), x.astype(float)).fit()
    return result.resid  # the residual is the neutralized factor

Function call

In [17]:

# Use the 3sigma method for outlier processing
no_extreme_PE = filter_extreme_3sigma(data['1/PE'])
# Industry + market-cap neutral
new_PE_all = neutralization(no_extreme_PE, data['market_cap'])
# Market-cap neutral only
new_PE_MC = neutralization(no_extreme_PE, data['market_cap'], industry=False)
# Industry neutral only
new_PE_In = neutralization(no_extreme_PE)

fig = plt.figure(figsize = (20, 8))
ax = no_extreme_PE.plot.kde(label = 'no_extreme_PE')
ax = new_PE_all.plot.kde(label = 'new_PE_all')
ax = new_PE_MC.plot.kde(label = 'new_PE_MC')
ax = new_PE_In.plot.kde(label = 'new_PE_In')
ax.legend()

Out[17]:

<matplotlib.legend.Legend at 0x7f770bf1c630>

To view the details of the strategies above, please visit the official Supermind Quantitative Trading website: Financial Analysis Modeling – Data Processing: Removing Extreme Values, Standardization, and Neutralization.
