Use the DolphinDB TopN function to explore efficient alpha factors

DolphinDB already has a lot of window calculation functions, such as the sliding window calculation of the m series, the cumulative window calculation of the cum series, and the time window sliding calculation of the tm series. However, all such functions perform indicator calculations on all records in the window, which inevitably contains a lot of noise.

According to DolphinDB’s feedback from users in the financial field, the calculation indicators obtained by filtering the records in the window through transaction volume information have higher quality, and trading strategies based on this can bring more Alpha. At the same time, users also reported that it takes too long to calculate indicators filtered by additional information through custom functions.

To this end, DolphinDB has launched the TopN series of built-in functions, covering the mTopN series, tmTopN series, and cumTopN series, which greatly improves performance through incremental calculations. Both DolphinDB versions 2.00.10 and 1.30.22 support the features covered in this tutorial.

This tutorial will introduce the TopN series of functions from the following perspectives:

1. Problems, calculation rules and implementations that TopN series functions can solve

1.1 Pain points solved by TopN series functions

In group calculation, it is often not necessary to calculate all the data in the group, and sometimes it is only necessary to calculate the topn elements in the group. For example, if you want to count the average Volume value of each stock whose Volume value is less than the first quartile every day, you cannot use percentile The code> function is directly evaluated, because the sql statement first performs conditional filtering and then group calculation, so the top element of each group cannot be taken out.

Assume that the existing table tb is a data table of two stocks. After grouping by stock and date, use percentile to filter and count the percentile of each stock factor in the top 40% every day. Average of >value.

trade_date=sort(take(2017.01.11..2017.01.12,20))
secu_code=take(`600570`600000,20)
value=1..20
tb=table(trade_date, secu_code, value) 

It is common practice to write custom aggregate functions to achieve:

defg percentile_40(x){
ret = NULL
y = percentile(x,40)
cc = sum(x<y)
if (cc > 0){
ret = sum(iif(x<y,x,0))\cc
}
return return
}
select percentile_40(value) as factor_value from tb group by trade_date, secu_code

#output
trade_date secu_code factor_value
---------- --------- ------------
2017.01.11 600000 3
2017.01.11 600570 2
2017.01.12 600000 13
2017.01.12 600570 12 

In response to similar situations, DolphinDB introduced the topN series of functions to solve such problems. The topN function can sort the data according to a certain index, and calculate the top top elements or the top elements after sorting, which can significantly improve the readability of the script.

select aggrTopN(avg, funcArgs=value, sortingCol=value, top=0.4, ascending=true) as factor_value from tb group by trade_date, secu_code

#output
trade_date secu_code factor_value
---------- --------- ------------
2017.01.11 600000 3
2017.01.11 600570 2
2017.01.12 600000 13
2017.01.12 600570 12

In addition to the aggregation function aggrTopN, DolphinDB also introduces mTopN, cumTopN, tmTopN. For the sorting column of topN, we usually use indicators such as rise and fall, trading volume, etc. as sorting keys. In the following chapters 2 and 3, we will give specific examples of the application scenarios of mTopN, cumTopN, and tmTopN.

1.2 Calculation rules and implementation of TopN series functions

The TopN series functions can basically be summarized into the following types: mfuncTopN, tmfuncTopN and cumfuncTopN. In addition, there is a corresponding higher-order function aggrTopN.

Take mfuncTopN(X, S, window, top, [ascending=true], [tiesMethod]) as an example, the calculation process is:

  1. Sort X stably according to S (the sorting method is specified by ascending, and the default true is ascending order)
  2. Take the top top elements of the sorted result for calculation. If there are multiple elements with the same value that cannot all enter the front top, the selection rules for these values can be set through the tiesMethod parameter. To put it simply, when tiesMethod is latest, the latest data will be selected first, when it is oldest, the oldest data will be selected first, and when it is all, all data will be selected first. For specific usage rules, please refer to the TopN user manual (TopN series – DolphinDB 2.0 documentation ).

There are currently 36 computing functions that support TopN:

Basic functions mTopN series tmTopN series< /strong> cumTopN series
sum msumTopN tmsumTopN cumsumTopN
avg mavgTopN tmavgTopN cumavgTopN
std mstdTopN tmstdTopN cumstdTopN
stdp mstdpTopN tmstdpTopN cumstdpTopN
var mvarTopN tmvarTopN cumvarTopN
varp mvarpTopN tmvarpTopN cumvarpTopN
skew mskewTopN tmskewTopN cumskewTopN
kurtosis mkurtosisTopN tmkurtosisTopN cumkurtosisTopN
beta mbetaTopN tmbetaTopN cumbetaTopN
corr mcorrTopN tmcorrTopN cumcorrTopN
covar mcovarTopN tmcovarTopN cumcovarTopN
wsum mwsumTopN tmwsumTopN cumwsumTopN

There are many scenarios for the TopN calculation of the sliding window, such as calculating the average price of the 3 records with the largest trading volume in each stock window, or calculating the average energy consumption of the 5 records with the highest temperature in the window of each instrument, etc. The specific application scenarios will be introduced in the application scenarios of the mTopN and tmTopN series in the second chapter of this tutorial.

The TopN calculation of the cumulative window involves more calculations, such as calculating the sum of the trading volume of the days with the largest historical increase for each stock, etc. The specific application scenarios will be in the application scenarios of the cumTopN series in the third chapter of this tutorial.

In addition, if users want to customize the calculation rules after taking out TopN, DolphinDB has also opened the aggrTopN function, and users can use this function with other high-level functions such as moving, tmoving, etc., implement your own mfuncTopN function. The scenarios and applications of this part will be expanded in the implementation method of customizing TopN in Chapter 4.

Finally, the TopN series functions all support the integration of batch and stream. In Chapter 5, TopN’s batch-stream integration scenario, we will introduce how to apply the TopN series factors in batch computing to the real-time computing stream engine.

2. Application scenarios of mTopN and tmTopN series

The TopN calculation of the sliding window is based on the records in the window, by sorting first, taking the TopN, and then doing the aggregation calculation. This chapter is illustrated with some examples that will be used in practical scenarios.

The data used in this chapter is simulated by the following script:

n = 5*121
timeVector = 2023.04.30T09:30:00.000 + 0..120 * 60000
tradingTime = take(timeVector,n)
windCode = stretch(format(600001..600005, "000000") + ".SH", n)
open = (20.00 + 0.01*0..120) join (30.00-0.01*0..120) join (40.00 + 0.01*0..120) join (50.00-0.01*0..120) join (60.00 + 0.01 *0..120)
high = (20.50 + 0.01*0..120) join (31.00-0.01*0..120) join (40.80 + 0.01*0..120) join (50.90-0.01*0..120) join (60.70 + 0.01 *0..120)
low = (19.50 + 0.01*0..120) join (29.00-0.01*0..120) join (39.00 + 0.01*0..120) join (48.00-0.01*0..120) join (59.00 + 0.01 *0..120)
close = (20.00 + 0.01*0..120) join (30.00-0.01*0..120) join (40.00 + 0.01*0..120) join (50.00-0.01*0..120) join (60.00 + 0.01 *0..120)
volume= 10000 + take(-100..100,n)
t = table(tradingTime, windCode, open, high, low, close, volume)

2.1 mTopN application scenarios

mTopN can slide according to the number of records, and the window length can be calculated according to the number of records or the length of time. For specific sliding rules, please refer to the window calculation overview tutorial: window_cal.md Zhejiang Zhiyu Technology Co., Ltd./Tutorials_CN – Gitee

For the minute table t in the simulated data, to get the average price of the 3 records with the largest trading volume in every 5 records of each stock, you can use the mavgTopN function with context by solves:

select windCode, tradingTime, mavgTopN(close, volume, 5, 3, false) as mavgTop3Close from t context by windCode
//output
windCode tradingTime mavgTop3Close
--------- ----------------------- ------------------
600001.SH 2023.04.30T09:30:00.000 20
600001.SH 2023.04.30T09:31:00.000 20.005
600001.SH 2023.04.30T09:32:00.000 20.01
600001.SH 2023.04.30T09:33:00.000 20.02
600001.SH 2023.04.30T09:34:00.000 20.03
...

Generally speaking, when writing a factor, the calculated value does not have a unit. For example, the average increase of the ten records with the largest trading volume in a window size of 100 minutes is calculated:

select windCode, tradingTime, mavgTopN(ratios(close), volume, 100, 10, false) as mavgTop10RatioClose from t context by windCode, date(tradingTime)
//output
windCode tradingTime mavgTop10RatioClose
--------- -----------------------------------------
600001.SH 2023.04.30T09:30:00.000
600001.SH 2023.04.30T09:31:00.000 1.0005
600001.SH 2023.04.30T09:32:00.000 1.000499875
600001.SH 2023.04.30T09:33:00.000 1.000499750
600001.SH 2023.04.30T09:34:00.000 1.000499625
...

In addition to monocular TopN, DolphinDB also supports binocular operations, such as beta, corr, covar and other TopN operators. For example, to calculate the correlation between the two factors of the 3 records with the highest prices in every 5 records of each stock, you can use the mcorrTopN function with context by to solve:

select windCode, tradingTime, mcorrTopN(low, close * volume, log(ratios(close)), 5, 3, false) as mcorrTop3CloseVol from t context by windCode, date(tradingTime)
//output (because it is analog data, it is distorted)
windCode tradingTime mcorrTop3CloseVol
--------- -----------------------------------------
600001.SH 2023.04.30T09:30:00.000
600001.SH 2023.04.30T09:31:00.000
600001.SH 2023.04.30T09:32:00.000 1.00000
600001.SH 2023.04.30T09:33:00.000 0.99999
600001.SH 2023.04.30T09:34:00.000 0.99999

2.2 tmTopN application scenarios

The window calculation of mTopN is to take the window size according to the number of records. The window size of the tmTopN function can be a time interval, either 5 minutes or 20 seconds, and so on. In the tmTopN series of functions, top can be a floating-point number between 0 and 1, indicating a percentage, such as 0.2, which means selecting 20% of the records in the window.

For example, in the processing of the above data, the calculation time window is 3 minutes, and the average increase of the two records with the largest transaction volume is:

select windCode, tradingTime, tmavgTopN(tradingTime, ratios(close), volume, 3m, 2, false) as tmavgTop2RatioClose from t context by windCode, date(tradingTime)
//output
windCode tradingTime tmavgTop2RatioClose
--------- -----------------------------------------
600001.SH 2023.04.30T09:30:00.000
600001.SH 2023.04.30T09:31:00.000 1.0005
600001.SH 2023.04.30T09:32:00.000 1.000499875
600001.SH 2023.04.30T09:33:00.000 1.000499625
600001.SH 2023.04.30T09:34:00.000 1.0004993758
...

The TopN series of functions also realize the calculation of the correlation function, so for example, the close and volume correlations of the three records with the largest transaction volume within the 5-minute window period can be calculated:

select windCode, tradingTime, tmcorrTopN(tradingTime, close, volume, volume, 5m, 3, false) as tmavgTop3CorrCloseVolume from t context by windCode, date(tradingTime)

//output
windCode tradingTime tmavgTop3CorrCloseVolume
--------- ----------------------- ------------------ ------
600001.SH 2023.04.30T09:30:00.000
600001.SH 2023.04.30T09:31:00.000 0.999999990552169
600001.SH 2023.04.30T09:32:00.000 1.000000001625267
600001.SH 2023.04.30T09:33:00.000 1.000000006877599
600001.SH 2023.04.30T09:34:00.000 1.000000012129931
...

3. Application scenarios of cumTopN series

Unlike the sliding calculations of mTopN and tmTopN, the cumTopN series calculates the cumulative window. For example, to calculate the sum of the transaction volume of the three records with the largest increase in history, you can use the cumsumTopN function to achieve:

select windCode, tradingTime, cumsumTopN(volume, ratios(close), 3, false) as cumsumTop3Volume from t context by windCode

//output
windCode tradingTime cumsumTop3Volume
--------- -----------------------------------------
600001.SH 2023.04.30T09:30:00.000
600001.SH 2023.04.30T09:31:00.000 9901
600001.SH 2023.04.30T09:32:00.000 19803
600001.SH 2023.04.30T09:33:00.000 29706
600001.SH 2023.04.30T09:34:00.000 29706
...

As can be seen from the above script, the parameters inside the function can also be calculation results, not necessarily field names, showing the high flexibility of TopN function usage.

4. Custom TopN function

The first chapter lists the built-in TopN series functions currently supported by DolphinDB. Considering the diverse computing needs of users, DolphinDB also opens up a custom interface. Users can customize the aggregation calculation method of the TopN function according to actual needs.

Note: The optimized built-in TopN family of functions outperforms custom TopN functions.

4.1 Implementation method of custom TopN

The higher-order function corresponding to the TopN series is aggrTopN(func, funcArgs, sortingCol, top, [ascending=true]), and the func parameter can accept aggregate functions. Therefore, if the user wants to use a complex custom aggregate function to calculate TopN, such as calculating the average value of OHLC in the top 40% of records with the largest transaction volume, it can be implemented like this:

//Use defg to customize the aggregation function avgOHLC to define the aggregation behavior after removing TopN
defg avgOHLC(price){
return avg(price)
}

select aggrTopN(avgOHLC, funcArgs =(open + high + low + close) , sortingCol=volume, top=0.4, ascending=true) as factor_value from t group by windCode

//output
windCode factor_value
--------- -------------------
600001.SH 80.94
600002.SH 116.56583
600003.SH 160.74
600004.SH 196.40
600005.SH 240.76167

Similarly, the window functions of mTopN and tmTopN series also support customization. For example, the above avgOHLC function, users can apply it to the moving, tmoving functions:

select windCode, tradingTime, moving(aggrTopN{avgOHLC,,,0.4,true},(open + high + low + close, volume),10,1) as udfmTopN from t context by windCode

//output
windCode tradingTime udfmTopN
--------- ----------------------- ------------------
600001.SH 2023.04.30T09:30:00.000 80
...
600001.SH 2023.04.30T09:39:00.000 80.06
600001.SH 2023.04.30T09:40:00.000 80.10
600001.SH 2023.04.30T09:41:00.000 80.14
...

4.2 Performance comparison between built-in TopN and custom TopN

The m series functions are optimized for their respective computing scenarios, so they have better performance than moving higher-order functions. The above calculation can also be done with the built-in function mavgTopN:

select windCode, tradingTime, mavgTopN(open + high + low + close, volume,10,4) as udfmTopN from t context by windCode

//output
windCode tradingTime udfmTopN
--------- ----------------------- ------------------
600001.SH 2023.04.30T09:30:00.000 80
...
600001.SH 2023.04.30T09:39:00.000 80.06
600001.SH 2023.04.30T09:40:00.000 80.10
600001.SH 2023.04.30T09:41:00.000 80.14
...

In terms of performance, the calculation efficiency of the built-in TopN series functions is much higher than that of the custom TopN functions:

  • Test Equipment
    • CPU: Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
    • Operating system: 64-bit CentOS Linux 7 (Core)
    • DolphinDB deployment type: version 2.00.10, single node
//moving + aggrTopN + avgOHLC
timer(10000) select windCode, tradingTime, moving(aggrTopN{avgOHLC,,,0.4,true},(open + high + low + close, volume),10,1) as udfmTopN from t context by windCode
//8394 ms

//moving + aggrTopN + avg
timer(10000) select windCode, tradingTime, moving(aggrTopN{avg,,,0.4,true},(open + high + low + close, volume),10,1) as udfmTopN from t context by windCode
//6812 ms

//mavgTopN
timer(10000) select windCode, tradingTime, mavgTopN(open + high + low + close, volume,10,4) as udfmTopN from t context by windCode
//1394 ms
TopN function Time-consuming to run 10,000 times
moving + aggrTopN + avgOHLC 8,394 ms
moving + aggrTopN + avg 6,812 ms
mavgTopN 1,394 ms

5. TopN’s integrated batch-flow scenario

In the previous chapters, all the scenarios of offline computing were introduced. The TopN series of functions currently also support stream computing scenarios. For example, calculate the average increase of the ten records with the largest trading volume in the window size of 100 minutes:

//Functions in offline computing can be directly filled in the flow engine
factor = <mavgTopN(ratios(close), volume, 100, 10, false)>

//Define the input and output table structure
share streamTable(1:0, `tradingTime`windCode`open`high`low`close`volume, [TIMESTAMP,STRING,DOUBLE,DOUBLE,DOUBLE,DOUBLE,INT]) as tickStream
result = table(1000:0, `windCode`tradingTime`mavgTop10RatioClose, [STRING,TIMESTAMP,DOUBLE])

//Define stream computing engine
rse = createReactiveStateEngine(name="streamTopN", metrics =[<tradingTime>, factor], dummyTable=t, outputTable=result, keyColumn="windCode")

//Subscribe to the flow table and play back data
subscribeTable(tableName=`tickStream, actionName="mTopN", handler=tableInsert{rse})
replay(inputTables=t.copy().sortBy!(`tradingTime), outputTables=tickStream, timeColumn=`tradingTime)

//Query stream calculation results:
select * from result

//If you want to call the above script repeatedly, first run the following three lines of script to clear the flow table subscription
unsubscribeTable(tableName=`tickStream, actionName="mTopN")
dropStreamEngine(`streamTopN)
undef(`tickStream, SHARED)

6. Use of DECIMAL

Versions 2.00.10 and later support the use of DECIMAL types (including DECIMAL32, 64, and 128 types) in TopN series functions. Not only the sort field S, but also the computed fields X and Y can use the DECIMAL type. If the calculated field is of DECIMAL type, the three functions msumTopN, tmsumTopN and cumsumTopN return the DECIMAL type, and the results of other functions still return the DOUBLE type.

While var, varp, std, stp, corr, covar , beta, wsum and other 8 basic functions correspond to the TopN series of functions whose final results are of DOUBLE type, but when the calculated columns X and Y are of DECIMAL type, The intermediate result of the calculation is represented by DECIMAL128, which can avoid the loss of precision. Of course, using DECIMAL for intermediate results of calculations also has disadvantages. Firstly, the calculation will take longer, and secondly, overflow may occur. The current version does not throw an exception when calculation overflow occurs, which requires special attention.

The number of significant digits of DECIMAL128 is 38 (including the digits before and after the decimal point). For example, to calculate the variance of price data, the data of 18.2345 has a total of 6 valid data, and 12 digits after square, if there are 100 million numbers (8 digits), the total effective digits are 20 digits, far lower than 38, not overflow will appear. But if there are too many decimal places, for example, the number of decimal places increases from 4 to 15, so that the effective number of price data is 17, and 34 after squared, 10,000 numbers may exceed the 38 effective numbers of DECIMAL128. When encountering data with a particularly large number of decimal places, either convert it to DOUBLE type for processing, or use functions such as decimal32, decimal64, and decimal128 to first reduce the data precision.

7. Summary

This tutorial introduces the calculation rules of TopN, and the application of 36 built-in mTopN, tmTopN, cumTopN series functions in offline computing and stream computing scenarios, and specifically introduces the implementation method of custom TopN.

In future versions, DolphinDB will support more built-in TopN calculation functions.