Python + Pickle/Parquet/HDF5 vs. DolphinDB: a comparison of quantitative factor calculation performance under different file storage formats

In quantitative trading, calculating high-frequency factors from financial market L1/L2 quotations and tick-by-tick trade data is a common research requirement. As the volume of financial market data keeps growing, traditional relational databases can no longer meet the storage and query needs of such large-scale data. To cope with this challenge, some users turn to distributed file systems, store data in formats such as Pickle, Feather, Npz, HDF5, and Parquet, and combine them with Python for quantitative financial computation.

Although these storage solutions can hold massive amounts of high-frequency data, they also bring problems: data permission management is difficult, correlating different datasets is inconvenient, retrieval and querying are cumbersome, and acceptable performance often has to be bought with data redundancy. In addition, reading the data into Python for computation adds data transfer time.

To solve these problems, more and more securities firms and private equity institutions have begun to use DolphinDB, an analytical distributed time-series database. DolphinDB provides efficient data storage and computing capabilities, making factor calculation on high-frequency data more convenient and efficient.

This article compares the performance of DolphinDB's integrated in-database factor calculation against Python + various file storage formats. Through this comparison, readers can understand the advantages of integrated factor calculation in DolphinDB and make informed choices in practice.

Test environment

Software and hardware information

This test compares factor calculation implemented with Python + file storage against DolphinDB. Specifically:

  • The Python + file storage solution relies on libraries such as NumPy, pandas, the DolphinDB Python API, and multiprocessing.
  • The DolphinDB integrated factor calculation solution uses DolphinDB Server as the computing platform; this test used a single-node deployment.

The hardware and software environment information required for testing is as follows:

  • Hardware environment

    | Hardware  | Configuration                              |
    |-----------|--------------------------------------------|
    | CPU       | Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz |
    | Memory    | 128 GB                                     |
    | Hard disk | 500 GB SSD                                 |

  • Software environment

    | Software         | Version                              |
    |------------------|--------------------------------------|
    | Operating system | CentOS Linux release 7.9.2009 (Core) |
    | DolphinDB        | V2.00.9.8                            |
    | Python           | V3.7.6                               |
    | NumPy            | V1.20.2                              |
    | Pandas           | V1.3.5                               |

Test data

Level 2 market data is currently the most complete and granular trading information available in China's securities market. It covers instruments such as stocks listed on the Shanghai and Shenzhen stock exchanges, exchange-traded funds, and Shanghai and Shenzhen exchange indices.

This test uses the full-market Level 2 historical snapshot data for the trading day 2021.12.01. The data covers 26,632 instruments, approximately 31 million rows in total. The initial data is stored in DolphinDB, from which data files in Pickle, Parquet, and the other tested formats can be exported.
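For reference, such an export can be driven through the DolphinDB Python API, which reads a table into a pandas DataFrame that can then be written out in any of the tested formats. This is a minimal sketch only; the connection parameters and the database/table names are hypothetical:

import dolphindb as ddb

# connect to a local single-node server (hypothetical host/port/credentials)
s = ddb.session()
s.connect("localhost", 8848, "admin", "123456")

# pull one security's snapshot rows into pandas, then persist it as Pickle
df = s.run('select * from loadTable("dfs://snapshot", "snapshot") where SecurityID = "000155"')
df.to_pickle("/data/pickle/000155.pkl")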

The snapshot table test data has a total of 55 fields in DolphinDB. Some fields are shown below:

| No. | Field name  | Data type |
|-----|-------------|-----------|
| 1   | SecurityID  | SYMBOL    |
| 2   | DateTime    | TIMESTAMP |
| 3   | BidPrice    | DOUBLE    |
| 4   | BidOrderQty | INT       |
| 5   | ……          | ……        |

Some data examples are as follows:

| SecurityID | DateTime                | BidPrice | BidOrderQty |
|------------|-------------------------|----------|-------------|
| 000155 | 2021.12.01T09:41:00.000 | [29.3000, 29.2900, 29.2800, 29.2700, 29.2600, 29.2500, 29.2400, 29.2300, 29.2200, 29.2100] | [3700, 11000, 1400, 1700, 300, 600, 3800, 200, 600, 1700] |
| 000688 | 2021.12.01T09:40:39.000 | [13.5300, 13.5100, 13.5000, 13.4800, 13.4700, 13.4500, 13.4400, 13.4200, 13.4000, 13.3800] | [500, 1200, 102200, 5500, 700, 47000, 1000, 6500, 18400, 1000] |

Test scenario

The data used in this test is the full-market Level 2 historical snapshot data for the trading day 2021.12.01. The data in the Pickle, Parquet, Feather, and HDF5 formats is stored in groups by security code, while the Npz-format data is evenly divided into twelve groups. These storage layouts were all chosen to maximize computing performance, without regard to storage efficiency for now. In practice you can choose a different layout. For example, pandas' HDFStore() can store multiple DataFrames in a single HDF5 file, which compresses better but reduces concurrent read/write efficiency.
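The sketch below illustrates that alternative layout with pandas' HDFStore; the file name and keys are hypothetical:

import pandas as pd

# HDFStore keeps several DataFrames in one HDF5 file, keyed by path-like
# names, so many securities can share a single compressed file
demo = pd.DataFrame({"DateTime": pd.to_datetime(["2021-12-01 09:41:00"]),
                     "BidPrice1": [29.30]})
with pd.HDFStore("snapshot_20211201.h5", mode="w", complevel=9, complib="blosc") as store:
    store.put("sz/000155", demo)  # one key per security code

# reading back a single security's DataFrame
df_back = pd.read_hdf("snapshot_20211201.h5", key="sz/000155")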

Factor calculation and code implementation

This section is divided into three parts: an introduction to the high-frequency factors (the linear regression slope of the ten-level average order price and the ten-level net bid increase), the factor implementation in DolphinDB, and the factor implementation in Python.

High-frequency factors

  • Linear regression slope of the ten-level average order price

The ten-level average order price is the total order amount across the ten bid and ten offer levels divided by the total order quantity across those levels:

$$\mathrm{price}_t = \frac{\sum_{i=1}^{10} \mathrm{BidPrice}_i \times \mathrm{BidOrderQty}_i + \sum_{i=1}^{10} \mathrm{OfferPrice}_i \times \mathrm{OfferOrderQty}_i}{\sum_{i=1}^{10} \mathrm{BidOrderQty}_i + \sum_{i=1}^{10} \mathrm{OfferOrderQty}_i}$$

The factor is the slope of a sliding linear regression of this average price against time t.

  • Ten-level net bid increase

The ten-level net bid increase factor measures the overall increase of buy-side funds within the valid ten-level range, i.e., the rolling sum of all changes in bid amount:

$$\mathrm{level10\_Diff}_t = \sum_{j=t-\mathrm{lag}+1}^{t} \Delta\mathrm{Amt}_j, \qquad \Delta\mathrm{Amt}_j = \sum_{i\,\in\,\text{valid levels}} \left( \mathrm{BidPrice}_{i,j}\,\mathrm{BidOrderQty}_{i,j} - \mathrm{BidPrice}_{i,j-1}\,\mathrm{BidOrderQty}_{i,j-1} \right)$$

"Within the valid ten-level range" means that price levels no longer inside the current ten levels are not considered; only levels whose price $p$ falls in the following range are counted:

$$p_{\min} \le p \le p_{\max}, \quad p_{\min} = \max\!\left(\min_i \mathrm{BidPrice}_{i,t},\; \min_i \mathrm{BidPrice}_{i,t-1}\right), \quad p_{\max} = \max\!\left(\max_i \mathrm{BidPrice}_{i,t},\; \max_i \mathrm{BidPrice}_{i,t-1}\right)$$

Factor implementation in DolphinDB

  • Linear regression slope of the ten-level average order price

Calculating the linear regression slope of the ten-level average order price requires four fields: OfferOrderQty, BidOrderQty, OfferPrice, and BidPrice, all of the array vector data type, representing the ten-level bid/offer prices and order quantities. The built-in aggregate function rowSum improves the efficiency of the factor calculation. The linearTimeTrend function returns the intercept and slope of a sliding linear regression of the factor value against time t. The expression price.ffill().linearTimeTrend(lag1-1).at(1).nullFill(0).mavg(lag2, 1).nullFill(0) therefore extracts the slope (element 1 of the returned pair) of the linear regression of the ten-level average order price against time t, then applies a moving average.

@state
def level10_InferPriceTrend(bid, ask, bidQty, askQty, lag1=60, lag2=20){
    // ten-level average order price: total order amount / total order quantity
    inferPrice = (rowSum(bid*bidQty) + rowSum(ask*askQty))\(rowSum(bidQty) + rowSum(askQty))
    // treat rows without a valid best bid/offer as missing
    price = iif(bid[0] <= 0 or ask[0] <= 0, NULL, inferPrice)
    // slope of the sliding linear regression against time t, then a moving average
    return price.ffill().linearTimeTrend(lag1-1).at(1).nullFill(0).mavg(lag2, 1).nullFill(0)
}
  • Ten-level net bid increase

Calculating the ten-level net bid increase requires the BidOrderQty and BidPrice fields, both of the array vector data type. First, the row alignment function rowAlign aligns the current ten bid prices with the previous ten bid prices; then rowAt and nullFill fetch the order quantities and prices at the aligned levels; finally the total change is summed over the window.

@state
def level10_Diff(price, qty, buy, lag=20){
    prevPrice = price.prev()
    // align the current ten price levels with the previous tick's ten levels
    left, right = rowAlign(price, prevPrice, how=iif(buy, "bid", "ask"))
    // change in order quantity at each aligned level
    qtyDiff = (qty.rowAt(left).nullFill(0) - qty.prev().rowAt(right).nullFill(0))
    // total change in order amount across the aligned levels
    amtDiff = rowSum(nullFill(price.rowAt(left), prevPrice.rowAt(right)) * qtyDiff)
    // rolling sum over the window
    return msum(amtDiff, lag, 1).nullFill(0)
}

Factor implementation in Python

  • Linear regression slope of the ten-level average order price

import numpy as np
import pandas as pd


def level10_InferPriceTrend(df, lag1=60, lag2=20):
    '''
    Linear regression slope of the ten-level average order price
    :param df: snapshot data of one security
    :param lag1: linear regression window
    :param lag2: moving average window
    :return: DataFrame with SecurityID, DateTime and the factor value
    '''
    temp = df[["SecurityID", "DateTime"]].copy()
    temp["amount"] = 0.
    temp["qty"] = 0.
    for i in range(10):
        temp[f"bidAmt{i + 1}"] = df[f"BidPrice{i + 1}"].fillna(0.) * df[f"BidOrderQty{i + 1}"].fillna(0.)
        temp[f"askAmt{i + 1}"] = df[f"OfferPrice{i + 1}"].fillna(0.) * df[f"OfferOrderQty{i + 1}"].fillna(0.)
        temp["amount"] += temp[f"bidAmt{i + 1}"] + temp[f"askAmt{i + 1}"]
        temp["qty"] += df[f"BidOrderQty{i + 1}"].fillna(0.) + df[f"OfferOrderQty{i + 1}"].fillna(0.)
    temp["inferprice"] = temp["amount"] / temp["qty"]
    temp.loc[(temp.bidAmt1 <= 0) | (temp.askAmt1 <= 0), "inferprice"] = np.nan
    temp["inferprice"] = temp["inferprice"].fillna(method='ffill').fillna(0.)

    def f(x):
        # ordinary least squares slope of x against time 1..n
        n = len(x)
        x = np.array(x)
        y = np.arange(1, n + 1)
        return (n*sum(x*y) - sum(x)*sum(y)) / (n*sum(y*y) - sum(y)*sum(y))

    temp["inferprice"] = temp.groupby("SecurityID")["inferprice"].apply(lambda x: x.rolling(lag1 - 1, 1).apply(f))
    temp["inferprice"] = temp["inferprice"].fillna(0)
    temp["inferprice"] = temp.groupby("SecurityID")["inferprice"].apply(lambda x: x.rolling(lag2, 1).mean())
    return temp[["SecurityID", "DateTime", "inferprice"]].fillna(0)
  • Ten-level net bid increase

def level10_Diff(df, lag=20):
    '''
    Ten-level net bid increase
    :param df: snapshot data of one security
    :param lag: rolling sum window
    :return: DataFrame with SecurityID, DateTime and the factor value
    '''
    temp = df[["SecurityID", "DateTime"]].copy()

    for i in range(10):
        temp[f"bid{i + 1}"] = df[f"BidPrice{i + 1}"].fillna(0)
        temp[f"bidAmt{i + 1}"] = df[f"BidOrderQty{i + 1}"].fillna(0) * df[f"BidPrice{i + 1}"].fillna(0)
        temp[f"prevbid{i + 1}"] = temp[f"bid{i + 1}"].shift(1).fillna(0)
        temp[f"prevbidAmt{i + 1}"] = temp[f"bidAmt{i + 1}"].shift(1).fillna(0)

    # valid price range [pmin, pmax]: levels outside it are ignored
    temp["bidMin"] = temp[[f"bid{i + 1}" for i in range(10)]].min(axis=1)
    temp["bidMax"] = temp[[f"bid{i + 1}" for i in range(10)]].max(axis=1)
    temp["prevbidMin"] = temp[[f"prevbid{i + 1}" for i in range(10)]].min(axis=1)
    temp["prevbidMax"] = temp[[f"prevbid{i + 1}" for i in range(10)]].max(axis=1)
    temp["pmin"] = temp[["bidMin", "prevbidMin"]].max(axis=1)
    temp["pmax"] = temp[["bidMax", "prevbidMax"]].max(axis=1)

    temp["amtDiff"] = 0.0
    for i in range(10):
        temp["amtDiff"] += temp[f"bidAmt{i + 1}"]*((temp[f"bid{i + 1}"] >= temp["pmin"]) & (temp[f"bid{i + 1}"] <= temp["pmax"])).astype(int) - \
                           temp[f"prevbidAmt{i + 1}"]*((temp[f"prevbid{i + 1}"] >= temp["pmin"]) & (temp[f"prevbid{i + 1}"] <= temp["pmax"])).astype(int)
    temp["amtDiff"] = temp.groupby("SecurityID")["amtDiff"].apply(lambda x: x.rolling(lag, 1).sum())
    return temp[["SecurityID", "DateTime", "amtDiff"]].fillna(0)

Comparison of calculation results

Computing performance comparison

Level 2 market snapshot data exceeds 10 GB per trading day, so quantitative engineers care greatly about high-frequency factor calculation performance on this data. Building on the foregoing, this section compares the calculation performance of the two factors, the linear regression slope of the ten-level average order price and the ten-level net bid increase, from two perspectives: (1) different degrees of parallelism and (2) different file storage formats. The test data covers 26,632 instruments, about 31 million rows for the day. We varied the degree of parallelism, measured the factor calculation time for each storage format under different numbers of CPU cores, and compared it against DolphinDB's integrated in-database calculation. All tests were performed after clearing the operating system cache. The results are as follows:

  • Table 1: 16-core computing performance comparison

    | Storage method   | Ten-level average price regression slope: time (s) / speedup (×) | Ten-level net bid increase: time (s) / speedup (×) |
    |------------------|-------------------------------------------------------------------|-----------------------------------------------------|
    | DolphinDB        | 2.4                                                               | 2.3                                                 |
    | Python + Pickle  | 254.3 / 104.8                                                     | 105.9 / 45.4                                        |
    | Python + Parquet | 309.4 / 127.5                                                     | 147.9 / 63.4                                        |
    | Python + Feather | 291.1 / 120.0                                                     | 130.6 / 56.0                                        |
    | Python + HDF5    | 281.4 / 116.0                                                     | 132.0 / 56.6                                        |
    | Python + Npz     | 346.7 / 142.9                                                     | 175.2 / 75.1                                        |
  • Table 2: 8-core computing performance comparison

    | Storage method   | Ten-level average price regression slope: time (s) / speedup (×) | Ten-level net bid increase: time (s) / speedup (×) |
    |------------------|-------------------------------------------------------------------|-----------------------------------------------------|
    | DolphinDB        | 4.5                                                               | 4.4                                                 |
    | Python + Pickle  | 489.3 / 102.8                                                     | 231.6 / 60.0                                        |
    | Python + Parquet | 638.7 / 143.0                                                     | 296.3 / 67.9                                        |
    | Python + Feather | 594.2 / 133.1                                                     | 261.9 / 60.0                                        |
    | Python + HDF5    | 563.4 / 126.2                                                     | 294.6 / 67.5                                        |
    | Python + Npz     | 648.7 / 145.3                                                     | 334.3 / 76.6                                        |
  • Table 3: 4-core computing performance comparison

    | Storage method   | Ten-level average price regression slope: time (s) / speedup (×) | Ten-level net bid increase: time (s) / speedup (×) |
    |------------------|-------------------------------------------------------------------|-----------------------------------------------------|
    | DolphinDB        | 6.5                                                               | 6.8                                                 |
    | Python + Pickle  | 1014.9 / 155.6                                                    | 363.9 / 53.6                                        |
    | Python + Parquet | 1134.9 / 174.0                                                    | 560.9 / 82.6                                        |
    | Python + Feather | 1040.6 / 159.6                                                    | 479.7 / 70.7                                        |
    | Python + HDF5    | 1021.2 / 156.6                                                    | 437.4 / 64.4                                        |
    | Python + Npz     | 1260.3 / 193.2                                                    | 638.8 / 94.1                                        |
  • Table 4: 1-core computing performance comparison

    | Storage method   | Ten-level average price regression slope: time (s) / speedup (×) | Ten-level net bid increase: time (s) / speedup (×) |
    |------------------|-------------------------------------------------------------------|-----------------------------------------------------|
    | DolphinDB        | 21.8                                                              | 22.0                                                |
    | Python + Pickle  | 3638.2 / 166.5                                                    | 1461.8 / 66.3                                       |
    | Python + Parquet | 4450.4 / 203.7                                                    | 1759.3 / 79.8                                       |
    | Python + Feather | 3994.0 / 182.8                                                    | 1773.7 / 80.5                                       |
    | Python + HDF5    | 3996.9 / 182.9                                                    | 1774.5 / 80.5                                       |
    | Python + Npz     | 5031.4 / 230.3                                                    | 2437.3 / 110.6                                      |

The comparison shows that in this test, across different CPU core counts and file storage formats, DolphinDB's integrated calculation of the linear regression slope of the ten-level average order price outperformed Python + the various file storage formats by up to nearly 200 times, and by about 100 times on average. Considering the characteristics of the two approaches, the reasons are roughly as follows:

  • The read efficiency of DolphinDB's own storage engine is far better than Python reading the various file formats stored in the common ways described above.
  • DolphinDB's built-in functions such as rowSum and linearTimeTrend avoid the cumbersome for-loop style of factor calculation, saving a large amount of computation time.

Although reading data files in formats such as Pickle and Parquet can technically be sped up through targeted redundant storage or other format-specific optimizations, this brings extra hardware resource costs as well as data usage and management costs. By comparison, DolphinDB's own storage engine is more efficient and simpler to use. Hence, over the complete read-plus-compute workflow, DolphinDB's integrated in-database factor calculation is far faster than factor calculation with Python + the various file storage formats.

The computing performance comparison also reveals the following:

  • In terms of code implementation, DolphinDB's in-database SQL makes factor calculation calls and parallel invocation easier to implement.
  • In terms of parallel computing, DolphinDB automatically uses the currently available CPU resources, while Python scripts require explicit parallel scheduling code, which in turn makes the degree of concurrency easier to control; a minimal sketch of such scheduling follows this list.
  • In terms of calculation speed, DolphinDB's in-database calculation is between 50 and 200 times faster than Python + the various file storage formats.
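The sketch below illustrates the kind of parallel scheduling code the Python solution needs. It is a minimal example only: the file paths are hypothetical placeholders, and level10_Diff is the factor function defined earlier in this article.

import multiprocessing as mp
import pandas as pd

def calc_one_file(path):
    # one Pickle file per security group, matching this test's storage layout
    df = pd.read_pickle(path)
    return level10_Diff(df, lag=20)

if __name__ == "__main__":
    # hypothetical paths: one file per security code
    paths = [f"/data/pickle/{code}.pkl" for code in ["000155", "000688"]]
    # the pool size gives explicit control over the degree of concurrency
    with mp.Pool(processes=8) as pool:
        results = pool.map(calc_one_file, paths)
    res_combined = pd.concat(results, ignore_index=True)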

Comparison of calculation accuracy

The previous section compared the computing performance of the different approaches: DolphinDB's in-database factor calculation is far faster than factor calculation with Python + the various file storage formats. But fast calculation is only meaningful if the results are correct and consistent. We therefore loaded the factor results from Python + each file storage format and from DolphinDB into pandas DataFrames and compared them. Taking Python + Pickle as an example, the comparison below shows that all results are completely consistent.

from pandas.testing import assert_frame_equal

# df_dolphindb holds the factor results computed in DolphinDB; res_combined
# holds the results computed by Python + Pickle. Sort both identically first.
df_ddb = df_dolphindb.sort_values(by=["SecurityID", "DateTime", "amtDiff"]).reset_index(drop=True)
df_py = res_combined.sort_values(by=["SecurityID", "DateTime", "amtDiff"]).reset_index(drop=True)
df_ddb = df_ddb.round(4)
df_py = df_py.round(4)
# assert_frame_equal produces no output when the frames match and raises an
# AssertionError describing the difference when they do not
assert_frame_equal(df_ddb, df_py, check_exact=False, rtol=1, atol=1e-06)
print("DolphinDB and Python results are consistent")

FAQ

How to reproduce the code of this article?

Deploy the project according to the following steps:

  • Step 1: DolphinDB deployment

First, download DolphinDB and complete a single-node server deployment. This step can be done on both Linux and Windows. For details, see the link below:

Single node deployment and upgrade (dolphindb.cn)

  • Step 2: DolphinDB client installation

Install and start the DolphinDB GUI or the VS Code plugin. After connecting to the server, you can run the DolphinDB scripts. For details, see the links below:

Using DolphinDB database and client

DolphinDB VS Code plugin

  • Step 3: Run the DolphinDB code and the Python code
    • Execute the data simulation script (see the appendix) to generate the simulation data

  • Run the export script to export the data in DolphinDB to CSV files

The corresponding CSV files can then be found in the corresponding directory on the server.

  • Run each conversion script in Python to convert the exported CSV data into the corresponding format and save it to the specified path (taking Pickle as an example; a minimal sketch follows)

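A minimal sketch of this conversion step, with hypothetical input and output paths:

import pandas as pd

# hypothetical paths: read the CSV exported from DolphinDB and write one
# Pickle file per security, matching the grouped layout used in this test
df = pd.read_csv("/data/export/snapshot_20211201.csv")
for code, group in df.groupby("SecurityID"):
    group.reset_index(drop=True).to_pickle(f"/data/pickle/{code}.pkl")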

  • Calculate the factors in DolphinDB

  • Calculate the factors for the corresponding storage format in Python (taking Pickle as an example)

Summary

The comparison in this article shows that, with the same number of CPU cores, DolphinDB's integrated in-database computation is roughly 50 to 200 times faster than the solutions that calculate factors with Python, and its results are exactly the same as Python's.

  • In terms of data management and reading:
    • DolphinDB is a distributed data system designed for big data analysis and processing. It uses an efficient columnar storage format that optimizes I/O operations, so it performs very well when reading and writing large data sets. The Python pandas library can also handle large data sets, but may face performance challenges on very large ones.
    • DolphinDB's memory management is highly efficient: it can process data sets larger than memory while remaining economical in memory usage, which gives it more stability and higher performance on large data sets. In Python, processing large data sets may require considerably more memory management and optimization work.
    • As a database product, DolphinDB provides rich API interfaces, solid version compatibility, and permission control. File storage, by contrast, is hard to use across platforms, has poor compatibility between older and newer versions, and lacks permission control; in terms of operations and maintenance, DolphinDB is much easier to manage than file storage.
  • In terms of code implementation, DolphinDB's in-database SQL makes factor calculation calls and parallel invocation easier to implement. At the same time, its mature built-in functions greatly improve factor calculation performance, and the overall code volume can be cut by half or more.
  • In terms of parallel computing, DolphinDB automatically uses the currently available CPU resources, while Python scripts require explicit parallel scheduling code, which in turn makes the degree of concurrency easier to control.

Overall, in a production environment, using DolphinDB for factor calculation and storage is far more efficient than using Python + various file storage methods.

Appendix

The comparison test in this tutorial uses the following test script:

Click the link to get the complete script: python + file storage and DolphinDB factor calculation performance
