Python + Pickle/Parquet/HDF5…: Comparison of quantitative factor calculation performance across file storage formats

In quantitative trading, computing high-frequency factors from L1/L2 market quotes and high-frequency trade data is a common investment research requirement. As the volume of financial market data keeps growing, traditional relational databases can no longer meet the storage and query needs of large-scale data. To cope with this challenge, […]

Parquet's data model and file format

Article directory: Data model; Parquet's atomic types; Parquet's logical types; Nested encoding; Parquet file format. The main reference for this article: Tom White. Hadoop: The Definitive Guide. 4th Edition. Tsinghua University Press, 2017, page 363. Apache Parquet is a columnar storage format that stores nested data efficiently and is widely used in Spark. The […]

Java code example: writing data to a Parquet file

Background: In the production environment, the daily data volume reaches the hundreds of millions, and it must be stored incrementally in the corresponding Hive partition. Plain-text data takes up a relatively large amount of storage, so when storage is tight, it is necessary to use the Parquet storage format to […]

Writing data to Parquet files with Flink's FileSink

You must use forBulkFormat when using FileSink to write data to columnar storage files such as ORC or Parquet files; here we take Parquet as an example and illustrate it with code. In Flink 1.15.3, you construct a ParquetWriterFactory and then pass it to the forBulkFormat method, […]

Spark SQL data source: Parquet files

Article directory: 1. Introduction to Parquet; 2. Reading and writing Parquet: (1) Reading a Parquet file with the parquet() method: 1. Data preparation, 2. Read the Parquet file, 3. Display the content of the data frame; (2) Writing a Parquet file with the parquet() method: 1. Write the Parquet file, 2. […]

Data storage formats (Parquet, ORC)

Article directory: How the data is stored; Row-based storage; Column-based storage; Parquet file layout; Concepts; Unit of parallel processing; Configuration; Row Group Size (size of the row group); Data Page Size (size of the data page); Metadata; Data page; Parquet experiment under Hive; Using simple Parquet tools; Supported components; Apache ORC file […]

Importing Parquet file data into Hive and JSON files into ES

Article directory: Import data from Parquet files into Hive; Query the Parquet file format; Compile the CLI tools; View metadata information; Query sampled data; Create a Hive table; Data storage format using Parquet; Load file; Import JSON data into ES; ES batch import API; Original JSON file content; Index structure; Restructure the JSON script; Restructured JSON file; bulk […]