In quantitative trading, computing high-frequency factors from L1/L2 market quotes and trade data is a common investment-research requirement. As the volume of financial market data keeps growing, traditional relational databases can no longer meet the storage and query needs of large-scale data. To cope with this challenge, […]
Tag: parquet
Column storage engine-kernel mechanism-Parquet format
Parquet is an open-source columnar storage format that is widely used in the big data field. 1. Data model and schema: Parquet inherits the data model of Protocol Buffers. Each record consists of one or more fields, and each field is either an atomic field or a group field. […]
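As a rough illustration of that data model (a toy sketch, not Parquet's actual implementation): in the Dremel-derived record shredding that Parquet uses alongside this model, a column's maximum repetition level counts the repeated fields on its path from the root, and its maximum definition level counts the optional or repeated fields. The helper below is hypothetical:

```python
# Hypothetical sketch of Parquet's max repetition/definition levels.
# A schema path is a list of (name, repetition) pairs from root to leaf,
# where repetition is "required", "optional" or "repeated".

def max_levels(path):
    """Return (max repetition level, max definition level) for a column."""
    rep = sum(1 for _, r in path if r == "repeated")          # repeated fields only
    defn = sum(1 for _, r in path if r in ("optional", "repeated"))
    return rep, defn

# e.g. message Doc { repeated group links { optional string url; } }
print(max_levels([("links", "repeated"), ("url", "optional")]))  # (1, 2)
```

A required-only path has both levels equal to zero, which is why Parquet can skip storing levels entirely for flat, required schemas.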
Parquet's data model and file format
Contents: data model; Parquet's atomic types; Parquet's logical types; nested encoding; Parquet file format. Main reference: Tom White, Hadoop: The Definitive Guide, 4th ed., Tsinghua University Press, 2017, p. 363. Apache Parquet is a columnar storage format that can efficiently store nested data and is widely used in Spark. The […]
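One concrete detail of the file format that is easy to check by hand: a Parquet file begins and ends with the 4-byte magic b"PAR1", and the 4 bytes just before the trailing magic hold the little-endian length of the Thrift-encoded footer metadata. A minimal sketch (the buffer below is a fake stand-in, not a real Parquet file):

```python
import struct

# Minimal sketch of Parquet file framing (not a full parser):
# <b"PAR1"> ... row groups ... <FileMetaData> <4-byte LE length> <b"PAR1">

def looks_like_parquet(data: bytes) -> bool:
    # 12 bytes is the smallest possible framing: magic + length + magic
    return len(data) >= 12 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"

def footer_length(data: bytes) -> int:
    # length of the Thrift-encoded FileMetaData, stored just before the tail magic
    return struct.unpack("<I", data[-8:-4])[0]

fake = b"PAR1" + b"\x00" * 16 + struct.pack("<I", 16) + b"PAR1"
print(looks_like_parquet(fake), footer_length(fake))  # True 16
```

Because the metadata sits at the end, readers first seek to the tail, read the footer length, then seek back to parse the schema and row-group index before touching any column data.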
A Java code example for writing data to a Parquet file
Background: in the production environment, data volume reaches hundreds of millions of records every day and must be incrementally stored into the corresponding Hive partition. Plain-text data occupies a relatively large amount of storage, so when storage is tight it is necessary to use the Parquet storage format to […]
DataX plug-in development: adding Parquet support to HdfsReader
The data warehouse Hive generally uses the Parquet format for storing data, but the open-source version of Alibaba's DataX does not support the Parquet format. I checked a lot of material online, and most of what has been written is incomplete, so I am summarizing a complete version here for reference. […]
Flink’s FileSink writes data to parquet files
When using FileSink to write data into columnar storage files such as ORCFile or ParquetFile, you must use forBulkFormat; here we take ParquetFile as an example and illustrate with code. In Flink 1.15.3, you construct a ParquetWriterFactory and pass it to the forBulkFormat method, […]
[Trino hands-on] ORC and Parquet query performance analysis under Trino
ORC and Parquet query performance analysis under Trino. Environment — OS: CentOS 6.5; JDK: 1.8; Memory: 256 GB; Disk: HDD; CPU: dual 8-core Intel® Xeon® E5-2630 v3 @ 2.40 GHz (32 hyper-threads); HDFS: 2.9.2; Hive: 2.3.9; Trino: 418. This article analyzes the query efficiency of files in different formats by measuring Trino's query times on files in […]
Spark SQL data source: Parquet file
Contents: 1. Introduction to Parquet; 2. How to read and write Parquet: (1) reading a Parquet file with the parquet() method — data preparation, reading the Parquet file, displaying the DataFrame contents; (2) writing a Parquet file with the parquet() method — writing the Parquet file, […]
Data storage formats (Parquet, ORC)
Contents: how data is stored — row storage vs. column storage; Parquet file layout — concepts, the unit of parallel processing, configuration (Row Group Size, Data Page Size), metadata, data pages; a Parquet experiment under Hive; using Parquet; simple tools; supported components; Apache ORC file […]
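The Row Group Size and Data Page Size settings above determine how a file is cut up. A back-of-the-envelope sketch (the sizes are illustrative, and real Parquet tracks pages per column chunk, which this simplification ignores):

```python
# Illustrative estimate only: how many row groups and pages per row group
# result from a target row-group size and data-page size.

def layout(total_bytes, row_group_bytes, page_bytes):
    row_groups = -(-total_bytes // row_group_bytes)       # ceiling division
    pages_per_group = -(-row_group_bytes // page_bytes)   # simplification: one column
    return row_groups, pages_per_group

# 1 GiB of column data, 128 MiB row groups, 1 MiB data pages
print(layout(1 << 30, 128 << 20, 1 << 20))  # (8, 128)
```

Larger row groups favor sequential scans (fewer, bigger I/O units), while smaller data pages let readers skip more finely via page-level statistics; the two knobs trade off against each other.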
Importing Parquet file data into Hive, and JSON file data into ES
Contents: importing data from Parquet files into Hive — inspecting the Parquet file format, compiling the CLI tool, viewing metadata, sampling data, creating a Hive table stored as Parquet, loading the file; importing JSON data into ES — the ES bulk import API, the original JSON file's contents, the index structure, restructuring the JSON with a script, bulk-loading the restructured JSON file […]
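The ES bulk import API mentioned above takes a newline-delimited body in which each document is an action line followed by a source line, with a required trailing newline. A minimal sketch for building such a body (the index name and documents below are made up):

```python
import json

# Minimal sketch of an Elasticsearch _bulk request body:
# one action line ({"index": {...}}) then one source line per document,
# newline-delimited, ending with a trailing newline.

def bulk_body(index, docs):
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # source line
    return "\n".join(lines) + "\n"

body = bulk_body("quotes", [{"sym": "AAPL", "px": 150.0}])
print(body)
```

The body would then be POSTed to the `_bulk` endpoint with `Content-Type: application/x-ndjson`; building it line-by-line like this is why restructuring the original JSON file is a necessary step before bulk loading.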