Working with Parquet files

Parquet files offer significant advantages over traditional formats like CSV or JSON, especially in analytical workloads, where the columnar layout lets queries read only the columns they need.

Tools like parquet-tools (https://github.com/hangxie/parquet-tools) and DuckDB (https://duckdb.org/docs/stable/data/parquet/overview) make it easy to create, manipulate, and query these files.

1) parquet-tools

Display data in the terminal as JSON (default), JSONL, or CSV:

parquet-tools cat data_file.parquet | jq .

Or display only two lines/records in JSONL:

parquet-tools cat --format jsonl --limit 2 data_file.parquet

Get the metadata about the Parquet file:

parquet-tools meta data_file.parquet
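
The same tool can also print the schema and count rows. I believe hangxie's parquet-tools ships schema and row-count subcommands, but check parquet-tools --help for the exact set in your version:

parquet-tools schema data_file.parquet
parquet-tools row-count data_file.parquet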

2) DuckDB

DuckDB is an embedded SQL database that supports reading and writing Parquet files.

Examples:

Generate a Parquet file:

COPY (SELECT 'example' AS col1) TO 'data_file.parquet' (FORMAT 'parquet');
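
A more realistic use is converting an existing CSV to Parquet. This is just a sketch with hypothetical file names (input.csv, output.parquet), using DuckDB's read_csv_auto and the per-file COMPRESSION option:

COPY (SELECT * FROM read_csv_auto('input.csv'))
  TO 'output.parquet' (FORMAT 'parquet', COMPRESSION 'zstd');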

Read a Parquet file:

SELECT * FROM read_parquet('data_file.parquet');
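
read_parquet also accepts glob patterns, so a whole directory of Parquet files can be queried as a single table. The data/ directory here is a hypothetical placeholder, and col1 is the column from the file generated above:

SELECT col1, COUNT(*) AS n
FROM read_parquet('data/*.parquet')
GROUP BY col1;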

Lately, I have been using DuckDB for most of my analytics (dealing with gigabytes of data), and it handles both local and cloud-based files efficiently.
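
For the cloud side, DuckDB's httpfs extension reads Parquet straight from HTTP(S) or S3. The bucket and path below are hypothetical, and S3 credentials have to be configured separately:

INSTALL httpfs;
LOAD httpfs;
SELECT * FROM read_parquet('s3://my-bucket/path/data_file.parquet');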

What is the Parquet file format?

Read about the Apache project's Overview/Motivation page (https://parquet.apache.org/docs/overview/motivation/) and the project documentation.

Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.
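
Since compression is chosen per column, you can inspect which codec each column chunk actually uses with DuckDB's built-in parquet_metadata function (data_file.parquet being the file generated earlier):

SELECT path_in_schema, compression, total_compressed_size
FROM parquet_metadata('data_file.parquet');

For a good easy reading, go to this great blog article: https://blog.matthewrathbone.com/2019/12/20/parquet-or-bust.html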