Working with Parquet files

Parquet files offer significant advantages over traditional row-oriented formats like CSV or JSON, especially in analytical workloads, where queries typically read only a few columns across many rows.
Tools like parquet-tools and DuckDB make it easy to create, manipulate, and query these files.
parquet-tools: https://github.com/hangxie/parquet-tools
DuckDB: https://duckdb.org/docs/stable/data/parquet/overview
1) parquet-tools
Display the data in the terminal as JSON (the default), JSONL, or CSV:
parquet-tools cat data_file.parquet | jq .
Or display only two records in JSONL:
parquet-tools cat --format jsonl --limit 2 data_file.parquet
Get the metadata about the Parquet file:
parquet-tools meta data_file.parquet
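Recent builds of hangxie/parquet-tools also ship schema and row-count subcommands; assuming your build includes them (check parquet-tools --help), you can inspect the file's structure without dumping any data:
parquet-tools schema data_file.parquet
parquet-tools row-count data_file.parquet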
2) DuckDB
DuckDB is an embedded SQL database that supports reading and writing Parquet files.
Examples:
Generate a Parquet file:
COPY (SELECT 'example' AS col1) TO 'data_file.parquet' (FORMAT 'parquet');
Read a Parquet file:
SELECT * FROM read_parquet('data_file.parquet');
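A common variant is converting an existing CSV file to Parquet in one statement. A minimal sketch, assuming a local file named data.csv (a hypothetical input) and DuckDB's read_csv_auto reader; the COMPRESSION option selects the codec for the output file:
COPY (SELECT * FROM read_csv_auto('data.csv'))
TO 'data.parquet' (FORMAT 'parquet', COMPRESSION 'zstd');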
Lately, I have been using DuckDB for most of my analytics (on gigabytes of data), and it handles both local and cloud-based files efficiently.
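For the cloud-based case, DuckDB's httpfs extension lets read_parquet work directly against HTTP(S) and S3 URLs. A sketch, assuming a hypothetical bucket and that S3 credentials are already configured (for example via CREATE SECRET or environment variables):
INSTALL httpfs;
LOAD httpfs;
-- s3://my-bucket/... is a placeholder path
SELECT * FROM read_parquet('s3://my-bucket/data_file.parquet');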
What is the Parquet file format?
- Read the Apache project's Overview/Motivation page (https://parquet.apache.org/docs/overview/motivation/) and the project documentation, which explain the design goals:
Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. (The query at the end of this post shows these per-column settings in practice.)
- For a good, easy read, see this great blog article: https://blog.matthewrathbone.com/2019/12/20/parquet-or-bust.html
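To see the per-column settings mentioned above on an actual file, DuckDB's parquet_metadata table function reports the compression codec and encodings of each column chunk; run it against the data_file.parquet generated earlier:
SELECT path_in_schema, compression, encodings
FROM parquet_metadata('data_file.parquet');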