Apache Parquet is the columnar binary file format that has become the universal standard for storing large analytical datasets. By storing each column's data contiguously, it typically achieves 2-10x better compression and 10-100x faster analytical queries than row-oriented formats like CSV, because queries read only the columns they need and can skip entire row groups using column statistics.
What Is Apache Parquet?
- Definition: An open-source columnar storage format originally developed by Twitter and Cloudera. Instead of storing each row's values together (as row-oriented formats such as CSV and Avro do), Parquet stores all values of each column together, enabling highly efficient compression and analytical query pushdown.
- Origin: Created in 2013 to bring Dremel's columnar storage concepts to the Hadoop ecosystem — co-developed by Twitter and Cloudera as a neutral format compatible with any processing framework.
- Universal Adoption: The default storage format for Spark, Delta Lake, Iceberg, and Hudi, the preferred format for Presto/Trino and Athena, and supported via external tables in BigQuery and Snowflake; effectively the universal language of big data analytics.
- Self-Describing: The schema is embedded in the file footer (using Thrift encoding), so readers automatically know column names, types, and encodings without an external schema registry; the sketch after this list shows how to read it.
- Encoding: Multiple encoding strategies per column — dictionary encoding for low-cardinality columns, run-length encoding (RLE) for repetitive values, delta encoding for monotonic sequences — selected per column to maximize compression.
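Because the schema and per-column encodings live in the footer, they can be inspected without reading any data pages. A minimal PyArrow sketch, assuming a local file named dataset.parquet:
import pyarrow.parquet as pq

# Read just the footer: no data pages are decoded
schema = pq.read_schema("dataset.parquet")
print(schema)  # column names and types come straight from the file itself

# The footer also records which encodings and codec each column chunk used
meta = pq.ParquetFile("dataset.parquet").metadata
first_chunk = meta.row_group(0).column(0)
print(first_chunk.path_in_schema, first_chunk.encodings, first_chunk.compression)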
Why Parquet Matters for AI/ML
- Training Dataset Storage: Standard format for storing large ML training datasets on S3/GCS — efficiently compressed, compatible with every major ML framework and cloud service.
- Column Pruning: A model training job reading only the "text" and "label" columns from a 500-column Parquet file pulls just those two columns' data, cutting IO by roughly 99.6% (assuming columns of similar size), which is critical for large-scale training dataset processing.
- Predicate Pushdown: To read only the rows where label == 1 from a billion-row dataset, Parquet's row-group min/max statistics let readers skip entire row groups without decompressing them, touching only the relevant data blocks.
- HuggingFace Datasets: The Hugging Face Hub distributes dataset shards as Parquet files, making Parquet the standard way to ship ML training data at scale, with efficient loading into Arrow-backed datasets.
- Feature Stores: Feature engineering pipelines write Parquet to S3; training jobs read specific feature columns via PyArrow with column pruning and predicate pushdown — efficient feature retrieval without loading entire tables.
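To make the feature-store pattern concrete, here is a minimal PyArrow sketch; the column names, the features.parquet path, and the zstd codec are illustrative assumptions, not anything mandated by a particular feature store:
import pyarrow as pa
import pyarrow.parquet as pq

# Feature pipeline: write an engineered feature table (columns are illustrative)
features = pa.table({
    "user_id": [1, 2, 3],
    "clicks_7d": [12, 0, 5],
    "label": [1, 0, 1],
})
pq.write_table(features, "features.parquet", compression="zstd")

# Training job: read only the needed columns, with predicate pushdown on label
train = pq.read_table(
    "features.parquet",
    columns=["clicks_7d", "label"],
    filters=[("label", "=", 1)],
)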
Parquet File Structure
File Layout:
Row Group 1 (128MB default)
  Column Chunk: user_id [min=1, max=1000000]
    Page 1 (1MB): dictionary-encoded values
    Page 2 (1MB): ...
  Column Chunk: event_type [min="click", max="view"]
    Page 1: RLE encoded
  Column Chunk: embedding [512 floats per row]
    Page 1: plain encoding
Row Group 2
  ...
File Footer: schema, row group statistics, column offsets
Magic bytes: PAR1
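The statistics shown in the layout above can be read programmatically from the footer, which is what query engines use to decide which row groups to skip. A small sketch, again assuming a local dataset.parquet:
import pyarrow.parquet as pq

meta = pq.ParquetFile("dataset.parquet").metadata
print(meta.num_rows, meta.num_row_groups, meta.num_columns)

# Per-row-group min/max statistics are what drive predicate pushdown
for rg in range(meta.num_row_groups):
    col = meta.row_group(rg).column(0)  # first column of each row group
    stats = col.statistics
    if stats is not None and stats.has_min_max:
        print(rg, col.path_in_schema, stats.min, stats.max)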
Reading Parquet in Python:
import pyarrow.parquet as pq

# Read only specific columns; all others are skipped
table = pq.read_table("dataset.parquet", columns=["text", "label"])

# Filter with predicate pushdown; non-matching row groups are skipped
table = pq.read_table(
    "dataset.parquet",
    filters=[("label", "=", 1), ("year", ">=", 2023)],
)

# Convert to pandas (the same Arrow table can also back a HuggingFace dataset)
df = table.to_pandas()
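For datasets too large to materialize in memory, the same file can be streamed batch by batch, which fits training loops nicely. A sketch that assumes the file has text and label columns; the batch size is arbitrary:
import pyarrow.parquet as pq

pf = pq.ParquetFile("dataset.parquet")
# Stream ~10k-row record batches while still reading only two columns
for batch in pf.iter_batches(batch_size=10_000, columns=["text", "label"]):
    rows = batch.to_pydict()  # {"text": [...], "label": [...]}
    # ... feed rows["text"] and rows["label"] into the training loop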
Compression Codecs (Parquet supports multiple):
- Snappy: fast compress/decompress, moderate ratio — default for most tools
- Gzip: better ratio, slower — good for archival
- Zstd: best ratio + fast decompression — increasingly the modern default
- LZ4: fastest decompression — good for hot data
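The codec is chosen at write time and can even differ per column; which one wins depends on your data, so treat the sketch below as an assumption-laden illustration rather than a recommendation:
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"text": ["a", "b"], "label": [0, 1]})

# Same data, different codecs: trading compression ratio against CPU time
pq.write_table(table, "events_snappy.parquet", compression="snappy")
pq.write_table(table, "events_zstd.parquet", compression="zstd")

# Per-column codecs are also supported
pq.write_table(table, "events_mixed.parquet",
               compression={"text": "zstd", "label": "snappy"})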
Parquet vs Other Formats
| Format | Orientation | Compression | Analytics | Streaming | Best For |
|--------|------------|-------------|-----------|-----------|---------|
| Parquet | Columnar | Excellent | Excellent | No | Analytics, ML datasets |
| Avro | Row | Good | Poor | Yes | Kafka, schema evolution |
| CSV | Row | None | Poor | Yes | Human-readable exchange |
| Arrow | Columnar | Good | Excellent | Yes | In-memory processing |
| ORC | Columnar | Excellent | Excellent | No | Hive/Hadoop ecosystem |
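As a point of comparison with row-oriented CSV, converting an existing CSV file to Parquet is a short PyArrow script; events.csv and the zstd codec are placeholder assumptions:
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Read the row-oriented CSV into an in-memory Arrow table, then write columnar Parquet
table = pacsv.read_csv("events.csv")
pq.write_table(table, "events.parquet", compression="zstd")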
Apache Parquet is the universal columnar file format that makes big data analytics and large-scale ML training datasets practical. By storing data column by column, with per-column compression and built-in statistics for query pushdown, Parquet lets ML pipelines efficiently access exactly the data they need from datasets containing billions of rows and thousands of columns.