Apache Iceberg is an open table format for huge analytical datasets, providing ACID transactions, time travel, and schema evolution on top of object storage. Originally created at Netflix to solve the reliability and performance problems of Hive-style directory partitioning at petabyte scale, it is now the engine-agnostic standard among data lakehouse table formats.
What Is Apache Iceberg?
- Definition: A high-performance table format specification for storing large analytical datasets in object storage — defining how table metadata (schemas, partitioning, snapshots) is stored alongside Parquet/ORC/Avro data files, enabling multiple compute engines to reliably read and write the same table.
- Origin: Created by Netflix engineers Ryan Blue and Daniel Weeks to solve production problems with Hive Metastore tables, specifically the inability to atomically update petabyte-scale tables and the directory-listing overhead of discovering which files belong to a query.
- Engine-Agnostic: Unlike Delta Lake (optimized for Spark/Databricks), Iceberg is a neutral specification, supported natively by Apache Spark, Trino, Presto, Apache Flink, Hive, DuckDB, and managed engines such as Amazon Athena, Google BigQuery (via BigLake), and Snowflake.
- Catalog: Iceberg tables are tracked via a catalog (Hive Metastore, AWS Glue, Nessie, REST catalog) that stores the current metadata pointer — enabling atomic table updates that all engines see simultaneously.
- Adoption: Netflix, Apple, LinkedIn, Adobe, Expedia — production deployments at petabyte+ scale using Iceberg as the foundational table format.
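The catalog's atomic metadata-pointer swap is the heart of Iceberg's ACID guarantee. A minimal compare-and-swap sketch (illustrative only, not Iceberg's implementation; all names here are made up):

```python
# Minimal sketch of a catalog's atomic metadata-pointer swap (illustrative,
# not Iceberg's actual implementation). A commit succeeds only if the table's
# current metadata file is still the one the writer based its changes on.
class Catalog:
    def __init__(self):
        self.pointer = {}  # table name -> current metadata file path

    def commit(self, table, expected, new):
        """Compare-and-swap: replace `expected` with `new` atomically."""
        if self.pointer.get(table) != expected:
            return False  # another writer committed first; caller must retry
        self.pointer[table] = new
        return True

cat = Catalog()
cat.pointer["db.orders"] = "metadata/v1.json"

# Writer A commits on top of v1 and wins.
assert cat.commit("db.orders", "metadata/v1.json", "metadata/v2.json")
# Writer B also based its work on v1, so its commit is rejected.
assert not cat.commit("db.orders", "metadata/v1.json", "metadata/v2b.json")
```

Real catalogs implement this swap with a transactional backend (a Hive Metastore lock, a Glue conditional update, or a REST commit endpoint); losing writers re-read the latest metadata and retry.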
Why Iceberg Matters for AI/ML
- Multi-Engine Flexibility: ML teams using Spark for training, Trino for exploration, and DuckDB for local analysis can all read the same Iceberg table — no vendor lock-in to a single compute engine.
- Hidden Partitioning: Iceberg derives partition values from source columns via declared transforms, so users never add partition columns to tables or queries; predicates on the raw columns are converted into partition pruning automatically.
- Time Travel for Reproducibility: Query training data as of any past snapshot — guaranteed to return identical results for model reproduction regardless of subsequent table modifications.
- Schema Evolution Without Rewrites: Add columns, rename columns, or widen types (e.g., int to long) in a large feature table without rewriting any data files; Iceberg maps columns by field ID between old and new schemas at read time.
- Row-Level Deletes: Iceberg v2 supports row-level position deletes and equality deletes — enabling GDPR compliance (delete a user's data) and CDC upserts on analytical tables.
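Hidden partitioning can be sketched in a few lines. This is an illustrative model of the month transform, not Iceberg's code; the row layout and names are assumptions:

```python
# Sketch of hidden partitioning (illustrative, not Iceberg's implementation).
# The table spec stores a transform such as months(order_date); writers derive
# partition values automatically and readers prune on the raw column.
from datetime import date

def month_transform(d: date) -> str:
    """Iceberg-style month transform: a date maps to its year-month bucket."""
    return f"{d.year:04d}-{d.month:02d}"

rows = [
    {"order_id": 1, "order_date": date(2024, 1, 15)},
    {"order_id": 2, "order_date": date(2024, 2, 3)},
]

# Writers bucket rows by the derived partition value; users never see it.
partitions = {}
for row in rows:
    partitions.setdefault(month_transform(row["order_date"]), []).append(row)

# A predicate on order_date prunes partitions without the user naming them.
wanted = month_transform(date(2024, 2, 20))  # query: WHERE order_date in Feb 2024
matching = partitions.get(wanted, [])
print([r["order_id"] for r in matching])  # → [2]
```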
Core Iceberg Features
Snapshot-Based Architecture:
- Every table write creates a new snapshot (immutable set of data files)
- Readers always see a consistent snapshot — no dirty reads during concurrent writes
- Snapshots are retained for a configurable period, enabling time travel and rollback
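The snapshot model above can be sketched as follows (a toy model, not Iceberg's implementation; class and method names are illustrative):

```python
# Sketch of snapshot-based reads (illustrative, not Iceberg's implementation).
# Each commit appends an immutable snapshot; a reader pins one snapshot for
# the whole query, so concurrent writes never produce dirty reads.
class Table:
    def __init__(self):
        self.snapshots = []  # list of (snapshot_id, frozenset of data files)

    def commit(self, files):
        snap_id = len(self.snapshots) + 1
        self.snapshots.append((snap_id, frozenset(files)))
        return snap_id

    def scan(self, snapshot_id=None):
        """Read the given snapshot, or the current one by default."""
        if snapshot_id is None:
            snapshot_id = self.snapshots[-1][0]
        return dict(self.snapshots)[snapshot_id]

t = Table()
v1 = t.commit({"a.parquet"})
v2 = t.commit({"a.parquet", "b.parquet"})

assert t.scan() == {"a.parquet", "b.parquet"}  # current snapshot
assert t.scan(v1) == {"a.parquet"}             # time travel to v1
```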
Time Travel:
-- Query historical data (Trino syntax; Spark SQL uses TIMESTAMP AS OF / VERSION AS OF)
SELECT * FROM orders FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00';
SELECT * FROM orders FOR VERSION AS OF 5234567890;
-- Roll the table back to a previous snapshot (Spark SQL stored procedure)
CALL catalog.system.rollback_to_snapshot('db.orders', 5234567890);
Partition Evolution:
-- Change partitioning strategy without rewriting data (metadata-only change)
ALTER TABLE orders REPLACE PARTITION FIELD years(order_date) WITH months(order_date);
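Why no rewrite is needed: each data file records the partition spec it was written under, and evolving the spec only changes what new files use. A sketch under those assumptions (illustrative names, not Iceberg's code):

```python
# Sketch of partition evolution (illustrative, not Iceberg's implementation).
# Each data file records the partition spec it was written under; replacing
# the spec only changes what NEW files use, so old files are never rewritten.
specs = {0: "years(order_date)", 1: "months(order_date)"}

files = [{"path": "old.parquet", "spec_id": 0}]  # written under years()

# REPLACE PARTITION FIELD: bump the default spec; existing files untouched.
default_spec_id = 1
files.append({"path": "new.parquet", "spec_id": default_spec_id})

# A scan plans each file using the spec it was written under.
plan = [(f["path"], specs[f["spec_id"]]) for f in files]
print(plan)
# → [('old.parquet', 'years(order_date)'), ('new.parquet', 'months(order_date)')]
```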
Metadata Pruning:
- Column-level min/max statistics in manifest files
- Queries skip entire data files based on predicates without reading them
- Orders of magnitude faster than Hive for selective queries on large tables
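The file-skipping logic behind metadata pruning reduces to a bounds-overlap check. A minimal sketch (the manifest layout here is simplified; Iceberg stores per-column lower/upper bounds in manifest entries):

```python
# Sketch of metadata pruning with column min/max statistics (illustrative).
# Manifests store per-file bounds; a file is read only if the query predicate
# can possibly match a value inside those bounds.
manifest = [
    {"path": "f1.parquet", "min": 10,  "max": 99},
    {"path": "f2.parquet", "min": 100, "max": 250},
    {"path": "f3.parquet", "min": 251, "max": 900},
]

def files_for_range(lo, hi):
    """Keep files whose [min, max] range overlaps the predicate [lo, hi]."""
    return [f["path"] for f in manifest if f["max"] >= lo and f["min"] <= hi]

# WHERE value BETWEEN 120 AND 200: only f2 can contain matches.
print(files_for_range(120, 200))  # → ['f2.parquet']
```

Because the check runs against metadata alone, files f1 and f3 are never opened, which is where the speedup over Hive-style full listings comes from.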
Iceberg vs Alternatives
| Format | Engine Agnostic | Multi-Writer | Row Deletes | Best For |
|--------|----------------|-------------|-------------|---------|
| Iceberg | Yes | Yes (optimistic) | Yes (v2) | Multi-engine, open standard |
| Delta Lake | Partial | Yes | Yes | Databricks/Spark focus |
| Hudi | Partial | Yes | Yes | Streaming upserts |
| Hive | No | No | No | Legacy only |
Apache Iceberg is the open standard for analytical table formats, liberating data from single-engine lock-in: by defining a precise, engine-agnostic specification for metadata and data files, it lets any compute engine reliably read, write, and time-travel over the same petabyte-scale tables with ACID guarantees.