Delta Lake

Keywords: delta lake, acid, table

Delta Lake is an open-source storage layer, originally developed by Databricks, that adds ACID transactions, time travel, and schema enforcement to Apache Spark data lakes. It turns unreliable data lake storage into a "data lakehouse": low-cost, scalable object storage combined with the data reliability guarantees of a traditional data warehouse.

What Is Delta Lake?

- Definition: An open-source storage framework that extends Parquet files on object storage (S3, ADLS, GCS) with a transaction log (_delta_log/) — recording every insert, update, delete, and schema change as an atomic operation, enabling ACID semantics on top of files.
- Transaction Log: The core innovation — a JSON-based write-ahead log stored alongside Parquet files that records exactly which files are part of each table version. Readers see a consistent snapshot even while writers are concurrently modifying the table.
- Data Lakehouse: Term coined by Databricks to describe the architecture Delta Lake enables — data stored cheaply in object storage (like a data lake) with full ACID reliability and SQL query performance (like a data warehouse).
- Open Source: Delta Lake is Apache-licensed and governed by the Linux Foundation — major contributors include Databricks, Microsoft, and Apple. Compatible with any Spark deployment, not just Databricks.
- Adoption: Default storage format for all Databricks workloads; also supported by Apache Spark, Trino, Presto, Hive, and the Delta Kernel for non-Spark engines.
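
The transaction log idea above can be illustrated with a toy sketch in plain Python. This is not the real Delta implementation (the actual log also records schema, stats, and protocol actions); it only shows the core mechanism: each commit is a numbered JSON file of add/remove actions under `_delta_log/`, and a reader reconstructs any table version by replaying the log up to that version.

```python
import json
import os
import tempfile

def commit(log_dir, version, actions):
    """Write commit file N.json; 'x' mode fails if the version already exists,
    mimicking Delta's put-if-absent commit protocol."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

def snapshot(log_dir, as_of_version=None):
    """Replay the log in order to compute the set of live data files
    at a given version (time travel) or at the latest version."""
    live = set()
    for name in sorted(os.listdir(log_dir)):
        version = int(name.split(".")[0])
        if as_of_version is not None and version > as_of_version:
            break
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"])
                elif "remove" in action:
                    live.discard(action["remove"])
    return live

log_dir = tempfile.mkdtemp()
commit(log_dir, 0, [{"add": "part-000.parquet"}])
commit(log_dir, 1, [{"remove": "part-000.parquet"}, {"add": "part-001.parquet"}])
print(snapshot(log_dir))                   # latest snapshot
print(snapshot(log_dir, as_of_version=0))  # "time travel" to version 0
```

Because old commit files (and the data files they reference) are retained, reading version 0 still works after version 1 rewrote the table: that retention is what makes time travel possible.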

Why Delta Lake Matters for AI/ML

- Training Data Reliability: ACID guarantees mean ML pipelines reading training data see consistent snapshots — no partial writes from concurrent ETL jobs corrupting feature tables mid-training.
- Time Travel for Experiments: Reproduce any model training run by querying the exact feature table state at a past timestamp — SELECT * FROM features TIMESTAMP AS OF '2024-01-15'.
- Schema Evolution: Add new feature columns to a training dataset table without breaking existing queries or rewriting all historical data — Delta Lake enforces schema on write and handles evolution gracefully.
- Unified Batch/Streaming: The same Delta table can simultaneously receive streaming inserts (from Kafka via Spark Structured Streaming) and serve batch training queries — enabling real-time feature updates.
- Change Data Feed: Delta Lake CDC tracks row-level changes — downstream feature pipelines can process only new/changed rows rather than reprocessing the entire table.
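
The incremental pattern behind Change Data Feed can be sketched in plain Python. This is a toy illustration over keyed dicts, not the Delta CDC API: diffing two versions of a feature table yields only the inserted, updated, and deleted rows for downstream pipelines to process.

```python
def change_feed(old, new):
    """Row-level changes between two snapshots keyed by primary key,
    tagged with change types loosely modeled on Delta's CDC output."""
    changes = []
    for key, row in new.items():
        if key not in old:
            changes.append(("insert", key, row))
        elif old[key] != row:
            changes.append(("update_postimage", key, row))
    for key, row in old.items():
        if key not in new:
            changes.append(("delete", key, row))
    return changes

# Two versions of a small feature table: row 2 changed, row 3 is new.
v1 = {1: {"clicks": 3}, 2: {"clicks": 7}}
v2 = {1: {"clicks": 3}, 2: {"clicks": 9}, 3: {"clicks": 1}}
for change in change_feed(v1, v2):
    print(change)
```

Note that unchanged row 1 produces no change record, which is exactly why CDC-style processing beats full-table reprocessing for large feature tables.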

Core Delta Lake Features

ACID Transactions:
- Serializable isolation: concurrent writers do not corrupt each other
- Atomic commits: either all files are written and committed, or none are
- Crash recovery: incomplete writes are rolled back on next access
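
The optimistic concurrency behind these guarantees can be sketched in plain Python (a simplified model; real Delta uses storage-specific put-if-absent primitives and conflict checking before retrying): each writer races to create the next numbered log file, and the loser re-reads the log and retries.

```python
import os
import tempfile

def try_commit(log_dir, content):
    """Optimistic commit: race to create the next version file exclusively;
    on conflict, re-read the log and retry with the next version number."""
    while True:
        versions = [int(n.split(".")[0]) for n in os.listdir(log_dir)]
        next_version = max(versions, default=-1) + 1
        path = os.path.join(log_dir, f"{next_version:020d}.json")
        try:
            with open(path, "x") as f:  # 'x': atomic create-or-fail
                f.write(content)
            return next_version
        except FileExistsError:
            continue  # another writer won this version number; retry

log_dir = tempfile.mkdtemp()
print(try_commit(log_dir, '{"add": "a.parquet"}'))  # version 0
print(try_commit(log_dir, '{"add": "b.parquet"}'))  # version 1
```

Because a commit either creates its log file or fails entirely, readers never observe a half-written table version, which is the atomicity property listed above.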

Time Travel:
-- Query data as it was 30 days ago
SELECT * FROM sales VERSION AS OF 50;
SELECT * FROM sales TIMESTAMP AS OF '2024-01-01';

-- Restore table to previous version
RESTORE TABLE sales TO VERSION AS OF 42;

Schema Enforcement and Evolution:
# Delta rejects writes that don't match the table schema
df.write.format("delta").mode("append").save("/path/to/table")

# Enable schema evolution for safe column additions
df.write.format("delta").mode("append").option("mergeSchema", "true").save("/path/to/table")
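
The enforcement-vs-evolution behavior can be mimicked in plain Python (a toy sketch over column-name sets, not the Delta engine, which also checks types and nullability): mismatched columns are rejected unless schema merging is enabled.

```python
def append(table_schema, rows, merge_schema=False):
    """Enforce schema on write; with merge_schema, evolve by adding new columns."""
    schema = set(table_schema)
    for row in rows:
        extra = set(row) - schema
        if extra and not merge_schema:
            raise ValueError(f"schema mismatch: unexpected columns {extra}")
        schema |= extra  # evolution: absorb the new columns
    return sorted(schema)

schema = ["user_id", "clicks"]
print(append(schema, [{"user_id": 1, "clicks": 3}]))  # matches: accepted
print(append(schema, [{"user_id": 1, "clicks": 3, "ctr": 0.1}], merge_schema=True))
```

Without `merge_schema=True`, the second write would raise, which mirrors how Delta fails fast on schema drift instead of silently writing inconsistent files.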

MERGE (Upsert):
MERGE INTO target USING source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
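
The MERGE semantics above can be mimicked in plain Python (a toy sketch over dicts keyed by `id`, not the Delta engine): matched keys are updated from the source, unmatched keys are inserted.

```python
def merge_upsert(target, source):
    """WHEN MATCHED THEN UPDATE SET *, WHEN NOT MATCHED THEN INSERT *."""
    merged = dict(target)
    for key, row in source.items():
        merged[key] = row  # overwrite if matched, insert if not
    return merged

target = {1: {"name": "alice", "score": 10}, 2: {"name": "bob", "score": 20}}
source = {2: {"name": "bob", "score": 25}, 3: {"name": "carol", "score": 30}}
print(merge_upsert(target, source))
```

In real Delta Lake this upsert rewrites only the Parquet files containing matched rows and commits the swap atomically through the transaction log.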

Delta Lake vs Competitors

| Format | ACID | Streaming | Engine Support | Best For |
|--------|------|-----------|---------------|---------|
| Delta Lake | Full | Yes | Spark, Trino | Databricks ecosystem |
| Apache Iceberg | Full | Yes | Any engine | Engine-agnostic |
| Apache Hudi | Full | Yes | Spark, Flink | Upsert-heavy workloads |
| Plain Parquet | None | No | Universal | Static analytical data |

Delta Lake is the storage layer that makes data lakes production-grade. By layering ACID transactions, time travel, and schema enforcement on top of Parquet files in object storage, it eliminates the reliability problems that historically made raw data lakes unsuitable for business-critical analytics and ML training pipelines.
