Apache Hudi is an open-source data lakehouse platform, created at Uber, that provides efficient upserts and incremental processing on large datasets stored in object storage. It solves a specific challenge: applying real-time database changes (inserts, updates, deletes) to massive Parquet-based data lakes without rewriting entire partitions on every change.
What Is Apache Hudi?
- Definition: A data lake storage framework that provides efficient upsert (update + insert) and delete operations on large datasets stored in HDFS or object storage. It uses a record-level index to locate which file contains a specific record and updates only that file rather than rewriting entire partitions (a minimal upsert sketch follows this list).
- Origin: Created at Uber in 2016 to solve the "How do we apply driver payment updates and trip corrections to our 100TB+ data lake in near real-time?" problem — donated to Apache in 2019.
- Record Index: Hudi maintains pluggable indexes that map each record key to its physical file location: bloom filters embedded in Parquet file footers, an external HBase index, or (since Hudi 0.14) a record-level index stored in the metadata table. This enables point updates to individual records without full partition rewrites.
- Table Types: Hudi offers two table types optimized for different access patterns: Copy-on-Write (COW) for read-heavy workloads and Merge-on-Read (MOR) for write-heavy streaming use cases.
- Incremental Queries: Consumers can query "What records changed in the last 15 minutes?" rather than reprocessing the entire table — critical for streaming ETL pipelines and real-time ML feature updates.
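A minimal batch upsert sketch in PySpark, assuming spark is an active SparkSession with the Hudi Spark bundle on the classpath; the table name, record key, and precombine field used here (trips, trip_id, updated_at) are illustrative, not fixed names:
# `updates_df` holds the changed rows; names and paths below are illustrative.
(updates_df.write.format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")      # key used to locate the record
    .option("hoodie.datasource.write.precombine.field", "updated_at")  # latest version of a key wins
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("/path/to/hudi/table"))
Hudi resolves each incoming key against its index, then rewrites (COW) or logs (MOR) only the affected file groups.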
Why Hudi Matters for AI/ML
- Real-Time Feature Updates: Update individual user features (latest purchase, recent click, current balance) in the feature store within minutes of the triggering event — Hudi's upsert handles the "update this one record" operation efficiently.
- Streaming Ingestion: A Kafka → Spark Structured Streaming → Hudi pipeline continuously ingests CDC events from databases into a queryable analytical table updated in near-real-time (see the sketch after this list).
- Incremental Training: ML pipelines can consume only new/changed records from Hudi tables since the last training run — avoiding reprocessing terabytes of historical data to incorporate daily updates.
- GDPR Compliance: Delete a specific user's records across all Hudi tables without manual partition rewrites. On COW tables the delete rewrites only the affected file groups; on MOR tables it appends delete markers to the logs, which suppress the records at read time until compaction physically removes them.
- Time Travel: Audit training data state at any past point — Hudi maintains timeline metadata enabling point-in-time queries for debugging model drift.
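A sketch of the streaming ingestion pipeline described above, assuming events_df is a streaming DataFrame already parsed from Kafka; the table name, field names, and paths are illustrative:
(events_df.writeStream.format("hudi")
    .option("hoodie.table.name", "user_features")
    .option("hoodie.datasource.write.recordkey.field", "user_id")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")  # write-optimized for streaming
    .option("checkpointLocation", "/path/to/checkpoints")
    .outputMode("append")
    .start("/path/to/hudi/user_features"))
The same writer path covers GDPR deletes by setting hoodie.datasource.write.operation to delete, and time-travel reads are served through the as.of.instant read option.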
Core Hudi Concepts
Table Types:
Copy-on-Write (COW):
- Writes rewrite affected Parquet files with updates applied
- Read-optimized: readers always see clean Parquet files
- Write amplification: expensive for high-frequency updates
- Best for: analytics workloads with infrequent updates
Merge-on-Read (MOR):
- Writes append delta log files (Avro format) rather than rewriting Parquet
- Reads merge base Parquet with delta logs on the fly (see the query sketch after this list)
- Write-optimized: extremely fast ingestion for streaming
- Best for: streaming CDC ingestion, near-real-time use cases
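The read-side difference between the two surfaces in the query type. A short sketch, with paths illustrative:
# Snapshot query on a MOR table: merges base Parquet files with delta logs (freshest data).
snapshot_df = spark.read.format("hudi").load("/path/to/hudi/table")

# Read-optimized query: scans only the compacted base files, so it is faster
# but may lag the latest writes until compaction catches up.
ro_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load("/path/to/hudi/table")
)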
Hudi Timeline (Transaction Log):
- Ordered sequence of actions: commit, compaction, clean, rollback
- Every committed instant is immutable with timestamp, action type, and state
- Incremental queries specify a start instant to get only subsequent changes; the sketch below shows one way to recover that instant from Hudi's metadata columns
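Every row Hudi writes carries metadata columns such as _hoodie_commit_time. A common pattern (a sketch, assuming an active SparkSession named spark) is to record the latest instant seen and feed it into the next incremental pull:
from pyspark.sql import functions as F

# The max commit time observed so far becomes the begin instant
# for the next incremental query (see the pattern below).
latest_instant = (
    spark.read.format("hudi").load("/path/to/hudi/table")
    .agg(F.max("_hoodie_commit_time"))
    .collect()[0][0]
)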
Incremental Query Pattern:
# Pull only records committed after the given begin instant (yyyyMMddHHmmss).
hudi_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("/path/to/hudi/table")
)
Compaction:
- MOR tables periodically compact delta logs back into Parquet base files
- Scheduled as an async background job to avoid blocking ingestion (an illustrative config sketch follows this list)
- Reduces read-time merge overhead as delta logs accumulate
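A sketch of commonly used compaction knobs for a streaming MOR writer; the values are illustrative, and exact keys and defaults can vary by Hudi version:
# Illustrative compaction settings, merged into the writer via .options(**compaction_opts).
compaction_opts = {
    "hoodie.datasource.compaction.async.enable": "true",  # compact in the background
    "hoodie.compact.inline": "false",                     # keep compaction off the write path
    "hoodie.compact.inline.max.delta.commits": "5",       # schedule after N delta commits
}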
Hudi vs Alternatives
| Feature | Hudi | Delta Lake | Iceberg |
|---------|------|-----------|---------|
| Upsert efficiency | Best (record index) | Good | Good |
| Streaming native | Yes (MOR) | Yes | Yes |
| Incremental queries | Native | Change Data Feed | Incremental scan |
| Engine support | Spark, Flink (writes); Trino/Presto, Hive (reads) | Spark-first; Trino/Presto readers | Spark, Flink, Trino, Hive, and most major engines |
Apache Hudi is the streaming-first data lakehouse platform that makes real-time upserts on massive datasets practical. By maintaining a record-level index and providing both copy-on-write and merge-on-read table types, Hudi enables ML teams to build near-real-time feature stores and continuously updated training datasets on top of object storage without the prohibitive cost of full-partition rewrites.