
Apache Hudi is an open-source data lakehouse platform created at Uber for efficient upserts and incremental processing on large datasets stored in object storage. It addresses a specific challenge: applying real-time database changes (inserts, updates, deletes) to massive Parquet-based data lakes without rewriting entire partitions on every change.
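As a rough illustration of what an upsert looks like from PySpark, the sketch below builds the write options for Hudi's Spark datasource and passes them to a standard DataFrame write. The option keys are the ones documented for the Hudi Spark datasource, but they can differ between releases, so check them against the version in use. The table name, field names, and the helper functions `hudi_upsert_options` and `upsert` are hypothetical and exist only for this example.

```python
from typing import Dict


def hudi_upsert_options(
    table_name: str,
    record_key: str,
    precombine_field: str,
    partition_field: str,
    table_type: str = "COPY_ON_WRITE",
) -> Dict[str, str]:
    """Build Hudi datasource write options for an upsert (hypothetical helper)."""
    if table_type not in ("COPY_ON_WRITE", "MERGE_ON_READ"):
        raise ValueError(f"unknown Hudi table type: {table_type}")
    return {
        "hoodie.table.name": table_name,
        # "upsert" updates rows whose record key already exists and inserts the rest.
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": table_type,
        # Field that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field used to pick the latest version when a batch has duplicate keys.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


def upsert(changes_df, base_path: str, options: Dict[str, str]) -> None:
    """Apply a batch of changes to the table at base_path.

    Assumes changes_df is a Spark DataFrame and that the Hudi Spark bundle
    is on the classpath; neither is checked here.
    """
    changes_df.write.format("hudi").options(**options).mode("append").save(base_path)


# Hypothetical table and column names, for illustration only.
opts = hudi_upsert_options("orders", "order_id", "updated_at", "order_date")
```

Building the options as a plain dictionary keeps the configuration inspectable without a Spark session; only `upsert` needs a running cluster.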

What Is Apache Hudi?

Why Hudi Matters for AI/ML

Core Hudi Concepts

Table Types:

Copy-on-Write (COW):

Merge-on-Read (MOR):

Hudi Timeline (Transaction Log):

Incremental Query Pattern:

    hudi_df = (
        spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
        .load("/path/to/hudi/table")
    )
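An incremental query is usually run repeatedly, with each run starting from the last commit the previous run saw. The sketch below shows one way to carry that checkpoint forward. It reuses the two option keys from the query above; the end-instant key and the `_hoodie_commit_time` metadata column are part of Hudi, while the helper names (`incremental_read_options`, `read_changes`, `next_checkpoint`) are hypothetical. It also assumes commit times are fixed-width digit strings, so that plain string comparison orders them chronologically.

```python
from typing import Dict, Iterable, Optional


def incremental_read_options(
    begin_instant: str, end_instant: Optional[str] = None
) -> Dict[str, str]:
    """Build read options for an incremental query (hypothetical helper)."""
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Optional upper bound on the commit range to read.
        options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options


def read_changes(spark, base_path: str, begin_instant: str):
    """Load only the records committed after begin_instant (needs Spark with Hudi)."""
    return (
        spark.read.format("hudi")
        .options(**incremental_read_options(begin_instant))
        .load(base_path)
    )


def next_checkpoint(commit_times: Iterable[str], previous: str) -> str:
    """Return the newest commit time in a batch, or the previous checkpoint if the batch is empty.

    Assumes commit times are same-width digit strings, so max() on strings
    gives the most recent one.
    """
    return max(commit_times, default=previous)


# With Spark available, the commit times would come from the batch itself, e.g.:
#   rows = read_changes(spark, path, checkpoint).select("_hoodie_commit_time").distinct().collect()
#   checkpoint = next_checkpoint((r[0] for r in rows), checkpoint)
checkpoint = next_checkpoint(["20240102120000", "20240101090000"], "20240101000000")
```

Persisting `checkpoint` between runs (in a file, a table, or a job parameter) is left to the surrounding pipeline.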

Compaction:

Hudi vs Alternatives

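| Feature             | Hudi                | Delta Lake   | Iceberg           |
|---------------------|---------------------|--------------|-------------------|
| Upsert efficiency   | Best (record index) | Good         | Good              |
| Streaming native    | Yes (MOR)           | Yes          | Yes               |
| Incremental queries | Native              | CDC feed     | Incremental scan  |
| Engine support      | Spark, Flink        | Spark, Trino | All major engines |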

Apache Hudi is a streaming-first data lakehouse platform that makes real-time upserts on massive datasets practical. By maintaining a record-level index and providing both copy-on-write and merge-on-read table types, Hudi lets ML teams build near-real-time feature stores and continuously updated training datasets on top of object storage without the prohibitive cost of full-partition rewrites.
