lakeFS is a Git-for-data platform that adds branching, commits, and rollbacks to object storage (S3, GCS, Azure Blob). Data engineers and ML teams can experiment with ETL pipelines on isolated branches of production data, roll back a failed job by reverting to an earlier commit, and keep a full version history of their data, all with a workflow modeled on Git-based software development.

What Is lakeFS?

Why lakeFS Matters for AI/ML

Core lakeFS Concepts

Repository: A versioned data lake namespace in lakeFS, backed by a storage namespace (a bucket or a prefix within a bucket) in object storage. Each repository has a default branch, named main by default.

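Branches: Isolated namespaces within a repository. Creating a branch is a metadata-only, zero-copy operation: the new branch points at an existing commit and no objects are duplicated. Branch from main, modify files, then merge back or discard the branch.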

Commits: Atomic, immutable snapshots of the entire branch state at a point in time. Every commit records a hash, timestamp, committer, and message.
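The repository, branch, and commit concepts above can be illustrated with a small in-memory model. This is a toy sketch, not lakeFS's implementation: the `ToyRepo` and `Commit` names, the SHA-256 commit IDs, and the path-to-object-address mapping are all assumptions made for illustration. It shows why branching can be zero-copy: a branch is only a pointer to an immutable commit.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    """Immutable snapshot mapping each path to the address of its object."""
    commit_id: str
    parent: Optional[str]
    message: str
    committer: str
    timestamp: float
    tree: Tuple[Tuple[str, str], ...]  # sorted (path, object_address) pairs


class ToyRepo:
    """Hypothetical in-memory model of a versioned repository."""

    def __init__(self) -> None:
        self.commits: Dict[str, Commit] = {}
        self.branches: Dict[str, str] = {}           # branch name -> commit id
        self.staged: Dict[str, Dict[str, str]] = {}  # uncommitted writes per branch
        self.branches["main"] = self._store(None, "Repository created", "system", {})

    def _store(self, parent, message, committer, tree) -> str:
        frozen = tuple(sorted(tree.items()))
        timestamp = time.time()
        payload = repr((parent, message, committer, timestamp, frozen)).encode()
        commit_id = hashlib.sha256(payload).hexdigest()
        self.commits[commit_id] = Commit(
            commit_id, parent, message, committer, timestamp, frozen
        )
        return commit_id

    def create_branch(self, name: str, source: str) -> None:
        # Zero-copy: only the commit pointer is copied, never the objects.
        self.branches[name] = self.branches[source]
        self.staged[name] = {}

    def write(self, branch: str, path: str, object_address: str) -> None:
        # Writes stay uncommitted until commit() is called on the branch.
        self.staged.setdefault(branch, {})[path] = object_address

    def commit(self, branch: str, message: str, committer: str) -> str:
        parent = self.branches[branch]
        tree = dict(self.commits[parent].tree)
        tree.update(self.staged.get(branch, {}))
        self.staged[branch] = {}
        self.branches[branch] = self._store(parent, message, committer, tree)
        return self.branches[branch]

    def read(self, branch: str) -> Dict[str, str]:
        """Return the committed state of a branch as a path -> address dict."""
        return dict(self.commits[self.branches[branch]].tree)
```

In this model, committing on a feature branch moves only that branch's pointer, so main keeps reading the snapshot it pointed at before.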

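Merges: Merge a feature branch back to main after validating ETL output. lakeFS detects conflicts (the same object changed on both branches) and fails the merge unless a resolution strategy, source-wins or dest-wins, is specified.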
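Conflict detection in a merge can be sketched as a three-way comparison between the common ancestor and the two branch tips. The following is a simplified illustration, not lakeFS's actual algorithm: the function names, the `MergeConflict` exception, and the representation of a branch as a dict of path to object address are assumptions. The strategy names mirror the values accepted by the `--strategy` flag of `lakectl merge`.

```python
from typing import Dict, List, Optional, Set

Tree = Dict[str, str]  # path -> object address (hypothetical representation)


class MergeConflict(Exception):
    """Raised when both sides changed the same path in different ways."""

    def __init__(self, paths: List[str]) -> None:
        super().__init__("conflicting paths: " + ", ".join(paths))
        self.paths = paths


def changed_paths(base: Tree, tip: Tree) -> Set[str]:
    """Paths added, modified, or deleted between base and tip."""
    return {p for p in base.keys() | tip.keys() if base.get(p) != tip.get(p)}


def three_way_merge(
    base: Tree, source: Tree, dest: Tree, strategy: Optional[str] = None
) -> Tree:
    """Merge source into dest relative to their common ancestor, base."""
    if strategy not in (None, "source-wins", "dest-wins"):
        raise ValueError(f"unknown strategy: {strategy}")

    source_changes = changed_paths(base, source)
    dest_changes = changed_paths(base, dest)

    # A conflict is a path changed on both sides with different results.
    conflicts = {
        p for p in source_changes & dest_changes if source.get(p) != dest.get(p)
    }
    if conflicts and strategy is None:
        raise MergeConflict(sorted(conflicts))

    merged = dict(dest)
    for path in source_changes:
        if path in conflicts and strategy == "dest-wins":
            continue  # keep the destination's version of this path
        if path in source:
            merged[path] = source[path]
        else:
            merged.pop(path, None)  # the source deleted this path
    return merged
```

Changes that touch different paths merge cleanly; only paths modified on both branches need a strategy or manual resolution.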

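Typical ML Workflow:

# Create an isolated branch from main
lakectl branch create lakefs://repo/feature-v2 --source lakefs://repo/main

# Run the Spark ETL job, writing to the branch through the lakeFS S3 gateway (s3a://<repository>/<branch>/<path>)
spark-submit etl_job.py --output s3a://repo/feature-v2/features/

# Validate output
python validate_features.py --branch feature-v2

# Commit the validated output (a merge only includes committed changes)
lakectl commit lakefs://repo/feature-v2 -m "Add feature-v2 ETL output"

# If valid, merge to main
lakectl merge lakefs://repo/feature-v2 lakefs://repo/main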
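The workflow above can be wrapped in a small driver that merges only when every earlier step succeeds. This is a minimal sketch under stated assumptions: `gated_merge`, `run_command`, and the injectable `runner` parameter are hypothetical names, the ETL and validation commands are passed in by the caller, and the `lakectl` invocations assume the `lakefs://` URI form and a commit step before the merge.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def gated_merge(
    repo: str,
    branch: str,
    etl_cmd: Sequence[str],
    validate_cmd: Sequence[str],
    runner: Runner = run_command,
    main: str = "main",
) -> bool:
    """Branch, run ETL, validate, commit, then merge. Stop at the first failure."""
    source_uri = f"lakefs://{repo}/{branch}"
    dest_uri = f"lakefs://{repo}/{main}"

    steps = [
        ["lakectl", "branch", "create", source_uri, "--source", dest_uri],
        list(etl_cmd),
        list(validate_cmd),
        ["lakectl", "commit", source_uri, "-m", f"ETL output on {branch}"],
        ["lakectl", "merge", source_uri, dest_uri],
    ]
    for argv in steps:
        if runner(argv) != 0:
            # main is untouched: nothing is merged after a failed step.
            return False
    return True
```

A call might look like `gated_merge("repo", "feature-v2", ["spark-submit", "etl_job.py", "--output", "s3a://repo/feature-v2/features/"], ["python", "validate_features.py", "--branch", "feature-v2"])`. Passing a fake `runner` makes the ordering testable without a lakeFS installation.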

Integration Points:

lakeFS vs Alternatives

| Tool | Versioning | Granularity | Ecosystem | Best For |
|---|---|---|---|---|
| lakeFS | Full lake | File-level | S3-compatible | Data lake teams |
| Delta Lake | Table | Row-level | Spark-centric | Databricks users |
| DVC | Pointers | File-level | Git + S3/GCS | ML dataset versioning |
| Pachyderm | Full pipeline | File-level | Kubernetes | Enterprise, lineage |

lakeFS is a Git layer for data lakes that brings software engineering discipline to data engineering. By making branching, testing, and rollback as natural for data pipelines as they are for application code, lakeFS lowers the risk of experimenting against production data and treats data platform reliability as a first-class engineering concern.
