lakeFS

Keywords: lakefs,data lake,version

lakeFS is a Git-for-data platform that adds branching, commits, and rollbacks directly on top of object storage (S3, GCS, Azure Blob). It lets data engineering and ML teams safely experiment with ETL pipelines on branches of production data, roll back failed jobs instantly, and maintain complete data lineage using the same workflow as Git-based software development.

What Is lakeFS?

- Definition: An open-source data lake versioning layer that sits as a proxy in front of object storage — transparently intercepting S3/GCS API calls and adding Git-like version control semantics (branches, commits, merges, rollbacks) without copying data.
- Zero-Copy Branching: Creating a branch of a petabyte-scale data lake is instantaneous — lakeFS records metadata about what files belong to the branch, only storing actual data when files are modified (copy-on-write).
- S3-Compatible API: Existing tools (Spark, Presto, Trino, Pandas, Athena) connect to lakeFS using their standard S3 configuration — just change the S3 endpoint URL to lakeFS, no code changes required.
- Use Case: When a data engineer wants to test a new ETL transformation without risking production data — create a branch, run the job, validate results, merge if correct, or discard the branch if the job corrupts data.
- Founded: 2020 by Einat Orr and Oz Katz — backed by a16z, designed to bring software engineering best practices to data engineering workflows.
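The copy-on-write mechanics behind zero-copy branching can be sketched in a few lines of Python: a branch starts as a copy of the parent's path-to-object pointers, and new objects are allocated only when a file is modified. The class and method names here are illustrative, not the lakeFS API.

```python
# Illustrative sketch of zero-copy (copy-on-write) branching.
# A branch is just a mapping of logical paths to immutable object IDs;
# creating one copies pointers, never object data.

class Lake:
    def __init__(self):
        self.objects = {}                # object_id -> bytes (the actual data)
        self.branches = {"main": {}}     # branch -> {logical_path: object_id}

    def put(self, branch, path, data):
        obj_id = f"obj-{len(self.objects)}"
        self.objects[obj_id] = data      # a new object exists only on write
        self.branches[branch][path] = obj_id

    def create_branch(self, name, source):
        # Metadata-only copy of pointers: instant even for huge lakes.
        self.branches[name] = dict(self.branches[source])

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]

lake = Lake()
lake.put("main", "features/day1.parquet", b"v1")
lake.create_branch("feature-v2", "main")          # no data copied
lake.put("feature-v2", "features/day1.parquet", b"v2")  # copy-on-write
print(lake.get("main", "features/day1.parquet"))  # main still sees b"v1"
```

After the branch write, the store holds exactly two objects: the original and the modified copy. The branch that was never written to still points at the original bytes.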

Why lakeFS Matters for AI/ML

- Safe Experiment Infrastructure: ML teams can branch the feature store or training dataset, run feature engineering experiments, and merge only validated transformations — eliminating "who modified the training data?" incidents.
- Reproducibility: Every model training run can reference a specific lakeFS commit hash — guaranteeing the exact dataset used can be retrieved months later for debugging or auditing.
- Pipeline Testing: Test new Spark ETL jobs on a branch of production data — if the job produces incorrect output, discard the branch with zero data loss and zero cleanup effort.
- Multi-Team Isolation: Different data teams can work on the same data lake simultaneously on separate branches without stepping on each other's changes.
- Rollback: Data pipeline fails and corrupts a critical table? lakeFS rollback restores the previous commit state in seconds — no manual file recovery from backup.
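The reproducibility point above amounts to recording a commit ID alongside each training run, since a lakeFS URI that references a commit (rather than a branch) always resolves to the same bytes. A minimal sketch; the commit ID and names below are placeholders, and in practice the ID would come from `lakectl commit` output or the lakeFS API:

```python
# Sketch: pin a training run to an immutable lakeFS commit.
import json

def training_manifest(repo, commit_id, dataset_path, model_name):
    # Referencing a commit instead of a branch freezes the input:
    # lakefs://<repo>/<commit>/<path> cannot change after the fact.
    return {
        "model": model_name,
        "commit": commit_id,
        "dataset_uri": f"lakefs://{repo}/{commit_id}/{dataset_path}",
    }

manifest = training_manifest("ml-data", "c2f7a9", "features/train.parquet", "ranker-v3")
print(json.dumps(manifest, indent=2))
```

Storing this manifest next to the model artifact is what makes "retrieve the exact dataset months later" a lookup rather than an investigation.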

Core lakeFS Concepts

Repository: A versioned data lake namespace in lakeFS — maps to one or more object storage buckets. Each repository has a default main branch.

Branches: Isolated namespaces within a repository. Creating a branch is instant and zero-copy — branch from main, modify files, merge back or discard.

Commits: Atomic snapshots of the entire branch state at a point in time — every commit has a hash, timestamp, committer, and message. Commits are immutable.
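Why commits can be both immutable and cheap to compare: a commit ID can be derived from the parent's ID plus the branch metadata, so identical states produce identical hashes. The digest below is a toy illustration of that idea, not lakeFS's actual commit format:

```python
import hashlib

def commit_digest(parent_hash, message, entries):
    # entries: {logical_path: object_id}. Sorting paths makes the
    # digest independent of insertion order.
    h = hashlib.sha256()
    h.update(parent_hash.encode())
    h.update(message.encode())
    for path in sorted(entries):
        h.update(f"{path}={entries[path]}".encode())
    return h.hexdigest()[:12]

c1 = commit_digest("", "initial load", {"a.parquet": "obj-0"})
c2 = commit_digest(c1, "add b", {"a.parquet": "obj-0", "b.parquet": "obj-1"})
print(c1, c2)  # two distinct, deterministic commit IDs
```

Because each digest folds in its parent, any tampering with an earlier commit would change every ID after it, which is the same chaining property Git relies on.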

Merges: After validating ETL output, merge a feature branch back to main. lakeFS detects conflicting changes during the merge; conflicts must be resolved before the merge completes.
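Conflict detection in a merge follows the standard three-way rule: a path conflicts when both branches changed it relative to their common base commit. A simplified model of that rule (lakeFS's real merge operates over committed metadata, not in-memory dicts):

```python
# Three-way conflict detection over {logical_path: object_id} snapshots.
def detect_conflicts(base, ours, theirs):
    conflicts = []
    for path in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        # Conflict only if both sides diverged from base AND disagree.
        if o != b and t != b and o != t:
            conflicts.append(path)
    return sorted(conflicts)

base   = {"features/x.parquet": "obj-1", "features/y.parquet": "obj-2"}
ours   = {"features/x.parquet": "obj-9", "features/y.parquet": "obj-2"}  # changed x
theirs = {"features/x.parquet": "obj-7", "features/y.parquet": "obj-5"}  # changed x and y

print(detect_conflicts(base, ours, theirs))  # only features/x.parquet conflicts
```

`features/y.parquet` changed on only one side, so it merges cleanly; `features/x.parquet` diverged on both sides to different objects, so it needs a decision.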

Typical ML Workflow:
lakectl branch create lakefs://repo/feature-v2 --source lakefs://repo/main
# Run the Spark ETL job through the lakeFS S3 gateway
# (bucket = repository, first path segment = branch)
spark-submit etl_job.py --output s3a://repo/feature-v2/features/
# Validate output
python validate_features.py --branch feature-v2
# Commit the branch state, then merge to main if valid
lakectl commit lakefs://repo/feature-v2 -m "feature engineering v2"
lakectl merge lakefs://repo/feature-v2 lakefs://repo/main

Integration Points:
- Apache Spark: S3A filesystem with fs.s3a.endpoint pointed at the lakeFS gateway
- Presto/Trino: S3 catalog pointing to lakeFS
- Python: boto3 with lakeFS endpoint
- dbt: S3 profiles pointing to lakeFS
- CI/CD: GitHub Actions triggering data validation on branch commits
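All of these integrations share one addressing convention: when the client's S3 endpoint points at lakeFS, the repository acts as the bucket and the first path segment is the branch (or a commit ID). A small helper illustrating that scheme; the function names are ours, not part of any SDK:

```python
# lakeFS addressing as seen by S3-compatible clients:
# bucket = repository, first key segment = branch or commit ref.
# Assumes the client's S3 endpoint is configured to the lakeFS gateway.

def s3a_uri(repo, ref, path):
    return f"s3a://{repo}/{ref}/{path.lstrip('/')}"

def lakefs_uri(repo, ref, path):
    return f"lakefs://{repo}/{ref}/{path.lstrip('/')}"

print(s3a_uri("ml-data", "feature-v2", "features/train.parquet"))
# With boto3, the same object would be read as:
#   boto3.client("s3", endpoint_url="https://<lakefs-host>")
#       .get_object(Bucket="ml-data", Key="feature-v2/features/train.parquet")
```

This is why "no code changes required" holds: tools keep speaking plain S3, and the branch selection rides along in the key prefix.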

lakeFS vs Alternatives

| Tool | Versioning | Granularity | Ecosystem | Best For |
|------|-----------|------------|---------|---------|
| lakeFS | Full lake | File-level | S3-compatible | Data lake teams |
| Delta Lake | Table | Row-level | Spark-first | Databricks users |
| DVC | Pointers | File-level | Git + S3/GCS | ML dataset versioning |
| Pachyderm | Full pipeline | File-level | Kubernetes | Enterprise, lineage |

lakeFS is the Git layer for data lakes that brings software engineering discipline to data engineering — by making branching, testing, and rollback as natural for data pipelines as they are for application code, lakeFS eliminates the fear of experimenting on production data and makes data platform reliability a first-class engineering concern.
