Pachyderm

Keywords: pachyderm, data versioning, pipeline

Pachyderm is the enterprise data versioning and pipeline orchestration platform for Kubernetes, combining Git-like data version control with automatically triggered containerized pipelines. It provides complete data lineage for every model artifact by tracking which data commit, code version, and pipeline version produced each output, enabling reproducible and auditable ML workflows at scale.

What Is Pachyderm?

- Definition: An enterprise data platform running natively on Kubernetes that combines two core capabilities: PFS (Pachyderm File System) for Git-like versioning of large datasets, and PPS (Pachyderm Pipeline System) for containerized data transformation pipelines that automatically trigger when new data is committed.
- PFS (Data Versioning): A distributed file system built on top of object storage (S3, GCS, Azure Blob) with Git semantics — you can commit files, create branches, see diffs, and roll back to any previous commit across petabyte-scale datasets.
- PPS (Automated Pipelines): Pipelines are defined as JSON/YAML specifications that describe a Docker container, the input repository to monitor, and the command to run — when new data is committed to a monitored repo, Pachyderm automatically triggers the pipeline, running the transformation container against the new data.
- Data Lineage: Pachyderm's greatest strength — it maintains a complete, automatic audit trail linking every output file to the exact input data commit, code version (Docker image tag), and pipeline version that produced it. "This model.pkl was produced by pipeline v2.1 processing input_data commit #543."
- Enterprise Positioning: Pachyderm targets enterprise ML teams with strict audit and reproducibility requirements — financial services, healthcare, and government organizations that must prove exactly how AI outputs were generated for regulatory compliance.

Why Pachyderm Matters for AI

- Automatic Lineage: Every pipeline run is logged with complete provenance — without any manual tracking code, Pachyderm knows that output file X was produced by pipeline Y version Z processing input commit ABC. Audit any model artifact back to its source data instantly.
- Incremental Processing: Pachyderm pipelines only process new or changed data — when 1,000 new records arrive in the input repo, only those records are processed by downstream pipelines, not the full dataset. Efficient for continuously updated training data.
- Reproducibility: To reproduce any historical model, specify the data commit hash and the Docker image tag; Pachyderm reruns the exact pipeline configuration against the exact input data. Complete reproducibility without custom tracking code (see the rollback sketch after this list).
- Branch-Based Experimentation: Create a branch of the production data, apply experimental preprocessing, run model training — the experimental branch is isolated from production. Merge or discard based on results.
- Kubernetes-Native Scaling: Pipelines scale horizontally on Kubernetes — Pachyderm distributes input data across worker pods and merges outputs automatically, scaling preprocessing or training to available cluster capacity.
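
As a concrete illustration of the reproducibility point above, a branch head can be moved back to a historical commit, and any pipeline subscribed to that branch reruns against the old data. A minimal sketch, assuming abc123 is the ID of an earlier commit:

# Point main back at an earlier commit; subscribed pipelines rerun against it
pachctl create branch training-data@main --head abc123

# Confirm the branch head now points at the historical commit
pachctl inspect commit training-data@main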

Pachyderm Core Concepts

Repos and Commits (PFS):
# Create a data repository
pachctl create repo training-data

# Commit data files (like git commit)
pachctl start commit training-data@main
pachctl put file training-data@main:/dataset.parquet -f local_dataset.parquet
pachctl finish commit training-data@main

# List commits (version history)
pachctl list commit training-data@main

# Inspect specific commit
pachctl inspect commit training-data@abc123

# Branch for experimentation
pachctl create branch training-data@experiment-v2 --head main
pachctl start commit training-data@experiment-v2
# ... add modified data ...
pachctl finish commit training-data@experiment-v2
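
The same repo and commit operations are also available programmatically. A minimal sketch using the python-pachyderm client library, assuming its pre-Pachyderm-2.5 API (the newer pachyderm-sdk package uses different method names):

import python_pachyderm

# Connects to pachd using the active pachctl context by default
client = python_pachyderm.Client()

# Mirror the pachctl commands above: create a repo and commit a file
client.create_repo("training-data")
with client.commit("training-data", "main") as commit:
    with open("local_dataset.parquet", "rb") as f:
        client.put_file_bytes(commit, "/dataset.parquet", f.read())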

Pipelines (PPS):
# preprocess_pipeline.yaml
pipeline:
  name: preprocess
input:
  pfs:
    repo: training-data
    branch: main
    glob: "/*.parquet"  # Process each .parquet file as a separate datum
transform:
  image: mycompany/preprocessor:v2.1
  cmd: ["python", "/code/preprocess.py"]
  env:
    OUTPUT_DIR: /pfs/out
parallelism_spec:
  constant: 4  # 4 parallel workers

# Create pipeline
pachctl create pipeline -f preprocess_pipeline.yaml
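
The transform container needs no Pachyderm-specific code: Pachyderm mounts each input datum under /pfs/<repo> and turns everything written to /pfs/out into the pipeline's output commit. A minimal sketch of what /code/preprocess.py might look like (the dropna cleaning step is a hypothetical placeholder):

import os
from pathlib import Path

import pandas as pd

# Pachyderm mounts the input repo at /pfs/<repo> and collects /pfs/out into the output commit
INPUT_DIR = Path("/pfs/training-data")
OUTPUT_DIR = Path(os.environ.get("OUTPUT_DIR", "/pfs/out"))  # set via the pipeline spec's env

# Each worker sees only its assigned datums: one .parquet file per datum, per the glob above
for parquet_file in INPUT_DIR.glob("*.parquet"):
    df = pd.read_parquet(parquet_file)
    df = df.dropna()  # hypothetical cleaning step
    df.to_parquet(OUTPUT_DIR / parquet_file.name)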

Automatic Triggering:
# When new data is committed to training-data@main:
# → Pachyderm automatically triggers preprocess pipeline
# → preprocess output committed to preprocess repo
# → train pipeline (monitoring preprocess) automatically triggers
# → Complete lineage tracked end-to-end without manual intervention
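
The train pipeline mentioned above is just another pipeline spec whose input is the preprocess pipeline's output repo (Pachyderm creates an output repo named after each pipeline). A minimal sketch; the image and command are hypothetical:

# train_pipeline.yaml
pipeline:
  name: train
input:
  pfs:
    repo: preprocess  # output repo of the preprocess pipeline
    branch: main
    glob: "/"         # treat all preprocessed files as one datum
transform:
  image: mycompany/trainer:v1.0
  cmd: ["python", "/code/train.py"]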

Querying Lineage:
# What produced this output file?
pachctl inspect file model-output@main:/model.pkl

# Shows: created by pipeline "train" version 3, from input commit abc123 of "preprocess" repo
# Which was created from commit def456 of "training-data" repo
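
Once the producing commit is known, the artifact can be retrieved exactly as it existed in that commit for audit or redeployment; the commit ID below is hypothetical:

# Download the model exactly as it existed in a given commit
pachctl get file model-output@abc123:/model.pkl > model.pkl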

Pachyderm Deployment:
# Deploy on Kubernetes using Helm
helm repo add pachyderm https://helm.pachyderm.com
helm install pachyderm pachyderm/pachyderm --set deployTarget=AMAZON

# Connect to cluster
pachctl connect grpc://pachd:30650
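
A quick way to confirm the connection once deployed:

# Prints both the pachctl client version and the pachd server version
pachctl version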

Pachyderm vs Alternatives

| Platform | Data Versioning | Auto Pipelines | Lineage | K8s Native | Best For |
|----------|----------------|---------------|---------|-----------|---------|
| Pachyderm | Git-like (PFS) | Yes | Excellent | Yes | Auditable enterprise ML |
| DVC | Git-based | YAML pipelines | Via commits | No | Developer-friendly versioning |
| lakeFS | Git-like (S3) | No | Limited | No | Data lake branching |
| Dagster | Assets | Yes | Good | Optional | Asset-centric orchestration |
| Airflow | No | Yes | Limited | Optional | General workflow orchestration |

Pachyderm is the enterprise data lineage and pipeline platform for ML teams that require complete, automatic audit trails of every data transformation and model artifact. By combining Git-like data versioning with automatically triggered Kubernetes-native pipelines, it ensures that every output artifact, from preprocessed datasets to production models, can be traced back to its exact source data, code version, and pipeline configuration for regulatory compliance and reproducibility.
