Pachyderm

Pachyderm is an enterprise data versioning and pipeline orchestration platform for Kubernetes. It combines Git-like data version control with automatically triggered, containerized pipelines, and it records which data commit, code version, and pipeline version produced each output. That lineage makes ML workflows reproducible and auditable at scale.

What Is Pachyderm?

Why Pachyderm Matters for AI

Pachyderm Core Concepts

Repos and Commits (PFS):

# Create a data repository
pachctl create repo training-data

# Commit data files (like git commit)
pachctl start commit training-data@main
pachctl put file training-data@main:/dataset.parquet -f local_dataset.parquet
pachctl finish commit training-data@main

# List commits (version history)
pachctl list commit training-data@main

# Inspect specific commit
pachctl inspect commit training-data@abc123

# Branch for experimentation
pachctl create branch training-data@experiment-v2 --head main
pachctl start commit training-data@experiment-v2
# ... add modified data ...
pachctl finish commit training-data@experiment-v2
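To make the Git-like model concrete, here is a minimal in-memory sketch of commits and branches. It is a toy illustration only, not how Pachyderm stores data: the `ToyRepo` class, its methods, and the hashing scheme are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot: (path, content hash) pairs plus a parent pointer."""
    commit_id: str
    parent: Optional[str]
    files: tuple  # sorted (path, sha256-of-content) pairs


class ToyRepo:
    """Hypothetical toy model of a versioned repo; NOT Pachyderm's implementation."""

    def __init__(self):
        self.commits = {}   # commit_id -> Commit
        self.branches = {}  # branch name -> head commit_id

    def commit(self, branch, new_files):
        """Add or overwrite files on a branch and return the new commit ID."""
        parent = self.branches.get(branch)
        files = dict(self.commits[parent].files) if parent else {}
        for path, data in new_files.items():
            files[path] = hashlib.sha256(data).hexdigest()
        snapshot = tuple(sorted(files.items()))
        commit_id = hashlib.sha256(repr((parent, snapshot)).encode()).hexdigest()[:12]
        self.commits[commit_id] = Commit(commit_id, parent, snapshot)
        self.branches[branch] = commit_id
        return commit_id

    def create_branch(self, name, head):
        """Point a new branch at the current head of an existing branch."""
        self.branches[name] = self.branches[head]

    def history(self, branch):
        """Return commit IDs from the branch head back to the first commit."""
        out, cur = [], self.branches.get(branch)
        while cur is not None:
            out.append(cur)
            cur = self.commits[cur].parent
        return out


# Mirror the pachctl session above: commit to main, branch, commit to the branch.
repo = ToyRepo()
first = repo.commit("main", {"/dataset.parquet": b"v1"})
repo.create_branch("experiment-v2", head="main")
second = repo.commit("experiment-v2", {"/dataset.parquet": b"v2"})
```

In this sketch the experiment branch shares history with main up to the branch point, so `repo.history("main")` still ends at the first commit while `repo.history("experiment-v2")` lists both.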

Pipelines (PPS):

# preprocess_pipeline.yaml
pipeline:
  name: preprocess
input:
  pfs:
    repo: training-data
    branch: main
    glob: "/*.parquet"  # Process each .parquet file as a separate datum
transform:
  image: mycompany/preprocessor:v2.1
  cmd: ["python", "/code/preprocess.py"]
  env:
    OUTPUT_DIR: /pfs/out
parallelism_spec:
  constant: 4  # 4 parallel workers

# Create pipeline
pachctl create pipeline -f preprocess_pipeline.yaml
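The spec runs `/code/preprocess.py` inside the container, but the article does not show that script. A minimal sketch of what it might look like follows, assuming the input repo is mounted at `/pfs/training-data` and output is written to `/pfs/out` (Pachyderm's standard mount points). The `transform` function is a hypothetical placeholder that copies each file unchanged, and `process_datum` is a name chosen for this example.

```python
import os
import shutil
from pathlib import Path

# Pachyderm mounts the input repo at /pfs/<repo name>; files written to /pfs/out
# become the pipeline's output commit. OUTPUT_DIR comes from the spec's env block.
INPUT_DIR = Path("/pfs/training-data")
OUTPUT_DIR = Path(os.environ.get("OUTPUT_DIR", "/pfs/out"))


def transform(src: Path, dst: Path) -> None:
    """Hypothetical placeholder: copies the file unchanged. Real logic goes here."""
    shutil.copyfile(src, dst)


def process_datum(input_dir: Path, output_dir: Path) -> list:
    """Process every .parquet file visible to this worker; return output names."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(input_dir.glob("*.parquet")):
        transform(src, output_dir / src.name)
        written.append(src.name)
    return written


# Outside a Pachyderm worker the input mount does not exist, so nothing runs.
if __name__ == "__main__" and INPUT_DIR.is_dir():
    process_datum(INPUT_DIR, OUTPUT_DIR)
```

Because the glob `/*.parquet` makes each file its own datum, a worker running this script would typically see a single file under the input mount per invocation. Keeping the directory paths as function arguments makes the logic easy to exercise locally against temporary directories.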

Automatic Triggering:

When new data is committed to training-data@main:
→ Pachyderm automatically triggers the preprocess pipeline
→ The preprocess output is committed to the preprocess repo
→ The train pipeline, which takes the preprocess repo as input, triggers automatically
→ Lineage is tracked end-to-end without manual intervention
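The cascade above can be sketched as a walk over a small dependency graph. This is an illustrative model only, not Pachyderm's scheduler: the `PIPELINE_INPUTS` mapping and the `triggered_by` function are assumptions for this example, with each pipeline writing to an output repo named after itself, as in the preprocess case.

```python
from collections import deque

# Hypothetical DAG: pipeline name -> the repo it reads from.
# Each pipeline writes to an output repo that shares its name.
PIPELINE_INPUTS = {
    "preprocess": "training-data",
    "train": "preprocess",
}


def triggered_by(repo: str, pipeline_inputs: dict) -> list:
    """Return the pipelines that run, in order, after a commit to `repo`."""
    order, queue = [], deque([repo])
    while queue:
        current = queue.popleft()
        for pipeline, input_repo in pipeline_inputs.items():
            if input_repo == current:
                order.append(pipeline)
                queue.append(pipeline)  # its output repo may feed more pipelines
    return order


print(triggered_by("training-data", PIPELINE_INPUTS))  # ['preprocess', 'train']
```

A commit to the preprocess repo alone would trigger only `train` in this model, which matches the idea that each pipeline reacts to changes in its own input.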

Querying Lineage:

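# What produced this output file?
pachctl inspect file model-output@main:/model.pkl
# Summary of the lineage recorded for this file (paraphrased; the exact fields,
# and the command that reports provenance, vary by Pachyderm version):
#   created by pipeline "train" version 3, from input commit abc123 of the "preprocess" repo,
#   which was created from commit def456 of the "training-data" repo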
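Conceptually, a lineage query follows provenance links from an output back to its sources. The sketch below models that walk with a plain dictionary. The `PROVENANCE` records and the `trace` function are hypothetical and reuse the example commit IDs above; they are not Pachyderm's data model or API, and the sketch assumes the provenance graph is acyclic with one upstream input per commit.

```python
# Hypothetical provenance records: (repo, commit) -> the (repo, commit) it was built from.
PROVENANCE = {
    ("model-output", "main"): ("preprocess", "abc123"),
    ("preprocess", "abc123"): ("training-data", "def456"),
}


def trace(repo: str, commit: str, provenance: dict) -> list:
    """Follow provenance links from an output back to its original source."""
    chain = [(repo, commit)]
    while chain[-1] in provenance:
        chain.append(provenance[chain[-1]])
    return chain


lineage = trace("model-output", "main", PROVENANCE)
for step_repo, step_commit in lineage:
    print(f"{step_repo}@{step_commit}")
```

The last entry in `lineage` is the source data commit, which is the piece an auditor usually needs to reproduce a model.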

Pachyderm Deployment:

# Deploy on Kubernetes using Helm
helm repo add pachyderm https://helm.pachyderm.com
helm install pachyderm pachyderm/pachyderm --set deployTarget=AMAZON

# Connect to cluster
pachctl connect grpc://pachd:30650

Pachyderm vs Alternatives

Platform | Data Versioning | Auto Pipelines | Lineage | K8s Native | Best For
--- | --- | --- | --- | --- | ---
Pachyderm | Git-like (PFS) | Yes | Excellent | Yes | Auditable enterprise ML
DVC | Git-based | YAML pipelines | Via commits | No | Developer-friendly versioning
LakeFS | Git-like (S3) | No | Limited | No | Data lake branching
Dagster | Assets | Yes | Good | Optional | Asset-centric orchestration
Airflow | No | Yes | Limited | Optional | General workflow orchestration

For ML teams that need a complete, automatic audit trail of every data transformation and model artifact, Pachyderm combines Git-like data versioning with automatically triggered, Kubernetes-native pipelines. Every output artifact, from preprocessed datasets to production models, can be traced back to its exact source data, code version, and pipeline configuration, which supports regulatory compliance and reproducibility.
