DVC (Data Version Control)

DVC (Data Version Control) is a Git-based data versioning system that tracks large files, datasets, and ML models through small metadata files committed to Git, while the actual data lives in remote storage such as S3 or GCS. ML teams can version multi-gigabyte training datasets and model weights alongside code, reproduce a past experiment by checking out a specific commit and pulling its data, and build language-agnostic data pipelines defined in YAML that rerun a stage only when its inputs change.

What Is DVC?

Why DVC Matters for AI

DVC Core Concepts

Tracking Data Files:

    # Track a large dataset
    dvc add data/training_dataset.parquet
    git add data/training_dataset.parquet.dvc data/.gitignore
    git commit -m "Add training dataset v2"
    dvc push  # Upload actual data to S3/GCS remote

    # Teammate reproduces:
    git clone repo_url
    dvc pull  # Downloads data from remote
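To make the pointer-file idea concrete, the sketch below shows in plain Python how a small metadata record could be derived from a large file's content. It is an illustration only, not DVC's implementation: the `make_pointer` helper and the exact shape of the record are assumptions, although real `.dvc` files do record an MD5 hash, a size, and a path.

```python
import hashlib
import os
import tempfile


def make_pointer(path, chunk_size=1024 * 1024):
    """Return a small metadata record describing a (possibly large) file."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so a multi-gigabyte file is never loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }


def demo_pointer():
    # Hypothetical stand-in for data/training_dataset.parquet
    with tempfile.TemporaryDirectory() as tmp:
        data_file = os.path.join(tmp, "training_dataset.parquet")
        with open(data_file, "wb") as f:
            f.write(b"hello")
        return make_pointer(data_file)


if __name__ == "__main__":
    print(demo_pointer())
```

Because the record depends only on the file's bytes, committing it to Git is enough to identify exactly which version of the data a commit refers to.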

Configuring Remote Storage:

    dvc remote add -d myremote s3://my-bucket/dvc-storage
    dvc remote modify myremote region us-east-1
    git add .dvc/config && git commit -m "Configure S3 DVC remote"

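DVC Pipelines (dvc.yaml):

    # The deps/outs entries below are inferred from each stage's cmd line.
    stages:
      preprocess:
        cmd: python preprocess.py --input data/raw.csv --output data/processed.parquet
        deps:
          - preprocess.py
          - data/raw.csv
        outs:
          - data/processed.parquet
      train:
        cmd: python train.py --data data/processed.parquet --output models/model.pkl
        deps:
          - train.py
          - data/processed.parquet
        outs:
          - models/model.pkl
        # metrics: the file list for this key is missing from the source
      evaluate:
        cmd: python evaluate.py --model models/model.pkl --output metrics/scores.json
        deps:
          - evaluate.py
          - models/model.pkl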
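DVC reads metrics from structured files such as JSON or YAML, so a stage script only needs to write its scores in one of those formats. The sketch below shows how a script like evaluate.py might do that. The `write_metrics` helper, the metric names, and the placeholder values are all hypothetical, and the actual scoring logic is left out.

```python
import argparse
import json
import os


def write_metrics(output_path, metrics):
    """Write a flat dict of metric names and numbers as JSON."""
    parent = os.path.dirname(output_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(output_path, "w") as f:
        json.dump(metrics, f, indent=2, sort_keys=True)


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)
    # Placeholder values: a real script would load args.model and
    # score it on held-out data before writing the results.
    metrics = {"accuracy": 0.0, "f1": 0.0}
    write_metrics(args.output, metrics)
    return metrics


if __name__ == "__main__":
    main()
```

Invoked as in the evaluate stage, this would create metrics/scores.json, including the metrics directory if it does not yet exist.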

Running Pipelines:

    dvc repro          # Rerun only stages with changed inputs
    dvc repro --force  # Force rerun all stages
    dvc dag            # Visualize pipeline as ASCII DAG
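The "rerun only stages with changed inputs" behavior comes down to comparing the current hash of each dependency with the hash recorded after the last successful run (DVC keeps these in dvc.lock). The sketch below illustrates that comparison in a simplified form. The `stage_is_stale` helper and the in-memory `recorded` dict are assumptions for illustration, not DVC's actual code.

```python
import hashlib
import os
import tempfile


def file_md5(path):
    """Hash a file's content (small files only; fine for a demo)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def stage_is_stale(deps, recorded):
    """Return True if any dependency's hash differs from the recorded one."""
    return any(recorded.get(dep) != file_md5(dep) for dep in deps)


def demo_repro():
    with tempfile.TemporaryDirectory() as tmp:
        raw = os.path.join(tmp, "raw.csv")
        with open(raw, "w") as f:
            f.write("a,b\n1,2\n")

        recorded = {}  # nothing recorded yet, so the stage must run
        first = stage_is_stale([raw], recorded)
        recorded[raw] = file_md5(raw)  # record hashes after a successful run

        second = stage_is_stale([raw], recorded)  # input unchanged: skip

        with open(raw, "a") as f:
            f.write("3,4\n")
        third = stage_is_stale([raw], recorded)  # input edited: rerun
        return first, second, third


if __name__ == "__main__":
    print(demo_repro())
```

A forced rerun corresponds to ignoring this check and running every stage regardless of the recorded hashes.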

Experiment Tracking with DVC:

    dvc exp run --set-param train.learning_rate=0.001  # Run with modified param
    dvc exp run --set-param train.learning_rate=0.01   # Run variant

    dvc exp show  # Compare all experiments in table
    dvc exp diff  # Diff metrics between experiments
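Conceptually, diffing two experiments means lining up their metric files and reporting what changed. The sketch below shows that idea for two flat metric dictionaries. The `diff_metrics` function and the sample values are hypothetical and do not reproduce DVC's own output format.

```python
def diff_metrics(baseline, candidate):
    """Return {name: (old, new, change)} for metrics whose values differ."""
    changes = {}
    for name in sorted(set(baseline) | set(candidate)):
        old, new = baseline.get(name), candidate.get(name)
        if old == new:
            continue
        # A change is only defined when the metric exists in both runs.
        delta = new - old if old is not None and new is not None else None
        changes[name] = (old, new, delta)
    return changes


if __name__ == "__main__":
    run_a = {"accuracy": 0.5, "loss": 0.25}  # made-up numbers
    run_b = {"accuracy": 0.75, "loss": 0.25, "f1": 0.5}
    for name, (old, new, delta) in diff_metrics(run_a, run_b).items():
        print(name, old, new, delta)
```

Metrics that are identical in both runs are omitted, and a metric present in only one run is reported with no numeric change.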

Git Workflow Integration:

    git checkout feature/new-dataset
    dvc pull  # Automatically gets data for this branch

    git checkout main
    dvc pull  # Switches to main branch data version

DVC vs Alternatives

| Tool | Git Integration | Pipeline | Storage | UI | Best For |
|---|---|---|---|---|---|
| DVC | Native | Yes (dvc.yaml) | Any | CLI/VS Code | Git-native teams |
| LakeFS | Git-like (separate) | No | S3/GCS/Azure | Web UI | Data lake branching |
| Pachyderm | No (own VCS) | Yes | Kubernetes PVC | Web UI | K8s-native versioning |
| MLflow Artifacts | No | No | Any | MLflow UI | Linked to experiments |
| W&B Artifacts | No | No | W&B cloud | W&B UI | Research teams |

DVC is a Git extension for ML that brings version control discipline to datasets and model artifacts. It applies the branch-commit-merge workflow that software engineers use for code to multi-gigabyte training data and model weights, so a tracked experiment can be reproduced with git checkout followed by dvc pull.
