DVC (Data Version Control) is the Git-based data versioning system that tracks large files, datasets, and ML models using Git metadata while storing actual data in cloud storage — enabling ML teams to version multi-gigabyte training datasets and model weights alongside code in Git, reproduce any past experiment by checking out a specific commit, and build language-agnostic data pipelines defined in YAML that only rerun stages when inputs change.
What Is DVC?
- Definition: An open-source CLI tool (2017) that extends Git to handle large files by storing metadata pointers (.dvc files) in Git while pushing actual data (gigabytes to terabytes) to a configured remote storage (S3, GCS, Azure Blob, SFTP) — enabling data scientists to use familiar Git workflows (branches, commits, pull requests) for managing dataset and model versions.
- Core Mechanism: When you run dvc add dataset.parquet, DVC creates dataset.parquet.dvc (a small YAML file with the file's hash and size) and adds dataset.parquet to .gitignore. You commit the .dvc file to Git and run dvc push to upload the actual data to the DVC remote. Teammates then run dvc pull to download the exact data version.
- Pipeline Tracking: DVC can define ML pipelines as a dvc.yaml file — each stage has defined inputs (deps), outputs (outs), and a command to run. DVC detects when a stage's inputs change and only reruns necessary stages, like a Makefile for ML pipelines.
- Git-Native: DVC works alongside Git without replacing it — the same branch model, the same commit history, the same pull request workflow. Switch to a Git branch → dvc pull → get the dataset version associated with that branch automatically.
- Storage Agnostic: A DVC remote can be almost any storage backend: S3, GCS, Azure Blob, an SSH server, a local network share, or even Google Drive. Organizations can use their existing data infrastructure as the DVC remote.
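The pointer mechanism described in the list above can be illustrated with a minimal Python sketch. This is not DVC's implementation: the helper names md5_of_file and write_pointer are hypothetical, and the pointer layout only approximates a real .dvc file, which may carry additional fields depending on the DVC version. The sketch shows why the file committed to Git stays tiny regardless of how large the data is.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small YAML pointer next to the data file and return its path.

    Hypothetical helper: the layout approximates a .dvc file
    (content hash, size in bytes, relative path).
    """
    pointer_path = data_path.with_name(data_path.name + ".dvc")
    pointer_path.write_text(
        "outs:\n"
        f"- md5: {md5_of_file(data_path)}\n"
        f"  size: {data_path.stat().st_size}\n"
        f"  path: {data_path.name}\n"
    )
    return pointer_path
```

Calling write_pointer(Path("dataset.parquet")) would produce dataset.parquet.dvc, a file of a few lines that identifies the data by content hash. Only that pointer needs to live in Git.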
Why DVC Matters for AI
- Dataset Reproducibility: Git commits encode code version; DVC .dvc files encode data version. Together they fully specify an experiment — checkout commit + dvc pull restores the exact code AND data used for that training run.
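- Large File Git Problem: Git handles large binary files poorly: every version stays in repository history, so clones and fetches grow with each update, and hosting services such as GitHub reject individual files over 100 MB. Model checkpoints (1-70GB), training datasets (10GB-10TB), and embedding matrices break standard Git workflows. DVC solves this without abandoning Git.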
- Collaboration: Teammates pull code with git pull and data with dvc pull using the same workflow — no manual S3 bucket navigation, no Confluence pages documenting "the correct dataset path," no naming conventions like dataset_v3_final_FINAL2.csv.
- Selective Downloads: dvc pull specific_file.dvc only downloads that file — avoid downloading a 1TB dataset when you only need one preprocessed split.
- CI/CD Integration: DVC commands work in CI/CD pipelines — GitHub Actions can run dvc repro to rebuild the model when data or code changes, automating retraining on dataset updates.
DVC Core Concepts
Tracking Data Files:

    # Track a large dataset
    dvc add data/training_dataset.parquet
    git add data/training_dataset.parquet.dvc data/.gitignore
    git commit -m "Add training dataset v2"
    dvc push  # Upload actual data to S3/GCS remote
Teammate reproduces:

    git clone repo_url
    cd repo_dir  # Directory created by git clone; dvc pull must run inside it
    dvc pull     # Downloads data from remote
Configuring Remote Storage:

    dvc remote add -d myremote s3://my-bucket/dvc-storage
    dvc remote modify myremote region us-east-1
    git add .dvc/config && git commit -m "Configure S3 DVC remote"
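To make the role of the remote concrete, the following Python sketch maps a content hash to the object URL where the corresponding bytes could be stored. It rests on an assumption: the files/md5 prefix and the two-character fan-out directory are modeled on recent DVC releases, and the exact layout is an implementation detail that differs between versions. The function name remote_object_url is hypothetical.

```python
def remote_object_url(remote_url: str, md5: str) -> str:
    """Return the URL under which content with this MD5 hash would be stored.

    Assumed layout (not guaranteed across DVC versions):
        <remote>/files/md5/<first 2 hex chars>/<remaining 30 hex chars>
    """
    if len(md5) != 32 or any(c not in "0123456789abcdef" for c in md5):
        raise ValueError("expected a 32-character lowercase hex MD5 digest")
    return f"{remote_url.rstrip('/')}/files/md5/{md5[:2]}/{md5[2:]}"
```

Because the object name is derived from the content rather than from the file name, two dataset versions with identical bytes would share one object on the remote, and renaming a file would not require re-uploading it.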
DVC Pipelines (dvc.yaml):

    stages:
      preprocess:
        cmd: python preprocess.py --input data/raw.csv --output data/processed.parquet
        deps:
          - data/raw.csv
          - preprocess.py
        outs:
          - data/processed.parquet
      train:
        cmd: python train.py --data data/processed.parquet --output models/model.pkl
        deps:
          - data/processed.parquet
          - train.py
          - params.yaml
        outs:
          - models/model.pkl
      evaluate:
        cmd: python evaluate.py --model models/model.pkl --output metrics/scores.json
        deps:
          - models/model.pkl
          - evaluate.py
          - test_data/
        # The metrics file is written by evaluate.py, so it is declared on this stage
        metrics:
          - metrics/scores.json
Running Pipelines:

    dvc repro          # Rerun only stages with changed inputs
    dvc repro --force  # Force rerun all stages
    dvc dag            # Visualize pipeline as ASCII DAG
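The "rerun only what changed" behavior can be sketched in Python as a walk over the stage graph. This is a simplified model, not DVC's algorithm: the STAGES dictionary mirrors only the deps and outs of the dvc.yaml above (metrics omitted), the function name stages_to_rerun is hypothetical, and staleness is propagated conservatively. Real DVC compares content hashes recorded in dvc.lock, so a rerun stage that produces byte-identical output would not force downstream stages to run.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Dependencies and outputs per stage, mirroring the dvc.yaml above.
STAGES = {
    "preprocess": {
        "deps": ["data/raw.csv", "preprocess.py"],
        "outs": ["data/processed.parquet"],
    },
    "train": {
        "deps": ["data/processed.parquet", "train.py", "params.yaml"],
        "outs": ["models/model.pkl"],
    },
    "evaluate": {
        "deps": ["models/model.pkl", "test_data/"],
        "outs": [],
    },
}


def stages_to_rerun(stages, changed_paths):
    """Return the stages that must rerun, in execution order."""
    # Which stage produces each output path.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # Map each stage to the upstream stages it depends on.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    dirty = set(changed_paths)
    stale = []
    for name in TopologicalSorter(graph).static_order():
        if dirty.intersection(stages[name]["deps"]):
            stale.append(name)
            # Conservative assumption: a rerun changes every output.
            dirty.update(stages[name]["outs"])
    return stale
```

Under this model, editing train.py marks train and evaluate as stale and leaves preprocess untouched, while editing data/raw.csv invalidates all three stages.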
Experiment Tracking with DVC:

    dvc exp run --set-param train.learning_rate=0.001  # Run with modified param
    dvc exp run --set-param train.learning_rate=0.01   # Run variant

    dvc exp show  # Compare all experiments in table
    dvc exp diff  # Diff metrics between experiments
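Conceptually, comparing two experiments comes down to diffing their metrics files. The Python sketch below shows that idea on plain dictionaries. It is an illustration only: the function name diff_metrics and the metric names in the usage note are hypothetical, and it does not reproduce the output format of dvc exp diff.

```python
from typing import Dict, Optional


def diff_metrics(
    baseline: Dict[str, float], candidate: Dict[str, float]
) -> Dict[str, Dict[str, Optional[float]]]:
    """Compare two flat metric dictionaries key by key.

    A metric present in only one run gets a change of None.
    """
    rows = {}
    for name in sorted(baseline.keys() | candidate.keys()):
        old = baseline.get(name)
        new = candidate.get(name)
        change = new - old if old is not None and new is not None else None
        rows[name] = {"baseline": old, "candidate": new, "change": change}
    return rows
```

For example, diff_metrics({"accuracy": 0.5}, {"accuracy": 0.75}) reports a change of 0.25 for accuracy. In practice the two dictionaries would be loaded from the metrics/scores.json produced by each run.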
Git Workflow Integration:

    git checkout feature/new-dataset
    dvc pull  # Automatically gets data for this branch

    git checkout main
    dvc pull  # Switches to main branch data version
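The two-step pattern above (switch the Git revision, then sync the data) can be wrapped in a small Python helper for scripts or notebooks. This is a sketch under stated assumptions: restore_revision is a hypothetical name, the git and dvc executables are assumed to be on PATH, and the run parameter exists only so the command runner can be swapped out for testing.

```python
import subprocess
from typing import Callable, List


def restore_revision(
    rev: str, run: Callable[..., object] = subprocess.run
) -> List[List[str]]:
    """Check out a Git revision, then pull the data versions it points to.

    Returns the commands that were executed, in order.
    """
    commands = [
        ["git", "checkout", rev],  # restores code and .dvc pointer files
        ["dvc", "pull"],           # downloads the data those pointers reference
    ]
    for command in commands:
        # check=True makes subprocess.run raise if a command exits non-zero,
        # so a failed checkout never leads to pulling the wrong data.
        run(command, check=True)
    return commands
```

Calling restore_revision("main") would run both commands in the current repository. When the required data is already in the local DVC cache, dvc checkout would be enough in place of dvc pull.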
DVC vs Alternatives
| Tool | Git Integration | Pipeline | Storage | UI | Best For |
|---|---|---|---|---|---|
| DVC | Native | Yes (dvc.yaml) | Any | CLI/VSCode | Git-native teams |
| lakeFS | Git-like (separate) | No | S3/GCS/Azure | Web UI | Data lake branching |
| Pachyderm | No (own VCS) | Yes | Object storage (S3/GCS/Azure) | Web UI | K8s-native versioning |
| MLflow Artifacts | No | No | Any | MLflow UI | Linked to experiments |
| W&B Artifacts | No | No | W&B cloud | W&B UI | Research teams |
DVC is the Git extension for ML that brings version control discipline to datasets and model artifacts — by enabling the same branch-commit-merge workflow that software engineers use for code to be applied to multi-gigabyte training data and model weights, DVC makes every ML experiment fully reproducible with a simple git checkout plus dvc pull.