DVC for LLM Pipelines
DVC is the canonical 'Git for data and models'. In LLM workflows it is used to version training datasets, fine-tuned checkpoints, evaluation sets, and the exact pipeline that produced them. You describe stages (download data → preprocess → train → eval) in `dvc.yaml`, DVC caches artifacts in S3 / GCS / Azure / SSH remotes, and `dvc repro` re-runs only the stages whose inputs changed, which is critical for expensive fine-tuning pipelines.
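A minimal `dvc.yaml` for such a chain might look like the sketch below (stage names, script paths, and directory layout are illustrative, not prescribed by DVC):

```yaml
stages:
  prepare:
    cmd: python prepare.py      # hypothetical data-prep script
    deps:
      - data/raw
      - prepare.py
    outs:
      - data/clean
  train:
    cmd: python train.py        # hypothetical training script
    deps:
      - data/clean
      - train.py
    outs:
      - model/
```

`dvc repro` hashes each stage's `deps` and reruns a stage only when a hash differs from the one recorded in `dvc.lock`.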
Framework facts
- Category: orchestration
- Language: Python
- License: Apache-2.0
- Repository: https://github.com/iterative/dvc
Install
```shell
pip install 'dvc[s3]'
```

Quickstart

```shell
dvc init
dvc stage add -n prepare -d data/raw -o data/clean python prepare.py
dvc stage add -n train -d data/clean -o model/ python train.py
dvc repro   # reruns only changed stages
```

Alternatives
- Pachyderm — data-centric pipelines
- lakeFS — Git-like versioning over object storage
- MLflow — experiment tracking
Frequently asked questions
Why DVC for LLM fine-tuning?
Fine-tuning runs are expensive and usually depend on a chain of data prep + tokenisation + training + eval. DVC caches each stage's outputs keyed on inputs, so re-running after a small change only re-does what's necessary, saving hours of GPU time.
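The caching idea can be sketched in a few lines of Python. This is a toy illustration of hash-keyed stage skipping, not DVC's actual implementation; the `repro` function and `.toy_cache.json` file are invented here for the example:

```python
import hashlib
import json
from pathlib import Path

CACHE = Path(".toy_cache.json")  # stands in for DVC's dvc.lock


def fingerprint(paths):
    """Combine the contents of all dependency files into one digest."""
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(Path(p).read_bytes())
    return h.hexdigest()


def repro(stage, deps, run):
    """Run `run` only if the deps' combined hash changed since last time."""
    lock = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    digest = fingerprint(deps)
    if lock.get(stage) == digest:
        return "cached"          # inputs unchanged: skip the stage
    run()                        # inputs changed: redo the work
    lock[stage] = digest
    CACHE.write_text(json.dumps(lock))
    return "ran"
```

Calling `repro` twice with unchanged inputs skips the second run; touching any dependency triggers a rerun, which is the property that saves GPU hours on a long fine-tuning chain.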
DVC or MLflow?
They're complementary. DVC handles versioning + pipelines, MLflow / W&B handle experiment tracking + model registry. Most teams use both.