Capability · Framework — orchestration

DVC for LLM Pipelines

DVC (Data Version Control) is often described as 'Git for data and models'. In LLM workflows it versions training datasets, fine-tuned checkpoints, evaluation sets, and the exact pipeline that produced them. You describe stages (download data → preprocess → train → eval) in `dvc.yaml`; DVC caches artifacts in S3 / GCS / Azure / SSH remotes, and `dvc repro` re-runs only the stages whose inputs changed, which is critical for expensive fine-tuning pipelines.
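Remote caching is configured once per repository. A minimal sketch (the bucket name and remote name here are placeholders, not anything DVC provides by default):

```
dvc remote add -d storage s3://my-bucket/dvc-cache   # hypothetical bucket; -d makes it the default remote
dvc push                                             # upload cached artifacts to the remote
```

Teammates then run `dvc pull` to fetch the exact artifacts referenced by the current Git commit.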

Framework facts

Category
orchestration
Language
Python
License
Apache-2.0
Repository
https://github.com/iterative/dvc

Install

pip install 'dvc[s3]'

Quickstart

dvc init
dvc stage add -n prepare -d data/raw -o data/clean python prepare.py
dvc stage add -n train  -d data/clean -o model/ python train.py
dvc repro  # reruns only changed stages
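The two `dvc stage add` commands above write a `dvc.yaml` roughly like the following (a sketch of the generated file, not its byte-exact output):

```
stages:
  prepare:
    cmd: python prepare.py
    deps:
      - data/raw
    outs:
      - data/clean
  train:
    cmd: python train.py
    deps:
      - data/clean
    outs:
      - model/
```

Because `train` declares `data/clean` as a dependency, editing `prepare.py`'s inputs invalidates both stages, while a change that only touches `train.py`'s inputs re-runs training alone.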

Alternatives

  • Pachyderm — data-centric pipelines
  • lakeFS — Git-like versioning for object storage
  • MLflow — experiment tracking

Frequently asked questions

Why DVC for LLM fine-tuning?

Fine-tuning runs are expensive and usually depend on a chain of data prep + tokenisation + training + eval. DVC caches each stage's outputs keyed on inputs, so re-running after a small change only re-does what's necessary, saving hours of GPU time.
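The skip logic can be sketched in plain Python. This is a toy model of "rerun only when inputs changed", not DVC's actual implementation (DVC hashes file contents and stores state in `dvc.lock`); the stage names and input dicts below are hypothetical:

```python
import hashlib
import json

def fingerprint(inputs: dict) -> str:
    """Hash a stage's inputs (data versions, params) into one cache key."""
    blob = json.dumps(inputs, sort_keys=True).encode()
    return hashlib.md5(blob).hexdigest()

class StageCache:
    """Toy model of dvc repro's skip decision (not DVC's real API)."""

    def __init__(self):
        self._seen = {}  # stage name -> fingerprint of the last successful run

    def needs_rerun(self, stage: str, inputs: dict) -> bool:
        key = fingerprint(inputs)
        if self._seen.get(stage) == key:
            return False         # cache hit: outputs are up to date, skip
        self._seen[stage] = key  # record the new fingerprint
        return True

cache = StageCache()
assert cache.needs_rerun("train", {"data": "v1", "lr": 3e-4}) is True   # first run
assert cache.needs_rerun("train", {"data": "v1", "lr": 3e-4}) is False  # nothing changed
assert cache.needs_rerun("train", {"data": "v2", "lr": 3e-4}) is True   # new data version
```

After an expensive training stage runs once, only a change to its declared dependencies (or the command itself) triggers a re-run.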

DVC or MLflow?

They're complementary. DVC handles versioning + pipelines, MLflow / W&B handle experiment tracking + model registry. Most teams use both.

Sources

  1. DVC docs — accessed 2026-04-20
  2. DVC GitHub — accessed 2026-04-20