Capability · Framework — fine-tuning

DeepSpeed

DeepSpeed is the library Microsoft Research built to train Turing-NLG, Megatron-Turing, and eventually Phi — a stack of memory and compute optimizations for very large models. Its signature feature is ZeRO (Zero Redundancy Optimizer), which shards optimizer state, gradients, and parameters across GPUs, fitting models 10-100x larger than a single device could hold. DeepSpeed-Inference adds optimized transformer kernels and tensor parallelism for low-latency serving.
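The core idea behind ZeRO can be illustrated without DeepSpeed at all: instead of every data-parallel rank replicating the full optimizer state, each rank owns only a 1/world_size slice, so per-rank memory drops roughly N-fold. A toy sketch in plain Python (the `shard` helper and the sizes are illustrative, not DeepSpeed API):

```python
# Toy illustration of ZeRO-style sharding (not DeepSpeed API).
# Plain data parallelism replicates all optimizer state on every rank;
# ZeRO gives each rank a contiguous 1/world_size slice instead.

def shard(flat_state, rank, world_size):
    """Return the contiguous slice of `flat_state` owned by `rank`."""
    n = len(flat_state)
    per_rank = (n + world_size - 1) // world_size  # ceiling division
    return flat_state[rank * per_rank:(rank + 1) * per_rank]

optimizer_state = list(range(1000))  # stand-in for per-parameter state
world_size = 8

shards = [shard(optimizer_state, r, world_size) for r in range(world_size)]

# Every element is owned by exactly one rank, and per-rank memory
# is ~1/8 of the replicated baseline.
assert sum(len(s) for s in shards) == len(optimizer_state)
assert max(len(s) for s in shards) == 125
```

In the real implementation the slices live on different GPUs and are gathered on demand via collectives, but the ownership scheme is the same.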

Framework facts

Category
fine-tuning
Language
Python / CUDA
License
Apache-2.0
Repository
https://github.com/deepspeedai/DeepSpeed

Install

pip install deepspeed

Quickstart

ds_config.json — ZeRO-3 with CPU optimizer offload:

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"}
  },
  "bf16": {"enabled": true}
}

Launch on 8 GPUs:

deepspeed --num_gpus 8 train.py --deepspeed ds_config.json
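When sweeping ZeRO stages or offload settings, it can be convenient to generate the config file programmatically rather than hand-edit JSON. A minimal stdlib-only sketch mirroring the quickstart config above (the `make_ds_config` helper and its defaults are assumptions for illustration, not a DeepSpeed API):

```python
import json

def make_ds_config(stage=3, offload=True, bf16=True):
    """Build a minimal DeepSpeed config dict (sketch, not exhaustive)."""
    zero = {"stage": stage}
    if offload and stage >= 2:
        # offload_optimizer moves optimizer state to host RAM
        zero["offload_optimizer"] = {"device": "cpu"}
    return {"zero_optimization": zero, "bf16": {"enabled": bf16}}

cfg = make_ds_config(stage=3)
with open("ds_config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```

The resulting file is what the `--deepspeed ds_config.json` flag in the launch command points at.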

Alternatives

  • FSDP (PyTorch)
  • Megatron-LM
  • ColossalAI

Frequently asked questions

ZeRO-3 vs FSDP?

Both shard optimizer states, gradients, and parameters. FSDP is PyTorch-native and better integrated with the core ecosystem; ZeRO-3 offers more offloading machinery (CPU and NVMe offload via ZeRO-Infinity, plus sequence parallelism) and remains the stronger choice for extreme-scale, memory-constrained training.
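Back-of-envelope arithmetic shows what each ZeRO stage buys. Assuming bf16 parameters and gradients (2 bytes each) and Adam state of 12 bytes per parameter (fp32 master weights plus two fp32 moments) — the usual mixed-precision accounting, not measured values — a 7B-parameter model needs roughly 112 GB before activations:

```python
GB = 1e9

def per_gpu_gb(n_params, world_size, stage):
    """Approximate per-GPU memory (GB) for weights + grads + Adam state.

    Assumes bf16 params/grads (2 B each) and 12 B/param optimizer state
    (fp32 master weights + two fp32 moments). Activations excluded.
    """
    p, g, o = 2 * n_params, 2 * n_params, 12 * n_params
    if stage >= 1:
        o /= world_size   # ZeRO-1 shards optimizer state
    if stage >= 2:
        g /= world_size   # ZeRO-2 additionally shards gradients
    if stage >= 3:
        p /= world_size   # ZeRO-3 additionally shards parameters
    return (p + g + o) / GB

n = 7e9  # 7B-parameter model on 8 GPUs
assert per_gpu_gb(n, 8, 0) == 112.0   # plain data parallel: replicated
assert per_gpu_gb(n, 8, 1) == 38.5
assert per_gpu_gb(n, 8, 2) == 26.25
assert per_gpu_gb(n, 8, 3) == 14.0
```

At stage 3 everything scales with world size, which is why ZeRO-3 (and FSDP, which shards the same three quantities) makes 7B-class models trainable on GPUs that could not hold even the replicated optimizer state alone.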

Do I call DeepSpeed directly?

You can, but most teams use it through the `transformers` Trainer or Accelerate, which handle the config plumbing and launcher for you.

Sources

  1. DeepSpeed docs — accessed 2026-04-20
  2. DeepSpeed GitHub — accessed 2026-04-20