Capability · Framework — fine-tuning
DeepSpeed
DeepSpeed is Microsoft Research's library of memory and compute optimisations for training very large models; it powered Turing-NLG and Megatron-Turing NLG. Its signature feature is ZeRO (Zero Redundancy Optimizer), which shards optimizer state, gradients, and parameters across GPUs, fitting models 10-100x larger than a single device could hold. DeepSpeed-Inference adds optimized kernels and tensor parallelism for low-latency serving.
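The memory savings behind the 10-100x claim can be sketched with back-of-envelope arithmetic from the ZeRO paper (a sketch, not a DeepSpeed API; the 2/2/12 bytes-per-parameter split assumes mixed-precision Adam training):

```python
# Approximate per-GPU memory for model states under the ZeRO stages,
# following the ZeRO paper's mixed-precision Adam accounting:
# 2 bytes fp16 params + 2 bytes fp16 grads + 12 bytes fp32 optimizer
# state (master weights, momentum, variance) per parameter.
def zero_model_state_gb(n_params: float, n_gpus: int, stage: int) -> float:
    params, grads, optim = 2.0, 2.0, 12.0   # bytes per parameter
    if stage >= 1:
        optim /= n_gpus                     # ZeRO-1 shards optimizer state
    if stage >= 2:
        grads /= n_gpus                     # ZeRO-2 also shards gradients
    if stage >= 3:
        params /= n_gpus                    # ZeRO-3 also shards parameters
    return n_params * (params + grads + optim) / 1e9

# A 7.5B-parameter model on 64 GPUs:
baseline = zero_model_state_gb(7.5e9, 64, stage=0)  # 120 GB per GPU
stage3   = zero_model_state_gb(7.5e9, 64, stage=3)  # 1.875 GB per GPU
```

At stage 3 every model state is sharded, so per-GPU memory falls linearly with GPU count (64x here), which is where the "fits models 10-100x bigger" figure comes from. Activations and buffers add on top of this and are handled separately (activation checkpointing, offload).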
Framework facts
- Category
- fine-tuning
- Language
- Python / CUDA
- License
- Apache-2.0
- Repository
- https://github.com/deepspeedai/DeepSpeed
Install
pip install deepspeed
Quickstart
# ds_config.json — ZeRO-3 with CPU offload
# {
# "zero_optimization": {"stage": 3, "offload_optimizer": {"device": "cpu"}},
# "bf16": {"enabled": true}
# }
deepspeed --num_gpus 8 train.py --deepspeed ds_config.json
Alternatives
- FSDP (PyTorch)
- Megatron-LM
- ColossalAI
Frequently asked questions
ZeRO-3 vs FSDP?
Both shard optimizer states, gradients, and parameters (FSDP's full-shard mode corresponds roughly to ZeRO-3). FSDP is PyTorch-native and better integrated with the PyTorch ecosystem; ZeRO-3 offers more features (CPU offload, NVMe offload via ZeRO-Infinity, sequence parallelism) and remains the stronger choice for extreme scale and memory-constrained training.
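The sharding both frameworks perform can be pictured with a toy partition/all-gather round trip (a pure-Python illustration, not a DeepSpeed or PyTorch API; real implementations shard flattened tensors and overlap the gather with compute):

```python
# Each rank owns one contiguous slice of the flat parameter vector and
# "all-gathers" the remaining slices only when the full tensor is needed
# for a forward/backward pass.
def shard(flat_params, n_ranks):
    """Split a flat parameter list into one contiguous shard per rank."""
    per_rank = -(-len(flat_params) // n_ranks)   # ceil division
    return [flat_params[i * per_rank:(i + 1) * per_rank]
            for i in range(n_ranks)]

def all_gather(shards):
    """Reassemble the full parameter vector from every rank's shard."""
    return [p for s in shards for p in s]

params = list(range(10))     # stand-in for a flattened parameter tensor
shards = shard(params, 4)    # ranks 0-3 each hold at most 3 parameters
assert all_gather(shards) == params
assert max(len(s) for s in shards) == 3
```

Persistent per-rank memory is one shard rather than the whole vector; the cost is the all-gather communication on every use, which both ZeRO-3 and FSDP amortize by prefetching and overlapping with computation.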
Do I call DeepSpeed directly?
You can, but most teams use it through the `transformers` Trainer or Accelerate, which handle the config generation and the launcher for you.
Sources
- DeepSpeed docs — accessed 2026-04-20
- DeepSpeed GitHub — accessed 2026-04-20