Capability · Framework — fine-tuning
DeepSpeed
DeepSpeed is Microsoft Research's library of memory and compute optimisations for training very large models; it powered Turing-NLG and Megatron-Turing NLG. Its signature feature is ZeRO (Zero Redundancy Optimizer), which shards optimizer state, gradients, and parameters across GPUs, fitting models 10-100x larger than a single device could hold. DeepSpeed-Inference adds optimized kernels and tensor parallelism for low-latency serving.
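The memory savings behind the 10-100x claim can be sketched with back-of-envelope arithmetic from the ZeRO paper (a sketch, not a DeepSpeed API; the 2/2/12 bytes-per-parameter split assumes mixed-precision Adam training):

```python
# Approximate per-GPU memory for model states under the ZeRO stages,
# following the ZeRO paper's mixed-precision Adam accounting:
# 2 bytes fp16 params + 2 bytes fp16 grads + 12 bytes fp32 optimizer
# state (master weights, momentum, variance) per parameter.
def zero_model_state_gb(n_params: float, n_gpus: int, stage: int) -> float:
    params, grads, optim = 2.0, 2.0, 12.0   # bytes per parameter
    if stage >= 1:
        optim /= n_gpus                     # ZeRO-1 shards optimizer state
    if stage >= 2:
        grads /= n_gpus                     # ZeRO-2 also shards gradients
    if stage >= 3:
        params /= n_gpus                    # ZeRO-3 also shards parameters
    return n_params * (params + grads + optim) / 1e9

# A 7.5B-parameter model on 64 GPUs:
baseline = zero_model_state_gb(7.5e9, 64, stage=0)  # 120 GB per GPU
stage3   = zero_model_state_gb(7.5e9, 64, stage=3)  # 1.875 GB per GPU
```

At stage 3 every model state is sharded, so per-GPU memory falls linearly with GPU count (64x here), which is where the "fits models 10-100x bigger" figure comes from. Activations and buffers add on top of this and are handled separately (activation checkpointing, offload).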
Framework facts
- Category
- fine-tuning
- Language
- Python / CUDA
- License
- Apache-2.0
- Repository
- https://github.com/deepspeedai/DeepSpeed
Install
pip install deepspeed
Quickstart
# ds_config.json — ZeRO-3 with CPU offload
# {
# "zero_optimization": {"stage": 3, "offload_optimizer": {"device": "cpu"}},
# "bf16": {"enabled": true}
# }
deepspeed --num_gpus 8 train.py --deepspeed ds_config.json
Alternatives
- FSDP (PyTorch)
- Megatron-LM
- ColossalAI
Frequently asked questions
ZeRO-3 vs FSDP?
Both shard optimizer states, gradients, and parameters (FSDP's full-shard mode corresponds roughly to ZeRO-3). FSDP is PyTorch-native and better integrated with the PyTorch ecosystem; ZeRO-3 offers more features (CPU offload, NVMe offload via ZeRO-Infinity, sequence parallelism) and remains the stronger choice for extreme scale and memory-constrained training.
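The sharding both frameworks perform can be pictured with a toy partition/all-gather round trip (a pure-Python illustration, not a DeepSpeed or PyTorch API; real implementations shard flattened tensors and overlap the gather with compute):

```python
# Each rank owns one contiguous slice of the flat parameter vector and
# "all-gathers" the remaining slices only when the full tensor is needed
# for a forward/backward pass.
def shard(flat_params, n_ranks):
    """Split a flat parameter list into one contiguous shard per rank."""
    per_rank = -(-len(flat_params) // n_ranks)   # ceil division
    return [flat_params[i * per_rank:(i + 1) * per_rank]
            for i in range(n_ranks)]

def all_gather(shards):
    """Reassemble the full parameter vector from every rank's shard."""
    return [p for s in shards for p in s]

params = list(range(10))     # stand-in for a flattened parameter tensor
shards = shard(params, 4)    # ranks 0-3 each hold at most 3 parameters
assert all_gather(shards) == params
assert max(len(s) for s in shards) == 3
```

Persistent per-rank memory is one shard rather than the whole vector; the cost is the all-gather communication on every use, which both ZeRO-3 and FSDP amortize by prefetching and overlapping with computation.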
Do I call DeepSpeed directly?
You can, but most teams use it through the `transformers` Trainer or Accelerate, which handle the config generation and the launcher for you.
Sources
- DeepSpeed docs — accessed 2026-04-20
- DeepSpeed GitHub — accessed 2026-04-20