Curiosity · Concept
Model Parallelism (Tensor and Pipeline)
Frontier LLMs have hundreds of billions of parameters and do not fit on a single GPU, even at inference time. Tensor parallelism, introduced at scale by Megatron-LM, shards each weight matrix column- or row-wise across devices: every GPU computes its slice of a matmul, and an all-reduce combines the results. Pipeline parallelism instead assigns different layers to different GPUs and streams micro-batches through the pipeline so every stage stays busy. Real-world training stacks (Megatron-DeepSpeed, Llama training, ZeRO) compose tensor, pipeline, and data parallelism into 3D parallelism across thousands of GPUs. Inference systems use similar primitives, plus expert parallelism for MoE models.
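The sharding arithmetic can be checked on one machine. This is a minimal sketch, assuming the Megatron-style MLP pattern (a column-parallel matmul followed by a row-parallel matmul) and simulating two "devices" with array slices; the elementwise sum at the end plays the role of the all-reduce.

```python
import numpy as np

# Toy tensor parallelism across 2 simulated devices.
# Megatron-style MLP: Y = X @ A (column-parallel), Z = Y @ B (row-parallel).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))     # activations (replicated on both devices)
A = rng.standard_normal((8, 16))    # first weight, split by columns
B = rng.standard_normal((16, 8))    # second weight, split by rows

# Device 0 and device 1 each hold half of A's columns and half of B's rows.
A0, A1 = A[:, :8], A[:, 8:]
B0, B1 = B[:8, :], B[8:, :]

# Column-parallel matmul: each device computes its output slice independently.
Y0, Y1 = X @ A0, X @ A1

# Row-parallel matmul: each device produces a partial sum of the full output,
# so the partials must be summed -- this is the all-reduce.
Z = (Y0 @ B0) + (Y1 @ B1)

assert np.allclose(Z, (X @ A) @ B)  # matches the unsharded computation
```

Note that no communication is needed between the two matmuls: the column-parallel output `Y0`/`Y1` is exactly the input shard the row-parallel matmul wants, which is why Megatron pairs the two splits this way.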
Quick reference
- Proficiency
- Advanced
- Also known as
- tensor parallelism, pipeline parallelism, 3D parallelism
- Prerequisites
- transformer-decoder-only, GPU basics
Frequently asked questions
What is model parallelism?
Model parallelism is any technique that splits a single neural network across multiple GPUs because the model is too large (or too slow) to run on one. The two main flavors are tensor parallelism (sharding individual matmuls) and pipeline parallelism (assigning different layers to different GPUs).
Tensor parallelism vs pipeline parallelism?
Tensor parallelism splits each weight matrix across GPUs within the same layer: latency stays low, but it requires a fast interconnect (NVLink) because each layer's sharded matmuls end in an all-reduce. Pipeline parallelism puts contiguous groups of layers on different GPUs and pipelines micro-batches through them: it is less bandwidth-hungry but suffers from pipeline bubbles while the pipeline fills and drains.
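The bubble cost is easy to quantify. A sketch under GPipe-style assumptions (forward-only, `p` stages, `m` equal micro-batches): the last micro-batch exits after `m + p - 1` steps, but each stage only does useful work for `m` of them.

```python
# Toy GPipe-style schedule: p pipeline stages, m micro-batches.
# Stage s processes micro-batch i at time step s + i, so the schedule
# spans m + p - 1 steps while each stage is busy for only m of them.
def bubble_fraction(p: int, m: int) -> float:
    total_steps = m + p - 1      # steps until the pipeline fully drains
    busy_steps = m               # useful steps per stage
    return (total_steps - busy_steps) / total_steps

print(bubble_fraction(p=4, m=4))    # 3/7  ~ 0.43: bubbles dominate
print(bubble_fraction(p=4, m=32))   # 3/35 ~ 0.086: more micro-batches help
```

This is why pipeline-parallel training pushes the micro-batch count well above the stage count: the bubble fraction `(p - 1) / (m + p - 1)` shrinks as `m` grows.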
How is it different from data parallelism?
Data parallelism replicates the full model on every GPU and splits the batch across replicas. That works until the model no longer fits in a single GPU's memory; at that point you need model parallelism, and in practice large training jobs combine both (plus pipeline and expert parallelism).
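To see why replication is correct, note that averaging per-replica gradients over equal batch shards equals the gradient over the full batch. A minimal sketch, simulating four data-parallel ranks with array slices and a mean-squared-error loss (the `np.mean` stands in for the gradient all-reduce):

```python
import numpy as np

# Toy data parallelism: every "GPU" holds a full copy of the weights W and
# computes gradients on its own shard of the batch.
rng = np.random.default_rng(1)
W = rng.standard_normal((8, 1))                 # replicated weights
X = rng.standard_normal((16, 8))                # full batch of 16 examples
y = rng.standard_normal((16, 1))

shards = np.split(np.arange(16), 4)             # 4 equal data-parallel shards
grads = []
for idx in shards:
    Xi, yi = X[idx], y[idx]
    # Gradient of mean squared error on this rank's shard only.
    grads.append(2 * Xi.T @ (Xi @ W - yi) / len(idx))

g_avg = np.mean(grads, axis=0)                  # the all-reduce (average)
g_full = 2 * X.T @ (X @ W - y) / len(X)         # single-GPU reference
assert np.allclose(g_avg, g_full)               # replicas stay in sync
```

Because every replica applies the same averaged gradient to the same starting weights, the copies never diverge.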
What is 3D parallelism?
The combination of tensor, pipeline, and data parallelism used in frameworks like Megatron-DeepSpeed to train frontier-scale models across thousands of GPUs. Each dimension hits a different scaling limit, and their product (tensor size x pipeline stages x data-parallel replicas) reaches a GPU count that no single dimension could.
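Concretely, each GPU's flat rank decomposes into one coordinate per parallelism axis. A sketch assuming Megatron-style ordering (tensor ranks vary fastest so they land on the same node's fast interconnect, then pipeline, then data); real frameworks let you configure this ordering, and the function name here is illustrative:

```python
def decompose_rank(global_rank: int, tp: int, pp: int, dp: int):
    """Map a flat GPU rank to (tensor, pipeline, data) coordinates.

    Assumes tensor-parallel ranks vary fastest, then pipeline, then data,
    with tp * pp * dp GPUs in total.
    """
    tp_rank = global_rank % tp
    pp_rank = (global_rank // tp) % pp
    dp_rank = global_rank // (tp * pp)
    return tp_rank, pp_rank, dp_rank

# 2-way tensor x 4-way pipeline x 2-way data = 16 GPUs total.
coords = [decompose_rank(r, tp=2, pp=4, dp=2) for r in range(16)]
assert len(set(coords)) == 16   # every GPU gets a unique coordinate triple
assert coords[0] == (0, 0, 0)
assert coords[1] == (1, 0, 0)   # rank 1 is the same stage's tensor partner
```

GPUs sharing a `(pp_rank, dp_rank)` pair form one tensor-parallel group; those sharing `(tp_rank, dp_rank)` form one pipeline; those sharing `(tp_rank, pp_rank)` form one data-parallel group.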