Curiosity · Concept

Batching and Continuous Batching

LLM inference is memory-bandwidth bound — the GPU reloads the model weights from HBM for every decoded token, and those weights are enormous. Serving one request at a time wastes that bandwidth. Static batching (wait for N requests, then run them together) helps, but it stalls fast requests behind slow ones because all sequences must finish before the batch returns. Continuous batching (also called iteration-level scheduling) instead makes scheduling decisions at every decoder step: a finished sequence exits, a waiting request immediately fills its slot, and padding is eliminated. vLLM, TensorRT-LLM, TGI, and SGLang all ship this; throughput gains of 5-20× over naive serving are routine.
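The scheduling loop can be sketched in a few lines. This is a toy model, not any engine's actual code (the names `Request` and `serve` are illustrative, and real servers also track KV-cache capacity, not just slot count), but it shows the core move: refill the batch at every decode step instead of once per batch.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    max_new_tokens: int
    generated: int = 0

    def done(self) -> bool:
        return self.generated >= self.max_new_tokens

def serve(waiting: "deque[Request]", max_batch: int = 4) -> int:
    """Decode with iteration-level scheduling; return total decoder steps."""
    running: list[Request] = []
    steps = 0
    while waiting or running:
        # Refill any free batch slots from the queue before every step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decoder iteration: each running sequence emits one token.
        for req in running:
            req.generated += 1
        steps += 1
        # Finished sequences exit immediately, freeing their slots.
        running = [r for r in running if not r.done()]
    return steps
```

On a workload of one 100-token request and six 10-token requests with `max_batch=4`, this finishes in 100 steps: the short requests cycle through the slot freed by each departure while the long one keeps decoding. Static batching on the same workload would take max(100, 10, 10, 10) + max(10, 10, 10) = 110 steps.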

Quick reference

Proficiency
Advanced
Also known as
iteration-level scheduling, dynamic batching
Prerequisites
LLM inference, KV cache

Frequently asked questions

What is continuous batching?

Continuous batching is an LLM serving technique that dynamically swaps requests in and out of the GPU batch at every decoding step. A finished sequence exits immediately, a waiting one takes its slot, and no request is blocked by another's length.

Why is it so much faster than static batching?

Static batching forces all requests in the batch to finish before any result is returned, and it pads short sequences to the longest length — wasting GPU cycles on padding tokens. Continuous batching eliminates both the padding waste and the head-of-line blocking, improving throughput by 5-20× on typical workloads.

How does it interact with PagedAttention?

PagedAttention manages KV cache memory in fixed-size blocks so that arriving and departing requests don't fragment GPU memory. Continuous batching is the scheduler; PagedAttention is the memory manager — the two were co-designed in vLLM.
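A minimal sketch of the memory-manager side, under simplifying assumptions (the class name `BlockAllocator` and its methods are illustrative, and real implementations also handle prefix sharing and copy-on-write): each sequence grabs a fixed-size block only when it crosses a block boundary, and a finished sequence returns all its blocks to the pool at once, which is what lets continuous batching admit a new request into the freed space.

```python
class BlockAllocator:
    """Toy PagedAttention-style KV allocator: fixed-size blocks, free pool."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))     # physical block ids
        self.tables: dict[int, list[int]] = {}  # seq id -> block table
        self.lengths: dict[int, int] = {}       # seq id -> tokens stored

    def append_token(self, seq_id: int) -> None:
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # all allocated blocks are full
            if not self.free:
                raise MemoryError("no free KV blocks: preempt or swap")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: int) -> None:
        # A finished sequence frees every block in one shot.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Internal fragmentation is bounded by one partially filled block per sequence, versus the worst case of a whole max-length reservation under contiguous allocation.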

Are there downsides?

The main costs are slightly higher per-token scheduling overhead and more complex memory management. Latency for a single request can actually rise on a heavily loaded system, because each of its decode steps shares GPU time with many other sequences. Tune the maximum batch size and the maximum number of waiting requests per workload.
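As one concrete illustration of where such knobs live, vLLM exposes them at engine construction time. A hedged sketch (parameter names as in recent vLLM releases, model name purely illustrative — check your version's documentation):

```python
# Sketch, not a definitive recipe: three of vLLM's main levers over
# continuous-batching behavior. Requires a GPU and the vllm package.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    max_num_seqs=64,              # cap on sequences running per decode step
    max_num_batched_tokens=8192,  # cap on tokens processed per iteration
    gpu_memory_utilization=0.90,  # VRAM fraction for weights + KV cache
)
```

Lowering `max_num_seqs` trades aggregate throughput for steadier per-request latency; raising `gpu_memory_utilization` leaves more room for KV blocks and therefore for concurrent sequences.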

Sources

  1. Yu et al. — Orca: A Distributed Serving System for Transformer-Based Generative Models — accessed 2026-04-20
  2. Kwon et al. — Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM) — accessed 2026-04-20