Curiosity · Concept
PagedAttention
A naive LLM server pre-allocates a contiguous KV-cache slab per request, sized to the maximum possible sequence length, wasting large amounts of GPU memory when actual lengths vary. PagedAttention, introduced with vLLM by Kwon et al. (2023), instead allocates the KV cache in small fixed-size blocks (commonly 16 tokens per block); a per-sequence block table maps each sequence's logical blocks to physical blocks in GPU memory. Because blocks need not be contiguous, memory is allocated just-in-time as tokens are generated, internal fragmentation drops to near zero, and blocks can be shared across requests (e.g., identical system prompts). Combined with continuous batching it delivers 2-4× higher throughput on the same hardware and is now the baseline for high-performance serving.
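The bookkeeping above can be sketched as a toy CPU-side allocator. This is illustrative, not vLLM's internal API: the class and method names are assumptions, and real implementations manage actual GPU tensors rather than integer block IDs.

```python
class PagedKVAllocator:
    """Toy sketch of PagedAttention bookkeeping (names are illustrative).
    Physical KV blocks sit in a free pool; each sequence's block table
    maps its logical blocks to physical block IDs."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block IDs
        self.block_tables = {}                       # seq_id -> [physical IDs]
        self.seq_lens = {}                           # seq_id -> token count

    def append_token(self, seq_id: str) -> None:
        """Grow a sequence by one token, allocating a fresh block only
        when the current one is full (just-in-time allocation)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:                 # last block full, or none yet
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

A sequence of 17 tokens occupies exactly two 16-token blocks, so the worst-case waste per sequence is one partial block, versus an entire max-length slab in the naive scheme.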
Quick reference
- Proficiency
- Advanced
- Also known as
- paged KV cache
- Prerequisites
- KV cache, attention mechanism
Frequently asked questions
What is PagedAttention?
PagedAttention is a GPU memory-management technique from vLLM that stores each sequence's KV cache in fixed-size non-contiguous blocks referenced by a page table — inspired by OS virtual-memory paging. It eliminates the internal fragmentation of naive contiguous KV allocation.
Why is KV-cache memory management so critical?
At long contexts and large batches, KV cache often dominates GPU memory usage — larger than the model weights. Saving even 20-30% on KV memory directly translates to serving bigger batches, which is the main throughput lever.
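A back-of-envelope calculation makes the claim concrete. The per-token KV footprint is 2 (K and V) × layers × KV heads × head dimension × bytes per element; the config below is an assumed Llama-7B-like shape, not taken from the source.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV-cache bytes stored per token: K and V tensors for every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Assumed Llama-7B-like config: 32 layers, 32 KV heads, head_dim 128, fp16.
per_tok = kv_bytes_per_token(32, 32, 128)        # 524,288 B = 0.5 MiB per token
total = per_tok * 32 * 4096                      # batch 32, 4096-token contexts
print(per_tok, total // 2**30)                   # 524288 bytes/token, 64 GiB
```

At 0.5 MiB per token, a batch of 32 full 4096-token contexts needs 64 GiB of KV cache, several times the ~14 GB of fp16 weights, which is why reclaiming fragmented KV memory is the dominant throughput lever.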
How does PagedAttention enable prefix sharing?
Because KV blocks are indexed through a page table, two requests with an identical prefix (same system prompt, same retrieved context) can point to the same physical blocks. This is how vLLM-style systems get near-free prompt caching.
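Sharing can be sketched as block-table aliasing with reference counts. This is a minimal illustration under assumed names, not vLLM's API; real systems also handle divergence after the shared prefix copy-on-write, which is omitted here for brevity.

```python
from collections import Counter

class SharedPrefixTable:
    """Toy sketch of prefix sharing (names are illustrative). Two block
    tables alias the same physical blocks; refcounts ensure a shared
    block is freed only when its last reader finishes."""

    def __init__(self):
        self.refcount = Counter()    # physical block ID -> number of readers
        self.tables = {}             # seq_id -> list of physical block IDs

    def register(self, seq_id: str, blocks: list[int]) -> None:
        """Record a sequence whose prompt occupies the given blocks."""
        self.tables[seq_id] = list(blocks)
        for blk in blocks:
            self.refcount[blk] += 1

    def fork(self, parent: str, child: str) -> None:
        """A new request with an identical prefix reuses the parent's
        physical blocks at zero copy cost."""
        shared = list(self.tables[parent])
        self.tables[child] = shared
        for blk in shared:
            self.refcount[blk] += 1

    def release(self, seq_id: str) -> list[int]:
        """Drop a reader; return the blocks that became free."""
        freed = []
        for blk in self.tables.pop(seq_id, []):
            self.refcount[blk] -= 1
            if self.refcount[blk] == 0:
                freed.append(blk)
        return freed
```

Forking is O(prefix blocks) of pointer updates with no KV recomputation or copying, which is why caching a long shared system prompt is nearly free.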
Do I need to use vLLM to get PagedAttention?
No — TensorRT-LLM, SGLang, and TGI implement equivalent paged KV cache schemes. The concept has become standard; the specific implementations differ in block size, eviction, and sharing policies.
Sources
- Kwon et al. — Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) — accessed 2026-04-20
- vLLM documentation — accessed 2026-04-20