Curiosity · Concept
KV Cache
Without a KV cache, generating each new token would require recomputing attention over the full prefix: O(n^2) work repeated at every step. The KV cache saves the key and value tensors for every past token and reuses them, so each new token costs only an incremental O(n) attention step. KV cache memory is often the dominant cost in LLM serving.
Quick reference
- Proficiency
- Advanced
- Also known as
- KV caching, attention cache, key-value cache
- Prerequisites
- Self-attention, Transformer architecture
Frequently asked questions
What is the KV cache?
It is the stored key and value tensors for every previous token, prompt and generated alike. During autoregressive generation, only the new token's key and value are computed and appended; its query then attends over the cache instead of reprojecting the full prefix, turning each decode step from O(n^2) into O(n).
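A minimal sketch of one cached decode step, using toy single-head NumPy code (the weight matrices and dimensions here are illustrative, not any particular model's):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(x_new, Wq, Wk, Wv, cache):
    """One autoregressive step: project only the NEW token,
    append its K/V to the cache, attend over the whole cache."""
    q = x_new @ Wq                                     # (1, d)
    cache["K"] = np.vstack([cache["K"], x_new @ Wk])   # (t, d)
    cache["V"] = np.vstack([cache["V"], x_new @ Wv])
    d = q.shape[-1]
    scores = (q @ cache["K"].T) / np.sqrt(d)           # (1, t): O(t) per step
    return softmax(scores) @ cache["V"]                # (1, d)

# Usage: generate over a toy sequence, one token at a time.
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}
for t in range(5):
    out = decode_step(rng.standard_normal((1, d)), Wq, Wk, Wv, cache)
assert cache["K"].shape == (5, d)   # one cached K/V row per past token
```

Note that the prefix's keys and values are never reprojected; only the dot products against the cache grow with sequence length.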
Why does KV cache dominate LLM memory cost?
The cache grows linearly with sequence length and batch size, and every layer contributes its own keys and values. For a 70B-class model at fp16 with a 128k context, the KV cache can reach roughly 40 GB per request even with grouped-query attention. This is why context length and batch throughput are coupled: a bigger context means fewer concurrent requests fit in memory.
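The arithmetic is simple enough to sketch. The configuration below assumes a Llama-2-70B-like layout (80 layers, 8 KV heads under GQA, head dimension 128, 2 bytes per fp16 element); plug in other values for other models:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for K and V, per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama-2-70B-like config: 80 layers, 8 KV heads (GQA), head_dim 128, fp16.
gib = kv_cache_bytes(80, 8, 128, 128 * 1024) / 2**30
print(f"{gib:.1f} GiB per 128k-token request")  # prints "40.0 GiB per 128k-token request"
```

Without GQA (64 KV heads instead of 8), the same request would need 8x that, which is why head-sharing schemes matter so much at long context.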
What are GQA and MQA?
Multi-Query Attention (MQA) shares one key/value head across all query heads, shrinking the KV cache dramatically. Grouped-Query Attention (GQA) is a middle ground — several query heads share a K/V head. Llama 2-70B and Llama 3 use GQA; PaLM used MQA. Both trade a small quality cost for much smaller KV cache.
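The sharing pattern can be shown in a few lines of NumPy. This is a toy non-causal sketch, not any library's implementation; the point is that the cache only ever holds `n_kv_heads` K/V tensors, while queries keep their full head count:

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, t, d); k, v: (n_kv_heads, t, d), where n_q_heads is a
    multiple of n_kv_heads. Each group of query heads shares one cached
    K/V head (GQA); n_kv_heads == 1 is the MQA limit."""
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)   # broadcast shared K/V to query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v                      # (n_q_heads, t, d)

rng = np.random.default_rng(1)
q = rng.standard_normal((8, 4, 16))   # 8 query heads, 4 tokens, head dim 16
k = rng.standard_normal((1, 4, 16))   # a single shared K/V head -> MQA
v = rng.standard_normal((1, 4, 16))
out = grouped_query_attention(q, k, v)
assert out.shape == (8, 4, 16)        # cache is 8x smaller than full MHA here
```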
What is PagedAttention?
PagedAttention (Kwon et al., vLLM) manages the KV cache like virtual memory: fixed-size blocks ("pages") are allocated on demand and mapped through a per-sequence block table. This eliminates external fragmentation, confines internal fragmentation to each sequence's last block, and enables prefix sharing between requests, boosting serving throughput by 2-4x over prior systems.
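A toy block-table allocator in the spirit of PagedAttention (the class, block size, and method names are invented for illustration; the real vLLM allocator also handles freeing, copy-on-write, and shared prefixes):

```python
BLOCK = 16  # tokens per page (illustrative; vLLM's default is also 16)

class PagedKVCache:
    """Maps each sequence's logical token positions to fixed-size
    physical blocks, allocated lazily as the sequence grows."""
    def __init__(self, n_blocks):
        self.free = list(range(n_blocks))   # pool of physical block ids
        self.tables = {}                    # seq_id -> [block ids]
        self.lens = {}                      # seq_id -> token count

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lens.get(seq_id, 0)
        if n % BLOCK == 0:                  # current block full: grab a new page
            table.append(self.free.pop())
        self.lens[seq_id] = n + 1

    def slot(self, seq_id, pos):
        # Physical (block, offset) address of a logical token position.
        return self.tables[seq_id][pos // BLOCK], pos % BLOCK

pool = PagedKVCache(n_blocks=64)
for _ in range(40):                         # 40 tokens -> ceil(40/16) = 3 pages
    pool.append_token("req-0")
assert len(pool.tables["req-0"]) == 3
```

Because pages are claimed only when a block fills, unused capacity is never reserved up front, which is exactly what removes the fragmentation that plagues contiguous per-request KV allocations.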
Sources
- Shazeer — Fast Transformer Decoding: One Write-Head is All You Need (MQA) — accessed 2026-04-20
- Ainslie et al. — GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints — accessed 2026-04-20
- Kwon et al. — Efficient Memory Management for LLM Serving with PagedAttention (vLLM) — accessed 2026-04-20