Curiosity · Concept

LLM KV-Cache Compression

During autoregressive decoding, LLMs cache the key and value vectors of every previous token so attention doesn't recompute them. At long context lengths this KV cache dominates GPU memory — often larger than the weights themselves. KV-cache compression covers four main levers: quantization (INT8/INT4 per-head scales), eviction (discard old or low-attention tokens, e.g., StreamingLLM, H2O), low-rank decomposition (DeepSeek's MLA), and token pruning/merging. In production these techniques are stacked to unlock 100k-plus token contexts cost-effectively.
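To make the mechanics concrete, here is a minimal single-head decode loop with a KV cache (toy sizes, NumPy, hypothetical weight names): each step appends one key/value pair, and attention runs over the whole cache. The stacked `K`/`V` arrays are exactly the tensors that compression techniques shrink.

```python
import numpy as np

def decode_step(x_t, W_q, W_k, W_v, k_cache, v_cache):
    """One decode step with a KV cache (single head, toy dimensions).

    Instead of recomputing K/V for every previous token, we append this
    token's k/v to the cache and attend over the full cache.
    """
    q = x_t @ W_q                        # query for the current token, shape (d,)
    k_cache.append(x_t @ W_k)            # cache grows by one entry per token
    v_cache.append(x_t @ W_v)
    K = np.stack(k_cache)                # (t, d) -- this is what compression targets
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())    # numerically stable softmax
    w /= w.sum()
    return w @ V                         # attention output for this step

d = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []
for _ in range(5):                       # 5 decode steps -> cache holds 5 K/V pairs
    out = decode_step(rng.standard_normal(d), W_q, W_k, W_v, k_cache, v_cache)
```

Note the linear growth: every generated token adds one key and one value per layer and head, which is why long contexts blow up memory.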

Quick reference

Proficiency
Advanced
Also known as
KV cache quantization, KV cache eviction
Prerequisites
KV cache, Attention mechanism, Quantization

Frequently asked questions

Why is KV-cache compression important?

At 32k+ token contexts, the KV cache can be several times larger than the model weights. Shrinking it is the single biggest lever for long-context and high-batch inference cost.
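The arithmetic behind that claim is simple. A back-of-the-envelope sketch, assuming a hypothetical 7B-class model with full multi-head KV (no grouped-query attention): 32 layers, 32 KV heads, head dimension 128, FP16:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Total KV-cache size in bytes; the leading 2 counts keys AND values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 7B-class config: 32 layers x 32 KV heads x head_dim 128, FP16.
gib = kv_cache_bytes(32, 32, 128, seq_len=32_768, batch=8) / 2**30
print(gib)  # -> 128.0 GiB of cache, vs roughly 13 GiB of FP16 weights
```

At batch 8 and 32k tokens the cache alone is about 128 GiB, roughly 10x the weights, which is why shrinking it (or cutting `n_kv_heads` via GQA, or `bytes_per_elem` via quantization) dominates inference cost.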

What are the main KV-cache compression techniques?

Quantization (store K/V in INT8 or INT4), eviction (drop old / low-attention tokens — StreamingLLM, H2O, Scissorhands), low-rank compression (DeepSeek's MLA), and token merging / pruning.
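A minimal sketch of the first lever, symmetric per-row INT8 quantization of a key tensor (the function names and shapes here are illustrative, not any library's API). Each (head, token) row gets its own scale so outliers in one head don't wreck precision elsewhere:

```python
import numpy as np

def quantize_int8(x, axis=-1):
    """Symmetric INT8 quantization: x ~= q * scale, one scale per slice along `axis`."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)         # avoid divide-by-zero on empty rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Toy key tensor: (heads, tokens, head_dim) in FP32.
k = np.random.default_rng(0).standard_normal((4, 16, 64)).astype(np.float32)
q, s = quantize_int8(k)
k_hat = dequantize(q, s)
err = np.abs(k - k_hat).max()                        # bounded by scale / 2 per row
```

The INT8 payload is 4x smaller than FP32 (2x smaller than FP16) per element, at the cost of storing one scale per row; the rounding error stays below half a quantization step.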

Does KV-cache compression hurt quality?

INT8 KV quantization is near-lossless. INT4 and aggressive eviction can degrade quality on tasks that need distant context. Method choice depends on task: short-horizon chat tolerates more compression than long-document QA.

How does MLA fit into this?

DeepSeek's Multi-head Latent Attention is essentially a learned low-rank compression of K/V: only a small latent vector per token is cached, and the model is trained end-to-end to work with it, giving near-MHA quality.
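A toy sketch of the low-rank idea, assuming made-up projection names and dimensions (real MLA is trained end-to-end and handles rotary embeddings separately; this only shows the memory trade): cache an `r`-dimensional latent per token instead of separate `d`-dimensional keys and values, and reconstruct K/V on the fly.

```python
import numpy as np

d, r = 64, 8                                         # model dim, latent dim (r << d)
rng = np.random.default_rng(0)
W_down = rng.standard_normal((d, r)) / np.sqrt(d)    # compress hidden -> latent
W_up_k = rng.standard_normal((r, d)) / np.sqrt(r)    # expand latent -> key
W_up_v = rng.standard_normal((r, d)) / np.sqrt(r)    # expand latent -> value

h = rng.standard_normal((16, d))                     # hidden states for 16 tokens
latent_cache = h @ W_down                            # (16, r): the ONLY thing cached
# At attention time, K and V are reconstructed from the latent:
K = latent_cache @ W_up_k
V = latent_cache @ W_up_v
ratio = (2 * d) / r                                  # cached floats per token: 2d -> r
```

Here the cache shrinks by `2d / r = 16x`; because the projections are learned jointly with the rest of the model, the quality loss is far smaller than post-hoc compression of the same rank.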

Sources

  1. Xiao et al. — Efficient Streaming Language Models with Attention Sinks (StreamingLLM) — accessed 2026-04-20
  2. Zhang et al. — H2O: Heavy-Hitter Oracle for Efficient Generative Inference of LLMs — accessed 2026-04-20