Curiosity · Concept
Multi-Latent Attention (MLA)
Multi-Latent Attention (MLA) is the attention scheme at the core of DeepSeek-V2, DeepSeek-V3, and R1. Instead of caching full keys and values per head, MLA down-projects them into a small shared latent vector, caches that, and up-projects on the fly during attention. The KV cache becomes roughly the size of a single small vector per token, which makes extremely long contexts and high-throughput inference practical on modest hardware. Decoupled rotary projections handle positional information separately so the latent compression doesn't fight RoPE.
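The cache trade at the heart of MLA can be sketched in a few lines of numpy. This is a toy illustration, not DeepSeek's implementation: every dimension and weight name below is made up, and real MLA also compresses queries and routes positional information through a separate rotary branch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, far smaller than DeepSeek's real config).
d_model, n_heads, d_head, d_latent = 256, 8, 32, 16

# Random stand-ins for learned projection matrices.
W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02           # down-projection
W_uk  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # latent -> per-head K
W_uv  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # latent -> per-head V

h = rng.standard_normal((10, d_model))  # hidden states for 10 tokens

# Cache only the small shared latent, not per-head keys and values.
kv_cache = h @ W_dkv                    # shape (10, d_latent)

# At attention time, up-project the latent back to per-head K and V.
K = (kv_cache @ W_uk).reshape(10, n_heads, d_head)
V = (kv_cache @ W_uv).reshape(10, n_heads, d_head)

full_cache = 2 * n_heads * d_head       # floats per token for a standard MHA cache
mla_cache = d_latent                    # floats per token for the MLA latent cache
print(full_cache, mla_cache)            # prints: 512 16
```

Because `W_uk` and `W_uv` are fixed, the up-projection is a cheap matrix multiply at decode time; the per-token storage cost is just the latent width.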
Quick reference
- Proficiency
- Advanced
- Also known as
- MLA, multi-head latent attention
- Prerequisites
- Self-attention, KV cache, RoPE, Grouped-query attention
Frequently asked questions
What is Multi-Latent Attention?
MLA is an attention variant that projects keys and values into a shared low-rank latent space and caches only that latent. At attention time the latent is up-projected back to per-head K/V for the computation; because the up-projection matrices are fixed, they can even be absorbed into the adjacent query and output projections at inference, so the latent is consumed directly.
How does MLA compare to MQA and GQA?
MQA shares one K/V across all heads; GQA shares one K/V per group. MLA instead compresses K/V into a low-rank latent bottleneck shared by all heads. DeepSeek-V2 reports that MLA matches or exceeds MHA quality with a KV cache comparable to GQA with only 2.25 groups: far smaller than MHA's, though somewhat larger than MQA's single shared K/V.
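The per-token cache sizes are easy to compare with back-of-the-envelope arithmetic. The head count, head width, latent width, and rotary width below follow DeepSeek-V2's reported configuration; the GQA group count of 8 is purely illustrative.

```python
# KV-cache elements per token per layer, using DeepSeek-V2-like numbers
# (n_heads=128, d_head=128, compressed latent d_c=512, decoupled rotary d_r=64).
n_heads, d_head = 128, 128
d_c, d_r = 512, 64

mha  = 2 * n_heads * d_head  # full keys and values for every head
gqa8 = 2 * 8 * d_head        # 8 shared KV groups (illustrative)
mqa  = 2 * d_head            # one shared K/V for all heads
mla  = d_c + d_r             # compressed latent + small rotary key

print(mha, gqa8, mqa, mla)   # prints: 32768 2048 256 576
```

Note that 576 equals a GQA cache with 2.25 groups (2 × 2.25 × 128), which is where the paper's comparison figure comes from.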
Why does MLA need a decoupled RoPE branch?
RoPE's position-dependent rotation doesn't commute with the low-rank decomposition: applying RoPE to up-projected keys would prevent the projection matrices from being absorbed and force the compressed keys to be recomputed for every position. DeepSeek therefore splits K into a position-free compressed content component plus a small rotary component that carries RoPE and is cached separately. This preserves positional information while adding only a few extra dimensions per token to the cache.
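The split can be sketched in numpy with hypothetical names and toy sizes: the content part of the key stays position-free (so it can live in the compressed latent), while a small rotary part carries position through RoPE and is concatenated back on at attention time.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_content, d_rot = 64, 24, 8   # toy sizes, purely illustrative

W_k_content = rng.standard_normal((d_model, d_content)) * 0.05
W_k_rot     = rng.standard_normal((d_model, d_rot)) * 0.05

def rope(x, pos, base=10000.0):
    """Apply a rotary position embedding to the last dimension of x."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

h = rng.standard_normal((d_model,))      # one token's hidden state at position 7

k_content = h @ W_k_content              # position-free: recoverable from the latent
k_rot = rope(h @ W_k_rot, pos=7)         # small branch carries position via RoPE
k = np.concatenate([k_content, k_rot])   # final key: content ++ rotary
```

Only `k_rot`-sized material needs position-aware caching, which is why the decoupled branch adds so little to the per-token cost.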
Which models use MLA?
DeepSeek-V2 introduced MLA, and it's used throughout the DeepSeek-V3 and R1 family. Its KV-cache efficiency is a key reason DeepSeek can serve very long contexts cost-effectively.
Sources
- DeepSeek-AI — DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — accessed 2026-04-20
- DeepSeek-AI — DeepSeek-V3 Technical Report — accessed 2026-04-20