Curiosity · Concept
Multi-Latent Attention (MLA)
Multi-Latent Attention (MLA) is the attention scheme at the core of DeepSeek-V2, DeepSeek-V3, and R1. Instead of caching full keys and values per head, MLA down-projects them into a small shared latent vector, caches that, and up-projects on the fly during attention. The KV cache becomes roughly the size of a single small vector per token, which makes extremely long contexts and high-throughput inference practical on modest hardware. Decoupled rotary projections handle positional information separately so the latent compression doesn't fight RoPE.
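The cache trade at the heart of MLA can be sketched in a few lines of numpy. This is a toy illustration, not DeepSeek's implementation: every dimension and weight name below is made up, and real MLA also compresses queries and routes positional information through a separate rotary branch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, far smaller than DeepSeek's real config).
d_model, n_heads, d_head, d_latent = 256, 8, 32, 16

# Random stand-ins for learned projection matrices.
W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02           # down-projection
W_uk  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # latent -> per-head K
W_uv  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # latent -> per-head V

h = rng.standard_normal((10, d_model))  # hidden states for 10 tokens

# Cache only the small shared latent, not per-head keys and values.
kv_cache = h @ W_dkv                    # shape (10, d_latent)

# At attention time, up-project the latent back to per-head K and V.
K = (kv_cache @ W_uk).reshape(10, n_heads, d_head)
V = (kv_cache @ W_uv).reshape(10, n_heads, d_head)

full_cache = 2 * n_heads * d_head       # floats per token for a standard MHA cache
mla_cache = d_latent                    # floats per token for the MLA latent cache
print(full_cache, mla_cache)            # prints: 512 16
```

Because `W_uk` and `W_uv` are fixed, the up-projection is a cheap matrix multiply at decode time; the per-token storage cost is just the latent width.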
Quick reference
- Proficiency
- Advanced
- Also known as
- MLA, multi-head latent attention
- Prerequisites
- Self-attention, KV cache, RoPE, Grouped-query attention
Frequently asked questions
What is Multi-Latent Attention?
MLA is an attention variant that projects keys and values into a shared low-rank latent space and caches only that latent. At attention time the latent is up-projected back to per-head K/V for the computation; because the up-projection matrices are fixed, they can even be absorbed into the adjacent query and output projections at inference, so the latent is consumed directly.
How does MLA compare to MQA and GQA?
MQA shares one K/V across all heads; GQA shares one K/V per group. MLA instead compresses K/V into a low-rank latent bottleneck shared by all heads. DeepSeek-V2 reports that MLA matches or exceeds MHA quality with a KV cache comparable to GQA with only 2.25 groups: far smaller than MHA's, though somewhat larger than MQA's single shared K/V.
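The per-token cache sizes are easy to compare with back-of-the-envelope arithmetic. The head count, head width, latent width, and rotary width below follow DeepSeek-V2's reported configuration; the GQA group count of 8 is purely illustrative.

```python
# KV-cache elements per token per layer, using DeepSeek-V2-like numbers
# (n_heads=128, d_head=128, compressed latent d_c=512, decoupled rotary d_r=64).
n_heads, d_head = 128, 128
d_c, d_r = 512, 64

mha  = 2 * n_heads * d_head  # full keys and values for every head
gqa8 = 2 * 8 * d_head        # 8 shared KV groups (illustrative)
mqa  = 2 * d_head            # one shared K/V for all heads
mla  = d_c + d_r             # compressed latent + small rotary key

print(mha, gqa8, mqa, mla)   # prints: 32768 2048 256 576
```

Note that 576 equals a GQA cache with 2.25 groups (2 × 2.25 × 128), which is where the paper's comparison figure comes from.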
Why does MLA need a decoupled RoPE branch?
RoPE's position-dependent rotation doesn't commute with the low-rank decomposition: applying RoPE to up-projected keys would prevent the projection matrices from being absorbed and force the compressed keys to be recomputed for every position. DeepSeek therefore splits K into a position-free compressed content component plus a small rotary component that carries RoPE and is cached separately. This preserves positional information while adding only a few extra dimensions per token to the cache.
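The split can be sketched in numpy with hypothetical names and toy sizes: the content part of the key stays position-free (so it can live in the compressed latent), while a small rotary part carries position through RoPE and is concatenated back on at attention time.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_content, d_rot = 64, 24, 8   # toy sizes, purely illustrative

W_k_content = rng.standard_normal((d_model, d_content)) * 0.05
W_k_rot     = rng.standard_normal((d_model, d_rot)) * 0.05

def rope(x, pos, base=10000.0):
    """Apply a rotary position embedding to the last dimension of x."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

h = rng.standard_normal((d_model,))      # one token's hidden state at position 7

k_content = h @ W_k_content              # position-free: recoverable from the latent
k_rot = rope(h @ W_k_rot, pos=7)         # small branch carries position via RoPE
k = np.concatenate([k_content, k_rot])   # final key: content ++ rotary
```

Only `k_rot`-sized material needs position-aware caching, which is why the decoupled branch adds so little to the per-token cost.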
Which models use MLA?
DeepSeek-V2 introduced MLA, and it's used throughout the DeepSeek-V3 and R1 family. Its KV-cache efficiency is a key reason DeepSeek can serve very long contexts cost-effectively.
Sources
- DeepSeek-AI — DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — accessed 2026-04-20
- DeepSeek-AI — DeepSeek-V3 Technical Report — accessed 2026-04-20