Curiosity · Concept
Multi-Query Attention (MQA)
Multi-Query Attention (MQA), introduced by Shazeer in 2019, keeps multiple query heads but collapses keys and values to a single shared head. This cuts KV-cache size by a factor equal to the number of attention heads (often 32–64×), making long-context inference and high-throughput serving much cheaper. MQA powers PaLM and early Falcon models, but can degrade quality and training stability — which is why many newer models now use its relaxed cousin, Grouped-Query Attention.
Quick reference
- Proficiency
- Intermediate
- Also known as
- MQA, multi-query attention
- Prerequisites
- Self-attention, Multi-head attention, KV cache
Frequently asked questions
What is Multi-Query Attention?
MQA is an attention variant where every query head shares a single key and value projection. Instead of N separate K/V heads, you have one K/V head and N query heads, shrinking the KV cache by a factor of N.
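The sharing can be made concrete in a few lines. Below is a minimal NumPy sketch (illustrative only; shapes, weight layout, and function names are assumptions for this example, not any library's API). Note that `Wk` and `Wv` each project to a single head dimension, while `Wq` projects to all N heads:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mqa(x, Wq, Wk, Wv, n_heads):
    """Multi-Query Attention over one sequence.

    x:  (seq, d_model)
    Wq: (d_model, n_heads * d_head)  -- N separate query heads
    Wk: (d_model, d_head)            -- ONE shared key head
    Wv: (d_model, d_head)            -- ONE shared value head
    """
    seq, _ = x.shape
    d_head = Wk.shape[1]
    q = (x @ Wq).reshape(seq, n_heads, d_head)  # per-head queries
    k = x @ Wk                                  # single K: (seq, d_head)
    v = x @ Wv                                  # single V: (seq, d_head)
    # Every query head attends against the SAME k and v.
    scores = np.einsum("qhd,kd->hqk", q, k) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)             # (n_heads, seq, seq)
    out = np.einsum("hqk,kd->qhd", attn, v)     # (seq, n_heads, d_head)
    return out.reshape(seq, n_heads * d_head)

rng = np.random.default_rng(0)
d_model, d_head, n_heads, seq = 64, 16, 4, 8
x = rng.standard_normal((seq, d_model))
out = mqa(x,
          rng.standard_normal((d_model, n_heads * d_head)),
          rng.standard_normal((d_model, d_head)),
          rng.standard_normal((d_model, d_head)),
          n_heads)
print(out.shape)  # → (8, 64)
```

During decoding, only `k` and `v` need to be cached per layer, which is where the factor-of-N saving comes from: the cache holds one head's worth of keys and values instead of N.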
Why use MQA instead of standard multi-head attention?
During autoregressive decoding, every generated token must load the entire KV cache. MQA makes that cache N-times smaller, which dramatically speeds up inference and lets you serve longer contexts and bigger batches on the same GPU.
What's the downside of MQA?
Collapsing K/V to one head reduces expressivity. Empirical studies show small but real quality drops on perplexity and downstream tasks, and training can be less stable. GQA was designed specifically to fix this trade-off.
Which models use MQA?
PaLM (Google) and early Falcon models used MQA. Most newer open-weight LLMs like Llama 2 70B+ and Mistral use Grouped-Query Attention instead, which recovers most of the lost quality.
Sources
- Shazeer — Fast Transformer Decoding: One Write-Head is All You Need — accessed 2026-04-20
- Ainslie et al. — GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (discusses MQA trade-offs) — accessed 2026-04-20