Curiosity · Concept
Grouped-Query Attention (GQA)
Grouped-Query Attention (GQA) sits between vanilla Multi-Head Attention (MHA), where every query head has its own key/value projection, and Multi-Query Attention (MQA), where all query heads share one K/V pair. GQA splits the query heads into groups and gives each group a shared K/V head. The result is a much smaller KV cache at inference time, which is the bottleneck for long-context and high-batch serving, while preserving most of the modelling capacity of full MHA. Llama 2 70B, Llama 3, and Mistral popularised the technique.
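The sharing scheme above can be sketched in a few lines of NumPy. This is a minimal single-sequence forward pass with illustrative shapes (8 query heads, 2 K/V heads, so groups of 4); it broadcasts each K/V head across its group rather than implementing any particular library's kernel.

```python
import numpy as np

# Illustrative sizes, not from any real model config.
rng = np.random.default_rng(0)
seq_len, n_q_heads, n_kv_heads, d_head = 6, 8, 2, 4
group_size = n_q_heads // n_kv_heads  # 4 query heads share each K/V head

q = rng.standard_normal((n_q_heads, seq_len, d_head))
k = rng.standard_normal((n_kv_heads, seq_len, d_head))  # only n_kv_heads
v = rng.standard_normal((n_kv_heads, seq_len, d_head))  # are cached

# Broadcast each K/V head to its group of query heads.
k_rep = np.repeat(k, group_size, axis=0)  # (n_q_heads, seq_len, d_head)
v_rep = np.repeat(v, group_size, axis=0)

# Standard scaled dot-product attention from here on.
scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(d_head)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_rep  # (n_q_heads, seq_len, d_head)
```

Only `k` and `v` need to live in the KV cache, so memory scales with `n_kv_heads` while compute and output shape still match the full query-head count.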
Quick reference
- Proficiency
- Intermediate
- Also known as
- GQA, grouped query attention
- Prerequisites
- Self-attention, Multi-head attention, KV cache
Frequently asked questions
What is Grouped-Query Attention?
GQA is an attention variant where several query heads share one key/value head. For 32 query heads with 8 KV groups, every 4 query heads share one K/V projection — shrinking KV-cache memory by 4x with minimal quality loss.
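The head-to-group mapping in that example works out as follows (plain arithmetic, no framework assumed):

```python
n_q_heads, n_kv_heads = 32, 8
group_size = n_q_heads // n_kv_heads  # 4 query heads per K/V head

# Which K/V head serves each query head: heads 0-3 -> kv head 0,
# heads 4-7 -> kv head 1, and so on.
kv_head_for = [q // group_size for q in range(n_q_heads)]

cache_reduction = n_q_heads / n_kv_heads  # 4.0x smaller than MHA's cache
```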
How does GQA differ from MHA and MQA?
MHA has one K/V head per query head (biggest cache, full quality). MQA uses a single K/V head for all query heads (smallest cache, but a measurable quality drop). GQA groups heads in between: a cache nearly as small as MQA's with quality close to full MHA.
Why does GQA matter for inference?
At long contexts, the KV cache dominates GPU memory and bandwidth. Reducing the number of K/V heads by 4-8x shrinks the cache by the same factor, letting you serve longer contexts, run larger batches, and lower the cost per token.
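A back-of-the-envelope sketch makes the savings concrete. Using Llama 2 70B's published shape (80 layers, head dimension 128, 64 query heads, 8 K/V heads) and fp16 cache entries:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, dtype_bytes=2):
    # Two cached tensors per layer (K and V), fp16/bf16 by default.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * dtype_bytes

# Llama-2-70B-like shape: 80 layers, head dim 128, 4k context, batch 1.
mha = kv_cache_bytes(80, 64, 128, seq_len=4096, batch=1)  # 64 K/V heads
gqa = kv_cache_bytes(80, 8, 128, seq_len=4096, batch=1)   # 8 K/V heads

print(mha / 2**30, gqa / 2**30)  # 10.0 GiB vs 1.25 GiB
```

At batch 32 the MHA cache alone would exceed 320 GiB, which is why the 8x reduction is what makes high-batch, long-context serving feasible.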
Do I need to retrain a model to use GQA?
Not from scratch. Ainslie et al. (2023) showed you can 'uptrain' an existing MHA checkpoint into GQA: mean-pool each group's K/V projection heads into a single shared head, then continue pre-training briefly (around 5% of the original compute) to recover quality.
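The mean-pooling initialisation can be sketched in NumPy. Shapes here are illustrative (`n_heads`, `d_model`, `d_head` are small toy values), but the operation matches the paper's conversion recipe: average each group of per-head K (and likewise V) projection matrices into one shared head.

```python
import numpy as np

# Toy shapes for illustration only.
rng = np.random.default_rng(0)
d_model, n_heads, d_head, n_kv_heads = 16, 8, 4, 2
group = n_heads // n_kv_heads  # 4 MHA heads pooled into each GQA head

# Per-head K projection matrices from a pretrained MHA checkpoint.
w_k_mha = rng.standard_normal((n_heads, d_model, d_head))

# Mean-pool each group of heads into a single shared K head.
w_k_gqa = w_k_mha.reshape(n_kv_heads, group, d_model, d_head).mean(axis=1)
# w_k_gqa: (n_kv_heads, d_model, d_head); do the same for V, then fine-tune.
```

Query and output projections are left untouched; only the K and V weights are pooled, after which a short uptraining run restores most of the lost quality.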
Sources
- Ainslie et al. — GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints — accessed 2026-04-20
- Touvron et al. — Llama 2: Open Foundation and Fine-Tuned Chat Models — accessed 2026-04-20