Curiosity · Concept
Grouped-Query Attention (GQA)
Grouped-Query Attention (GQA) sits between vanilla Multi-Head Attention (MHA), where every query head has its own key/value projection, and Multi-Query Attention (MQA), where all query heads share one K/V pair. GQA splits the query heads into groups and gives each group a shared K/V head. The result is a much smaller KV cache at inference time, which is the bottleneck for long-context and high-batch serving, while preserving most of the modelling capacity of full MHA. Llama 2 70B, Llama 3, and Mistral popularised the technique.
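The sharing scheme above can be sketched in a few lines of NumPy. This is a minimal single-sequence forward pass with illustrative shapes (8 query heads, 2 K/V heads, so groups of 4); it broadcasts each K/V head across its group rather than implementing any particular library's kernel.

```python
import numpy as np

# Illustrative sizes, not from any real model config.
rng = np.random.default_rng(0)
seq_len, n_q_heads, n_kv_heads, d_head = 6, 8, 2, 4
group_size = n_q_heads // n_kv_heads  # 4 query heads share each K/V head

q = rng.standard_normal((n_q_heads, seq_len, d_head))
k = rng.standard_normal((n_kv_heads, seq_len, d_head))  # only n_kv_heads
v = rng.standard_normal((n_kv_heads, seq_len, d_head))  # are cached

# Broadcast each K/V head to its group of query heads.
k_rep = np.repeat(k, group_size, axis=0)  # (n_q_heads, seq_len, d_head)
v_rep = np.repeat(v, group_size, axis=0)

# Standard scaled dot-product attention from here on.
scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(d_head)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_rep  # (n_q_heads, seq_len, d_head)
```

Only `k` and `v` need to live in the KV cache, so memory scales with `n_kv_heads` while compute and output shape still match the full query-head count.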
Quick reference
- Proficiency
- Intermediate
- Also known as
- GQA, grouped query attention
- Prerequisites
- Self-attention, Multi-head attention, KV cache
Frequently asked questions
What is Grouped-Query Attention?
GQA is an attention variant where several query heads share one key/value head. For 32 query heads with 8 KV groups, every 4 query heads share one K/V projection — shrinking KV-cache memory by 4x with minimal quality loss.
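The head-to-group mapping in that example works out as follows (plain arithmetic, no framework assumed):

```python
n_q_heads, n_kv_heads = 32, 8
group_size = n_q_heads // n_kv_heads  # 4 query heads per K/V head

# Which K/V head serves each query head: heads 0-3 -> kv head 0,
# heads 4-7 -> kv head 1, and so on.
kv_head_for = [q // group_size for q in range(n_q_heads)]

cache_reduction = n_q_heads / n_kv_heads  # 4.0x smaller than MHA's cache
```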
How does GQA differ from MHA and MQA?
MHA has one K/V head per query head (biggest cache, full quality). MQA uses a single K/V head for all query heads (smallest cache, but a measurable quality drop). GQA groups heads in between: a cache nearly as small as MQA's with quality close to full MHA.
Why does GQA matter for inference?
At long contexts, the KV cache dominates GPU memory and bandwidth. Reducing the number of K/V heads by 4-8x shrinks the cache by the same factor, letting you serve longer contexts, run larger batches, and lower the cost per token.
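A back-of-the-envelope sketch makes the savings concrete. Using Llama 2 70B's published shape (80 layers, head dimension 128, 64 query heads, 8 K/V heads) and fp16 cache entries:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, dtype_bytes=2):
    # Two cached tensors per layer (K and V), fp16/bf16 by default.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * dtype_bytes

# Llama-2-70B-like shape: 80 layers, head dim 128, 4k context, batch 1.
mha = kv_cache_bytes(80, 64, 128, seq_len=4096, batch=1)  # 64 K/V heads
gqa = kv_cache_bytes(80, 8, 128, seq_len=4096, batch=1)   # 8 K/V heads

print(mha / 2**30, gqa / 2**30)  # 10.0 GiB vs 1.25 GiB
```

At batch 32 the MHA cache alone would exceed 320 GiB, which is why the 8x reduction is what makes high-batch, long-context serving feasible.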
Do I need to retrain a model to use GQA?
Not from scratch. Ainslie et al. (2023) showed you can 'uptrain' an existing MHA checkpoint into GQA: mean-pool each group's K/V projection heads into a single shared head, then continue pre-training briefly (around 5% of the original compute) to recover quality.
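The mean-pooling initialisation can be sketched in NumPy. Shapes here are illustrative (`n_heads`, `d_model`, `d_head` are small toy values), but the operation matches the paper's conversion recipe: average each group of per-head K (and likewise V) projection matrices into one shared head.

```python
import numpy as np

# Toy shapes for illustration only.
rng = np.random.default_rng(0)
d_model, n_heads, d_head, n_kv_heads = 16, 8, 4, 2
group = n_heads // n_kv_heads  # 4 MHA heads pooled into each GQA head

# Per-head K projection matrices from a pretrained MHA checkpoint.
w_k_mha = rng.standard_normal((n_heads, d_model, d_head))

# Mean-pool each group of heads into a single shared K head.
w_k_gqa = w_k_mha.reshape(n_kv_heads, group, d_model, d_head).mean(axis=1)
# w_k_gqa: (n_kv_heads, d_model, d_head); do the same for V, then fine-tune.
```

Query and output projections are left untouched; only the K and V weights are pooled, after which a short uptraining run restores most of the lost quality.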
Sources
- Ainslie et al. — GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints — accessed 2026-04-20
- Touvron et al. — Llama 2: Open Foundation and Fine-Tuned Chat Models — accessed 2026-04-20