Curiosity · Concept
Self-Attention
Self-attention is the core operation of the Transformer. For every token in a sequence, the model computes three vectors — query, key, and value — then scores each token against every other token and mixes values by those scores. The result is a representation for each token that knows about the rest of the sequence.
Quick reference
- Proficiency
- Intermediate
- Also known as
- scaled dot-product attention, intra-attention, QKV attention
- Prerequisites
- Linear algebra (matrix multiplication), Embeddings
Frequently asked questions
What is self-attention?
It is a mechanism where every token in a sequence looks at every other token to decide which ones are relevant, then produces a new representation that blends information from the relevant ones. It is 'self' attention because the sequence attends to itself.
What are Q, K, V in attention?
Each token's embedding is multiplied by three learned weight matrices to produce a query (what this token is looking for), a key (what this token offers), and a value (the actual content). The dot product of queries with keys gives attention scores, which then weight the values.
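The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the sequence length, dimensions, embeddings, and weight matrices are random stand-ins for learned values.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

# Token embeddings for a 4-token sequence (random stand-ins).
X = rng.standard_normal((seq_len, d_model))

# Learned projection matrices (random stand-ins for trained weights).
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention: scores, row-wise softmax, weighted sum of values.
scores = Q @ K.T / np.sqrt(d_k)                       # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # each row sums to 1
output = weights @ V                                  # (seq_len, d_k)

print(weights.shape, output.shape)  # (4, 4) (4, 8)
```

Row i of `weights` says how much token i attends to every token in the sequence, and row i of `output` is the resulting blend of values.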
Why is attention scaled by 1/sqrt(d)?
Without scaling, the dot product QK^T grows in magnitude with the dimension d, pushing the softmax into saturating regions with near-zero gradients. Dividing by sqrt(d) keeps variance stable and training well-conditioned.
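The variance argument is easy to check empirically. For independent unit-variance components, a d-dimensional dot product has variance d, so its standard deviation grows like sqrt(d); the sketch below (with an arbitrary d = 512 and random vectors) shows the 1/sqrt(d) factor bringing it back to roughly 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512

# Many random query/key pairs with unit-variance components.
q = rng.standard_normal((10_000, d))
k = rng.standard_normal((10_000, d))

raw = (q * k).sum(axis=1)      # unscaled dot products: std grows like sqrt(d)
scaled = raw / np.sqrt(d)      # scaled: std stays near 1

print(raw.std())     # roughly sqrt(512), about 22.6
print(scaled.std())  # roughly 1.0
```

Logits with standard deviation around 22 would push the softmax to near one-hot outputs, where gradients vanish; logits with standard deviation around 1 keep it in its responsive range.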
What is multi-head attention?
Instead of one attention operation, you run h smaller attention operations in parallel with different learned Q/K/V projections, then concatenate and project. Each head can specialize — syntax, coreference, long-range links — giving the model richer relational capacity.
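A common way to implement this is to project once at full width, then reshape into heads and batch the attention over the head axis. The sketch below assumes random stand-in weights and the usual convention d_head = d_model / h, including the final output projection that mixes the concatenated heads.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 16, 4
d_head = d_model // n_heads

X = rng.standard_normal((seq_len, d_model))
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))
W_o = rng.standard_normal((d_model, d_model))  # output projection

def split_heads(M):
    # (seq_len, d_model) -> (n_heads, seq_len, d_head)
    return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

# One scaled dot-product attention per head, batched over the head axis.
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq, seq)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
heads = weights @ V                                   # (n_heads, seq, d_head)

# Concatenate the heads back to full width and project.
concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
output = concat @ W_o                                 # (seq_len, d_model)

print(output.shape)  # (4, 16)
```

Because each head works in its own d_head-dimensional subspace with its own projections, the heads are free to learn different attention patterns at no extra cost over one full-width attention.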
Sources
- Vaswani et al. — Attention Is All You Need — accessed 2026-04-20
- The Annotated Transformer — Harvard NLP — accessed 2026-04-20