Curiosity · Concept

Self-Attention

Self-attention is the core operation of the Transformer. For every token in a sequence, the model computes three vectors — query, key, and value — then scores each token against every other token and mixes values by those scores. The result is a representation for each token that knows about the rest of the sequence.
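The whole operation fits in a few lines. Below is a minimal sketch of single-head scaled dot-product self-attention in NumPy; the projection matrices and sizes are illustrative placeholders, not values from any particular model.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token embeddings.
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices (random here).
    Returns (seq_len, d_k) context vectors, one per token.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                             # mix values by attention weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                        # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                                   # (4, 8)
```

Each output row is a weighted average of the value vectors, so every token's new representation blends in information from the tokens it attends to.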

Quick reference

Proficiency
Intermediate
Also known as
scaled dot-product attention, intra-attention, QKV attention
Prerequisites
Linear algebra (matrix multiplication), Embeddings

Frequently asked questions

What is self-attention?

It is a mechanism where every token in a sequence looks at every other token to decide which ones are relevant, then produces a new representation that blends information from the relevant ones. It is 'self' attention because the sequence attends to itself.

What are Q, K, V in attention?

Each token's embedding is multiplied by three learned weight matrices to produce a query (what this token is looking for), a key (what this token offers), and a value (the actual content). The dot product of queries with keys gives attention scores, which then weight the values.

Why is attention scaled by 1/sqrt(d)?

Without scaling, the variance of each dot product q·k grows linearly with the dimension d (for independent unit-variance components, Var(q·k) = d), so the logits become large and push the softmax into saturated regions with near-zero gradients. Dividing by sqrt(d) restores roughly unit variance and keeps training well-conditioned.
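A quick empirical check of this claim, sketched with random unit-variance vectors: unscaled dot products have variance close to d, while dividing by sqrt(d) brings it back near 1.

```python
import numpy as np

# Variance of q.k for random N(0, 1) vectors grows linearly with d;
# dividing by sqrt(d) keeps it near 1 regardless of dimension.
rng = np.random.default_rng(0)
for d in (16, 256):
    q = rng.normal(size=(10_000, d))
    k = rng.normal(size=(10_000, d))
    dots = (q * k).sum(axis=1)
    print(d, round(dots.var(), 1), round((dots / np.sqrt(d)).var(), 2))
```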

What is multi-head attention?

Instead of one attention operation, you run h smaller attention operations in parallel with different learned Q/K/V projections, then concatenate and project. Each head can specialize — syntax, coreference, long-range links — giving the model richer relational capacity.
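The split-attend-concatenate-project pattern can be sketched as follows; the head count and matrix shapes are illustrative assumptions, and each head simply reuses the scaled dot-product attention described above on its own slice of the projections.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (n, d_model). Wq/Wk/Wv/Wo: (d_model, d_model). h heads of size d_model // h."""
    n, d_model = X.shape
    d_head = d_model // h

    def split(M):  # (n, d_model) -> (h, n, d_head): one slice per head
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, n, n), per head
    heads = softmax(scores) @ V                          # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ Wo                                   # final output projection

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 16))                             # 5 tokens, d_model = 16
Wq, Wk, Wv, Wo = (rng.normal(size=(16, 16)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h=4)
print(out.shape)                                         # (5, 16)
```

Because each head works in a d_model / h subspace, the total cost is comparable to one full-width attention, but the heads can learn different relational patterns independently.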

Sources

  1. Vaswani et al. — Attention Is All You Need — accessed 2026-04-20
  2. The Annotated Transformer — Harvard NLP — accessed 2026-04-20