Curiosity · Concept

Attention Mechanism

Attention was introduced by Bahdanau et al. (2014) to let neural machine translation align output words with relevant input words instead of compressing the whole sentence into one fixed vector. The modern scaled dot-product form — softmax(QKᵀ/√d)V — became the core operation of the Transformer (Vaswani et al., 2017) and is a key reason LLMs can handle long-range dependencies. Attention is a general primitive: it shows up in vision transformers, graph neural networks, and speech models. Self-attention (where Q, K, V come from the same sequence) is the special case that powers GPT, Claude, and most modern LLMs.

Quick reference

Proficiency
Intermediate
Also known as
scaled dot-product attention
Prerequisites
neural networks, matrix multiplication

Frequently asked questions

What is the attention mechanism?

Attention is a neural network operation that computes a weighted sum of input values for each output position, where the weights are learned from the similarity between a query vector and a set of key vectors. It lets the model focus on the most relevant parts of the input for each output.

What do Q, K, V stand for?

Query (what am I looking for), Key (what this position offers), and Value (the actual content). Scores = QKᵀ/√d, weights = softmax(scores), output = weights × V. They're usually linear projections of the same input in self-attention.
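The three steps above (scores, softmax weights, weighted sum) fit in a few lines of NumPy. This is a minimal sketch, not a production implementation: the function name, toy dimensions, and random projection matrices are illustrative, and it handles a single unbatched sequence with no masking.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q Kᵀ / √d) V for one unbatched sequence."""
    d = Q.shape[-1]                              # key/query dimension
    scores = Q @ K.T / np.sqrt(d)                # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True) # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # weighted sum of value rows

# Self-attention toy example: Q, K, V are projections of the same input X.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                      # 4 tokens, model dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (4, 8): one output vector per input token
```

Each row of `weights` sums to 1, so each output position is a convex combination of the value vectors — the "focus on the most relevant parts" described above.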

Why divide by √d?

Without it, dot products between high-dimensional vectors grow large, pushing the softmax into saturated regions with vanishingly small gradients. If query and key components have unit variance, the dot product of d-dimensional vectors has variance d, so dividing by √d restores the scores to roughly unit variance regardless of dimension.
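A quick numerical check of that variance claim (the dimension and sample count here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                 # key dimension
q = rng.normal(size=(10_000, d))        # unit-variance components
k = rng.normal(size=(10_000, d))

raw = (q * k).sum(axis=1)               # raw dot products: variance ≈ d
scaled = raw / np.sqrt(d)               # after scaling: variance ≈ 1

print(raw.var(), scaled.var())          # roughly 512 and 1
```

At variance ≈ d, score gaps of tens of standard deviations push softmax outputs to near one-hot vectors, where gradients through the softmax are almost zero; scaling keeps the scores in the softmax's responsive range.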

Self-attention vs cross-attention?

Self-attention: queries, keys, and values all come from the same sequence — used inside every transformer layer. Cross-attention: queries come from one sequence (e.g., the decoder) and keys/values from another (e.g., the encoder), used in encoder-decoder models for translation and multimodal fusion.
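The only difference between the two is which sequence each argument comes from, which a sketch makes concrete. Assumptions: same minimal unmasked attention as the definition above, with random arrays standing in for encoder and decoder states.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q Kᵀ / √d) V — minimal, unbatched, no masking."""
    d = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d)
    s -= s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
enc = rng.normal(size=(6, 8))   # encoder states: 6 source tokens, dim 8
dec = rng.normal(size=(3, 8))   # decoder states: 3 target tokens, dim 8

self_out  = attention(dec, dec, dec)   # self-attention: all from decoder
cross_out = attention(dec, enc, enc)   # cross-attention: K/V from encoder
print(self_out.shape, cross_out.shape)  # (3, 8) (3, 8)
```

In both cases the output has one row per *query*, so cross-attention produces decoder-length output built from encoder content — exactly the alignment role it plays in translation.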

Sources

  1. Vaswani et al. (2017) — Attention Is All You Need — accessed 2026-04-20
  2. Bahdanau et al. (2014) — Neural Machine Translation by Jointly Learning to Align and Translate — accessed 2026-04-20