Curiosity · Concept
Positional Encoding
Self-attention treats its input as a set — it has no notion of order. Positional encoding fixes this by adding a position-dependent signal to token embeddings so the model knows 'the cat sat on the mat' differs from 'the mat sat on the cat'. Schemes range from fixed sinusoids to learned embeddings to modern rotary encodings.
Quick reference
- Proficiency: Intermediate
- Also known as: position embeddings, positional embeddings
- Prerequisites: self-attention, Transformer architecture
Frequently asked questions
What is positional encoding?
A mechanism that adds position information to each token's representation so the Transformer knows the order of the sequence. Without it, the attention operation treats the input as an unordered set.
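As a concrete illustration, here is a minimal NumPy sketch of the original fixed sinusoidal scheme from Vaswani et al., where position `pos` and embedding dimension pair `2i`/`2i+1` get `sin` and `cos` signals at geometrically spaced frequencies (function name and shapes are our own, not from any particular library):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Fixed sinusoidal positional encoding (Vaswani et al.):
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims
    pe[:, 1::2] = np.cos(angles)                 # odd dims
    return pe

# The encoding is simply added to the token embeddings before the
# first attention layer: x = token_embeddings + sinusoidal_encoding(...)
```

Because each position gets a distinct pattern of phases, attention can distinguish otherwise identical tokens at different positions.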
What's the difference between absolute and relative positional encodings?
Absolute encodings (sinusoidal or learned) give each position a unique vector. Relative encodings describe the offset between two tokens (how far apart they are), which generalizes better to sequences longer than those seen in training. Modern LLMs mostly use relative variants like RoPE or ALiBi.
Why did modern LLMs move from sinusoidal to RoPE?
Rotary Position Embedding (RoPE) applies a position-dependent rotation to query and key vectors inside attention, so relative position falls out naturally from the dot product. RoPE extrapolates better to long contexts and plays well with techniques like YaRN for context extension, which is why Llama, Qwen, and DeepSeek use it.
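The rotation idea can be sketched in a few lines of NumPy: each pair of dimensions `(2i, 2i+1)` is rotated by an angle proportional to the token's position, so the query–key dot product depends only on the positional offset. This is a simplified single-sequence sketch (the `base=10000` default matches the RoFormer paper; batching and head dimensions are omitted):

```python
import numpy as np

def rope(x, positions, base=10000):
    """Rotate each (2i, 2i+1) dim pair of x by pos * theta_i,
    with theta_i = base^(-2i/d). x: (seq_len, d), d even."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)      # (d/2,) frequencies
    angles = positions[:, None] * theta[None, :]   # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

The key property: `rope(q, [m]) @ rope(k, [n]).T` depends only on `m - n`, which is why relative position "falls out" of attention without any extra bias terms.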
What is ALiBi?
Attention with Linear Biases adds a distance-based penalty to attention scores instead of modifying embeddings. It is simple, extrapolates to much longer contexts than training, and is used by models like BLOOM and MPT.
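A minimal sketch of the ALiBi bias matrix, assuming causal attention and the head-specific geometric slopes described by Press et al. (the helper name and exact slope formula for non-power-of-two head counts are simplifications):

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Per-head additive bias: score(i, j) gets -m_h * (i - j) for j <= i,
    so more distant keys are penalized linearly. Slopes form a geometric
    sequence, here 2^(-8h/num_heads) for head h = 1..num_heads."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    dist = -(i - j).clip(min=0)                   # 0 on diagonal, negative further back
    return slopes[:, None, None] * dist[None]     # (num_heads, seq_len, seq_len)

# The bias is added to raw attention scores before softmax; token
# embeddings themselves carry no positional signal at all.
```

Because the penalty grows linearly with distance at inference-time positions never seen in training, ALiBi degrades gracefully on longer sequences.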
Sources
- Vaswani et al. — Attention Is All You Need — accessed 2026-04-20
- Su et al. — RoFormer: Enhanced Transformer with Rotary Position Embedding — accessed 2026-04-20
- Press et al. — Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (ALiBi) — accessed 2026-04-20