Curiosity · Concept
Local Attention
Local Attention is the umbrella term for any attention pattern that restricts each token to a bounded neighbourhood, trading full global connectivity for linear or near-linear cost. Sliding-window attention, block-local attention, and dilated attention are all local-attention variants. Local attention is usually paired with a small number of global tokens (e.g., Longformer's globally-attending [CLS] token, or StreamingLLM's 'attention sink' tokens) so the model can still pool information across the whole sequence.
Quick reference
- Proficiency
- Intermediate
- Also known as
- local self-attention, windowed attention
- Prerequisites
- Self-attention, Attention mechanism
Frequently asked questions
What is local attention?
Local attention is any attention pattern in which each token attends only to a bounded local neighbourhood of tokens. This reduces the O(N^2) cost of full attention to roughly O(N*W) for a window size W, where W is much smaller than N.
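The windowing idea can be made concrete as a boolean attention mask. A minimal NumPy sketch (the function name and toy sizes are illustrative, not from any library):

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean mask: entry (i, j) is True if token i may attend to token j.
    Each token sees the w tokens on either side of itself (a symmetric
    window; a causal variant would additionally require j <= i)."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

mask = sliding_window_mask(n=8, w=2)
print(mask.sum())   # 34 allowed (query, key) pairs, vs 64 for full attention
```

The number of allowed pairs grows like N * (2W + 1) rather than N * N, which is where the near-linear cost comes from.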
What's the difference between local and sliding window attention?
Sliding window attention is a specific local-attention pattern whose window slides with the token position. Other local variants include fixed block-local patterns (Sparse Transformer), dilated windows (Longformer), and random-plus-local mixes (BigBird).
How does local attention handle long-range dependencies?
Two mechanisms: (1) stack layers, so the receptive field grows roughly linearly with depth, and (2) add a small set of 'global' tokens that attend to, and are attended by, the whole sequence. Longformer and BigBird use both.
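The global-token mechanism amounts to punching full rows and columns into the local mask. A minimal sketch, assuming a symmetric window plus a caller-chosen list of global positions (names and sizes are mine, not Longformer's API):

```python
import numpy as np

def local_plus_global_mask(n, w, global_idx):
    """Sliding window plus designated global tokens that attend to,
    and are attended by, every position (Longformer/BigBird style)."""
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= w
    g = np.asarray(global_idx)
    mask[g, :] = True   # global tokens read the whole sequence
    mask[:, g] = True   # every token can read the global tokens
    return mask

# Stacking L purely local layers also widens the receptive field to
# roughly w * L tokens per side, so distant information can still
# propagate layer by layer even without global tokens.
mask = local_plus_global_mask(n=16, w=1, global_idx=[0])
```

With this mask, any two tokens are at most two hops apart (token -> global token -> token), which is how a handful of global positions restores sequence-wide information flow.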
When should I use local attention?
When sequences are long (documents, code, audio) and full attention cost becomes the bottleneck, and when the task has strong locality so most relevant context lives nearby.
Sources
- Beltagy et al. — Longformer: The Long-Document Transformer — accessed 2026-04-20
- Zaheer et al. — BigBird: Transformers for Longer Sequences — accessed 2026-04-20