Curiosity · Concept
Sliding Window Attention
Sliding Window Attention (SWA) restricts each token to attend only to the previous W tokens instead of the entire sequence. That makes the compute and memory cost of attention linear in sequence length rather than quadratic, which is what makes long contexts practical. Because attention layers stack, information still propagates across distances larger than W — a token at layer L can 'see' up to L*W tokens of history. Mistral 7B popularised SWA in production LLMs by pairing it with FlashAttention and a rolling key-value (KV) cache.
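The rolling KV cache mentioned above can be sketched in a few lines. This is a hypothetical minimal helper, not Mistral's actual implementation: because no token ever attends further back than W positions, the cache only needs slots for the last W key/value vectors, and new entries overwrite the oldest ones.

```python
import numpy as np

class RollingKVCache:
    """Fixed-size KV buffer: memory stays O(W) however long generation runs.

    Note: entries sit in buffer order, not generation order. That is fine for
    attention (softmax is order-invariant) as long as absolute positions are
    tracked separately for positional encodings.
    """

    def __init__(self, window: int, d_head: int):
        self.window = window
        self.keys = np.zeros((window, d_head))
        self.values = np.zeros((window, d_head))
        self.pos = 0  # total tokens seen so far

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        slot = self.pos % self.window  # wrap around: overwrite the oldest slot
        self.keys[slot] = k
        self.values[slot] = v
        self.pos += 1

    def current(self):
        """K/V for the most recent min(pos, window) tokens."""
        n = min(self.pos, self.window)
        return self.keys[:n], self.values[:n]
```

With a window of 4, appending a 5th token silently evicts the 1st — exactly the behaviour the attention mask permits, since token 5 could never attend to token 1 anyway.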
Quick reference
- Proficiency: Intermediate
- Also known as: SWA, windowed attention
- Prerequisites: Self-attention, Attention mechanism, Context window
Frequently asked questions
What is sliding window attention?
Sliding window attention is a causal attention pattern in which each token attends only to the previous W tokens (for example W=4096), instead of every token back to the start. Attention cost becomes linear in sequence length.
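The pattern is easiest to see as a boolean attention mask. A minimal NumPy sketch, where position i may attend to position j iff j lies in the window (i-W, i] — causal, and at most W positions including the token itself:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """True where attention is allowed: j <= i (causal) and j > i - window."""
    i = np.arange(seq_len)[:, None]  # query positions, as a column
    j = np.arange(seq_len)[None, :]  # key positions, as a row
    return (j <= i) & (j > i - window)
```

Each row of the mask has at most `window` True entries, which is why the cost per token is O(W) instead of O(n): multiplying this against an n x n score matrix (or, in practice, only ever computing the windowed scores) replaces the quadratic term with a linear one.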
How does SWA still handle long-range dependencies?
Because attention is stacked across layers, information propagates layer by layer. A token at layer L can indirectly attend to content up to L*W tokens behind it, giving a much larger effective receptive field than a single window.
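The arithmetic behind that receptive-field claim, using the numbers the Mistral 7B paper reports (W = 4096, 32 layers):

```python
# Each layer lets information hop back up to W tokens, so after L stacked
# layers a token can be influenced by content up to roughly L * W tokens
# behind it (an upper bound; the signal may degrade along the way).
W = 4096   # window size per layer
L = 32     # number of transformer layers
receptive_field = L * W
print(receptive_field)  # 131072 tokens, i.e. ~128K
```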
Which models use sliding window attention?
Mistral 7B uses SWA with a 4096-token window. Longformer uses a related local attention pattern augmented with a few global tokens. StreamingLLM combines windowed attention with 'attention sink' tokens to keep streaming generation stable over very long inputs.
What are the limitations of SWA?
Purely local attention can struggle with very long-range reasoning where information must flow unchanged across many layers. Hybrid designs (SWA + a few global tokens, or interleaved full and local layers) are common fixes.
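One of those hybrid fixes, a Longformer-style local+global pattern, can be sketched as a mask. This is an illustrative simplification (causal, single head, and `local_global_mask` is a name chosen here, not a library API): designated global positions are visible to every later token and themselves attend to everything before them.

```python
import numpy as np

def local_global_mask(seq_len: int, window: int, global_positions) -> np.ndarray:
    """Causal local window plus a few globally-visible token positions."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = (j <= i) & (j > i - window)        # plain sliding window
    for g in global_positions:
        mask[:, g] |= np.arange(seq_len) >= g  # all later tokens see g
        mask[g, : g + 1] = True                # g sees all earlier tokens
    return mask
```

With even one global token (say position 0), a fact stored there reaches every later position in a single attention hop, instead of having to survive layer-by-layer relay through the window.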
Sources
- Jiang et al. — Mistral 7B — accessed 2026-04-20
- Beltagy et al. — Longformer: The Long-Document Transformer — accessed 2026-04-20