Curiosity · Concept

Context Window

A context window is the fixed upper limit on how many tokens an LLM can attend to at once. In 2026, frontier models offer 200k to 2M token windows, but longer is not always better — cost, latency, and attention quality degrade at the tail. Understanding the real limits behind a published number is key to building good LLM systems.

Quick reference

Proficiency
Beginner
Also known as
context length, max tokens, sequence length
Prerequisites
Tokenization

Frequently asked questions

What is the context window of an LLM?

It is the maximum number of tokens the model can process in a single request. Everything — system prompt, conversation history, retrieved documents, user message, and the generated answer — must fit within this limit.
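Because prompt and output share one budget, it is worth checking the budget before sending a request. A minimal sketch, using the common rule of thumb of roughly four characters per token for English text (production code should use the provider's actual tokenizer; the function names here are illustrative):

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: English text averages ~4 characters per token."""
    return max(1, len(text) // 4)

def fits_in_context(parts: list[str], window: int, reserve_output: int = 1024) -> bool:
    """True if all prompt parts plus a reserved output budget fit in the window."""
    prompt_tokens = sum(estimate_tokens(p) for p in parts)
    return prompt_tokens + reserve_output <= window

parts = [
    "You are a helpful assistant.",   # system prompt
    "Earlier conversation turns...",  # history
    "Retrieved document text...",     # retrieved context
    "What does the contract say?",    # user message
]
print(fits_in_context(parts, window=200_000))  # True
```

Reserving output tokens up front matters: a prompt that exactly fills the window leaves no room for the answer.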

How big are modern context windows?

Claude 3.5 Sonnet offers 200k tokens and GPT-4 Turbo 128k; Claude Opus 4.x goes up to 1M on some tiers; Gemini 1.5/2.0 Pro reaches 1-2M; open-weight models such as Llama 3.1 and Qwen 2.5 range from 128k to 1M depending on the variant. Context has grown from 2k (GPT-3) to millions of tokens in a few years.

What is 'lost in the middle'?

A phenomenon where models retrieve information much better from the beginning and end of a long context than from the middle (Liu et al., 2023). Simply stuffing more documents into a long context does not guarantee the model will use them; position matters.

Should I just use a giant context instead of RAG?

Sometimes. Long context is simpler and handles cross-document reasoning well. But RAG is cheaper per query, updates easily, cites sources, and avoids 'lost in the middle'. For static, bounded corpora a big context can win; for large or changing knowledge bases, RAG usually wins.
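The cost gap is easy to see with back-of-envelope arithmetic. A sketch with illustrative numbers (the price and corpus sizes below are assumptions, not any vendor's actual figures):

```python
PRICE_PER_MTOK = 3.00      # assumed $ per million input tokens (illustrative)

corpus_tokens = 500_000    # whole corpus stuffed into the prompt every query
retrieved_tokens = 4_000   # a handful of retrieved chunks per query via RAG

long_context_cost = corpus_tokens / 1e6 * PRICE_PER_MTOK
rag_cost = retrieved_tokens / 1e6 * PRICE_PER_MTOK

print(f"long context: ${long_context_cost:.3f}/query")  # $1.500/query
print(f"RAG:          ${rag_cost:.3f}/query")           # $0.012/query
```

At these assumed numbers the long-context approach pays for the entire corpus on every request, so per-query input cost is over a hundred times higher; prompt caching narrows the gap for repeated corpora but does not eliminate it.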

Sources

  1. Liu et al. — Lost in the Middle — accessed 2026-04-20
  2. Google DeepMind — Gemini 1.5: Million-token context — accessed 2026-04-20
  3. Anthropic — Claude context windows — accessed 2026-04-20