
LongBench — Long-Horizon Agent Benchmark

LongBench tests what happens when you push an agent past five or ten turns: Can it maintain a plan across 50 tool calls? Can it find the one relevant fact buried in a 100K-token document? It has become a core benchmark for evaluating context engineering, memory, and sustained reasoning, all the things real agents actually need.

Protocol facts

Sponsor
Academic (Tsinghua + collaborators)
Status
stable
Spec
https://github.com/THUDM/LongBench
Interop with
Claude 200K+ context, Gemini long-context, GPT-5

Frequently asked questions

What distinguishes long-horizon from long-context?

Long-context = one huge input. Long-horizon = many small inputs over many steps. LongBench tests both, because a real agent often needs to operate on large documents over many turns.

Why is long-horizon hard for agents?

Errors compound multiplicatively. A 95% per-step success rate drops below 50% within 15 steps (0.95^15 ≈ 0.46). Agents need to maintain plans, recover from mistakes, and avoid context poisoning from accumulated history.
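The compounding claim above can be sketched in a few lines. This is a simplified model assuming independent steps with a uniform per-step success probability; real agent failures are correlated, so treat it as an intuition pump, not a measurement.

```python
# Sketch: how per-step reliability compounds over a long-horizon task.
# Assumption: steps fail independently with equal probability (a simplification;
# real agent errors are often correlated via shared context).

def horizon_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` steps succeed."""
    return per_step ** steps

for n in (1, 5, 10, 15, 30, 50):
    print(f"{n:>2} steps @ 95%/step -> {horizon_success(0.95, n):.1%}")
```

At 50 steps, the scale LongBench probes, a 95%-reliable agent completes the full trajectory well under 10% of the time, which is why recovery and re-planning matter more than raw per-step accuracy.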

What's the relationship with LongBench v2?

LongBench v2 (2024) extends the original with harder multi-document reasoning and tasks that require synthesizing information across 100K+ token inputs — tracking frontier model capability.

Sources

  1. LongBench GitHub — accessed 2026-04-20
  2. LongBench paper — accessed 2026-04-20