
LongBench — Long-Horizon Agent Benchmark

LongBench tests what happens when you push an agent past five or ten turns: Can it maintain a plan across 50 tool calls? Can it find the one relevant fact buried in a 100K-token document? It has become a core benchmark for evaluating context engineering, memory, and sustained reasoning, all the things real agents actually need.

Protocol facts

Sponsor
Academic (Tsinghua + collaborators)
Status
stable
Spec
https://github.com/THUDM/LongBench
Interop with
Claude 200K+ context, Gemini long-context, GPT-5

Frequently asked questions

What distinguishes long-horizon from long-context?

Long-context = one huge input. Long-horizon = many small inputs over many steps. LongBench tests both, because a real agent often needs to operate on large documents over many turns.

Why is long-horizon hard for agents?

Errors compound multiplicatively. A 95% per-step success rate drops below 50% within 15 steps (0.95^15 ≈ 0.46). Agents need to maintain plans, recover from mistakes, and avoid context poisoning from accumulated history.
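The compounding claim above can be sketched in a few lines. This is a simplified model assuming independent steps with a uniform per-step success probability; real agent failures are correlated, so treat it as an intuition pump, not a measurement.

```python
# Sketch: how per-step reliability compounds over a long-horizon task.
# Assumption: steps fail independently with equal probability (a simplification;
# real agent errors are often correlated via shared context).

def horizon_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` steps succeed."""
    return per_step ** steps

for n in (1, 5, 10, 15, 30, 50):
    print(f"{n:>2} steps @ 95%/step -> {horizon_success(0.95, n):.1%}")
```

At 50 steps, the scale LongBench probes, a 95%-reliable agent completes the full trajectory well under 10% of the time, which is why recovery and re-planning matter more than raw per-step accuracy.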

What's the relationship with LongBench v2?

LongBench v2 (2024) extends the original with harder multi-document reasoning and tasks that require synthesizing information across 100K+ token inputs — tracking frontier model capability.

Sources

  1. LongBench GitHub — accessed 2026-04-20
  2. LongBench paper — accessed 2026-04-20