AgentBench: Multi-Environment LLM Agent Benchmark
AgentBench, introduced by Tsinghua University in 2023 and updated through 2025, is presented as the first systematic benchmark for evaluating LLMs as autonomous agents across eight heterogeneous environments: operating system, database, knowledge graph, digital card game, lateral thinking puzzles, house-holding, web shopping, and web browsing. Its results expose sharp capability gaps between frontier closed models and open-weight alternatives.
Protocol facts
- Sponsor
- Tsinghua University (KEG Lab)
- Status
- stable
- Spec
- https://github.com/THUDM/AgentBench
- Interop with
- GAIA, WebArena, OSWorld
Frequently asked questions
What environments does AgentBench cover?
Eight: OS (bash tasks), Database (SQL), Knowledge Graph, Digital Card Game, Lateral Thinking Puzzles, House-holding (ALFWorld), Web Shopping (WebShop), and Web Browsing (Mind2Web).
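All eight environments share the same multi-turn pattern: the model receives a text observation, emits a text action, and the environment returns feedback until the task succeeds or the turn budget runs out. The sketch below illustrates that loop with a toy guessing environment and a scripted stand-in for the LLM; the `Env` and `ScriptedAgent` names are illustrative, not AgentBench's actual API.

```python
# Illustrative observation -> action loop shared by AgentBench-style
# environments. All class names here are hypothetical, not AgentBench code.

class Env:
    """Toy text environment: guess a hidden integer within a turn budget."""
    def __init__(self, target=7, max_turns=5):
        self.target, self.turns_left = target, max_turns

    def reset(self):
        return "Guess an integer between 1 and 10."

    def step(self, action: str):
        self.turns_left -= 1
        guess = int(action)
        if guess == self.target:
            return "correct", 1.0, True          # obs, reward, done
        done = self.turns_left == 0
        hint = "higher" if guess < self.target else "lower"
        return hint, 0.0, done

class ScriptedAgent:
    """Stands in for the LLM: binary-searches using the textual hints."""
    def __init__(self):
        self.lo, self.hi, self.last = 1, 10, 0

    def act(self, observation: str) -> str:
        if observation == "higher":
            self.lo = self.last + 1
        elif observation == "lower":
            self.hi = self.last - 1
        self.last = (self.lo + self.hi) // 2
        return str(self.last)

def run_episode(agent, env):
    obs, done, reward = env.reset(), False, 0.0
    while not done:
        obs, reward, done = env.step(agent.act(obs))
    return reward

print(run_episode(ScriptedAgent(), Env()))  # 1.0: target found within budget
```

Real AgentBench environments differ in observation format and scoring (e.g. exact-match SQL results vs. task success flags), but each reduces to this episode loop with programmatic reward at the end.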
Why evaluate across many environments?
A single environment rewards narrow specialization. AgentBench's breadth exposes whether a model truly generalizes agentic skill or only handles one task family.
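Because raw metrics differ wildly across environments (SQL accuracy vs. game win rate), a breadth benchmark needs normalization before averaging, or the easiest task family dominates. The sketch below shows one such scheme in the spirit of AgentBench's overall score, dividing each environment's raw metric by a per-environment reference weight; all scores and weights here are invented for illustration.

```python
# Cross-environment aggregation sketch: normalize each environment's raw
# metric by a reference weight (e.g. the mean score of a model pool on
# that environment), then average. All numbers below are made up.

raw_scores = {          # one model's raw success metric per environment
    "os": 0.42, "db": 0.30, "kg": 0.25, "card_game": 0.18,
    "puzzles": 0.10, "household": 0.35, "web_shop": 0.55, "web_browse": 0.20,
}
norm_weights = {        # hypothetical reference-pool means per environment
    "os": 0.30, "db": 0.25, "kg": 0.20, "card_game": 0.15,
    "puzzles": 0.08, "household": 0.28, "web_shop": 0.45, "web_browse": 0.15,
}

# Each term is "how many times better than the reference pool"; the mean
# over environments rewards broad competence, not one strong task family.
overall = sum(raw_scores[e] / norm_weights[e] for e in raw_scores) / len(raw_scores)
print(round(overall, 3))
```

A model that excels only at web shopping would see that single large ratio diluted by seven near-zero ones, which is exactly the specialization-vs-generalization distinction the FAQ answer describes.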
How does AgentBench relate to GAIA?
Both target general agent capability, but AgentBench emphasizes controlled, structured environments with programmatic scoring; GAIA targets open-web, tool-chaining questions.
Sources
- AgentBench: Evaluating LLMs as Agents (arXiv 2308.03688) — accessed 2026-04-20
- AgentBench GitHub — accessed 2026-04-20