
AgentBench: Multi-Environment LLM Agent Benchmark

AgentBench, introduced by Tsinghua University's KEG lab in 2023 and updated through 2025, is the first systematic benchmark to evaluate LLMs as autonomous agents across eight heterogeneous environments: operating system, database, knowledge graph, digital card game, lateral thinking puzzles, house-holding, web shopping, and web browsing. Its results expose sharp capability gaps between frontier closed models and open-weight alternatives.

Protocol facts

Sponsor: Tsinghua University (KEG Lab)
Status: stable
Spec: https://github.com/THUDM/AgentBench
Interop with: GAIA, WebArena, OSWorld

Frequently asked questions

What environments does AgentBench cover?

Eight: OS (bash tasks), Database (SQL), Knowledge Graph, Digital Card Game, Lateral Thinking Puzzles, House-holding (ALFWorld), Web Shopping (WebShop), and Web Browsing (Mind2Web).
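
The exact harness for each environment lives in the GitHub repository; as a rough illustration only (the `run_bash` helper, `evaluate_os_task` loop, and task dict below are hypothetical stand-ins, not AgentBench's API), a multi-turn OS-environment task reduces to a loop of model-issued bash commands, shell output fed back as the next observation, and a deterministic success check:

```python
import subprocess

def run_bash(command: str, timeout: int = 10) -> str:
    """Execute one bash command and return its output (stand-in for the OS environment)."""
    result = subprocess.run(
        ["bash", "-c", command], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout.strip()

def evaluate_os_task(agent, task: dict, max_turns: int = 5) -> bool:
    """Multi-turn loop: the agent proposes commands, sees the output,
    and the episode is scored by a programmatic check on the final answer."""
    observation = task["instruction"]
    for _ in range(max_turns):
        action = agent(observation)       # model proposes the next bash command
        observation = run_bash(action)    # environment returns the command's output
        if task["check"](observation):    # deterministic success check
            return True
    return False

# Toy usage: a scripted "agent" that solves a file-counting task in one command.
task = {
    "instruction": "How many files are in /tmp? Reply with a bash command.",
    "check": lambda out: out.isdigit(),
}
print(evaluate_os_task(lambda obs: "ls /tmp | wc -l", task))
```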

Why evaluate across many environments?

A single environment rewards narrow specialization. AgentBench's breadth exposes whether a model truly generalizes agentic skill or only handles one task family.
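
A minimal sketch of why that matters for scoring, using made-up per-environment success rates and a simple max-normalized average (AgentBench's published overall score uses its own per-environment weighting, so this is illustrative, not the official formula):

```python
from statistics import mean

# Hypothetical per-environment success rates for two models (0-1 scale).
scores = {
    "model_a": {"os": 0.42, "db": 0.35, "webshop": 0.61},
    "model_b": {"os": 0.05, "db": 0.08, "webshop": 0.70},
}

# Normalize each environment by the best observed score so that no single
# easy (or hard) environment dominates the aggregate.
envs = scores["model_a"].keys()
best = {e: max(m[e] for m in scores.values()) for e in envs}

for model, per_env in scores.items():
    overall = mean(per_env[e] / best[e] for e in envs)
    print(f"{model}: overall={overall:.2f}")
```

Under this toy aggregation, model_b wins web shopping yet lands far behind overall, which is exactly the narrow-specialization pattern a multi-environment benchmark is designed to surface.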

How does AgentBench relate to GAIA?

Both target general agent capability, but AgentBench emphasizes controlled, structured environments with programmatic scoring; GAIA targets open-web, tool-chaining questions.
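
As a rough sketch of what programmatic scoring means in a structured environment (the schema, queries, and `same_result` checker below are invented for illustration, not taken from AgentBench's database task): a predicted SQL query can be executed and its result set compared against a reference query, so no human grading is needed.

```python
import sqlite3

def same_result(db: sqlite3.Connection, predicted_sql: str, reference_sql: str) -> bool:
    """Score a SQL answer by executing it and comparing row sets with a reference query."""
    try:
        predicted = set(db.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        return False  # invalid SQL counts as a failure
    reference = set(db.execute(reference_sql).fetchall())
    return predicted == reference

# Toy database standing in for a structured evaluation environment.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, total REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 30.0), (3, 12.0)])

print(same_result(db, "SELECT id FROM orders WHERE total > 10",
                      "SELECT id FROM orders WHERE total > 10.0"))  # True
```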

Sources

  1. AgentBench: Evaluating LLMs as Agents (arXiv 2308.03688) — accessed 2026-04-20
  2. AgentBench GitHub — accessed 2026-04-20