WebArena: Realistic Web-Agent Benchmark
WebArena, released by researchers at Carnegie Mellon University in 2023, is a self-hostable benchmark environment built around four fully functional websites: an e-commerce store (OneStopShop, built on Adobe Magento), a Reddit-style forum, a GitLab instance, and a content-management (CMS) admin portal. It also bundles supporting tools such as a map service. Agents receive natural-language instructions like 'create a pull request that fixes the typo in README' and are scored on functional outcome, not trajectory. It is widely used as a reference benchmark for agents working on realistic web tasks.
Protocol facts
- Sponsor: Carnegie Mellon University
- Status: stable
- Spec: https://webarena.dev/
- Interop with: AgentBench, OSWorld, browser-use
Frequently asked questions
Why is WebArena self-hosted?
Live websites break, change, and rate-limit. A self-hosted stack gives deterministic, reproducible evaluation — critical for comparing agents fairly over time.
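One practical consequence of self-hosting is that an evaluation harness has to be told where each local service lives. The sketch below is a minimal illustration of that idea, not the project's actual setup: the environment variable names and the `resolve_endpoints` helper are hypothetical placeholders, so check the official documentation for the real configuration keys.

```python
import os
from typing import Dict, Mapping, Sequence

# Hypothetical variable names, one per self-hosted service.
# They are placeholders, not the project's actual configuration keys.
DEFAULT_SERVICES = (
    "SHOPPING_URL",
    "FORUM_URL",
    "CODE_HOSTING_URL",
    "CMS_URL",
    "MAP_URL",
)


def resolve_endpoints(
    env: Mapping[str, str] = os.environ,
    services: Sequence[str] = DEFAULT_SERVICES,
) -> Dict[str, str]:
    """Return a base URL per service, failing fast if any is unset."""
    missing = [name for name in services if not env.get(name)]
    if missing:
        raise RuntimeError("Unset service URLs: " + ", ".join(missing))
    # Normalise trailing slashes so later URL comparisons are stable.
    return {name: env[name].rstrip("/") for name in services}
```

Failing before the first task starts, rather than partway through a run, keeps results comparable: every agent is evaluated against the same complete set of services.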
How are tasks scored?
By functional outcome: database state checks, URL patterns, or exact-answer matches. Trajectory doesn't matter — only whether the end state is correct.
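To make the outcome-only idea concrete, here is a minimal sketch of such a scorer. It is an illustration under stated assumptions, not WebArena's evaluator: `FinalState`, the three check builders, and the example order URL are all hypothetical, and the `records` dictionary merely stands in for a real database query.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable
from urllib.parse import urlparse


@dataclass
class FinalState:
    """Snapshot taken after the agent stops (hypothetical structure)."""
    answer: str = ""  # text the agent returned, if any
    url: str = ""     # page the agent ended on
    records: Dict[str, str] = field(default_factory=dict)  # stand-in for backend state


Check = Callable[[FinalState], bool]


def answer_is(expected: str) -> Check:
    """Exact-answer match, ignoring case and surrounding whitespace."""
    return lambda s: s.answer.strip().lower() == expected.strip().lower()


def url_path_is(expected_path: str) -> Check:
    """URL-pattern check: compare only the path, not host or query."""
    return lambda s: urlparse(s.url).path.rstrip("/") == expected_path.rstrip("/")


def record_is(key: str, expected: str) -> Check:
    """State check: a stored value must equal the expected one."""
    return lambda s: s.records.get(key) == expected


def score(state: FinalState, checks: Iterable[Check]) -> float:
    """Binary reward: 1.0 only if every check passes on the end state."""
    return 1.0 if all(check(state) for check in checks) else 0.0


# Hypothetical task: cancel order 170 and end on its detail page.
checks = [url_path_is("/orders/170"), record_is("order_170_status", "canceled")]
done = FinalState(
    url="http://localhost:7770/orders/170/",
    records={"order_170_status": "canceled"},
)
print(score(done, checks))  # 1.0
```

Note that `FinalState` carries no record of the clicks or pages that led there, so two agents that reach the same end state by different routes receive the same score.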
What's VisualWebArena?
A follow-up that adds visually-grounded tasks (e.g., 'click the red button shaped like a star') requiring vision-language models rather than text-only HTML parsing.
Sources
- WebArena: A Realistic Web Environment for Building Autonomous Agents (arXiv:2307.13854), accessed 2026-04-20
- WebArena official site (https://webarena.dev/), accessed 2026-04-20