WebArena: Realistic Web-Agent Benchmark

WebArena, released by researchers at Carnegie Mellon University in 2023, is a self-hostable benchmark environment with four fully functional websites: an e-commerce store (OneStopShop), a Reddit-style forum, a GitLab instance, and a content management system (the Adobe Magento admin portal), plus a map service. Agents receive natural-language instructions like 'create a pull request that fixes the typo in README' and are scored on functional outcome, not trajectory. It is widely used as a reference benchmark for realistic web tasks.

Protocol facts

Sponsor
Carnegie Mellon University
Status
stable
Spec
https://webarena.dev/
Interop with
AgentBench, OSWorld, browser-use

Frequently asked questions

Why is WebArena self-hosted?

Live websites break, change, and rate-limit. A self-hosted stack can be reset to a known state before each run, which makes evaluation reproducible and lets agents be compared fairly over time.

How are tasks scored?

By functional outcome: database state checks, URL patterns, or exact-answer matches. The path the agent took does not affect the score; only whether the end state is correct.
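
To make outcome-based scoring concrete, here is a minimal Python sketch. It is not WebArena's actual evaluator: the EndState record, the three check helpers, and the example URL and database key are all hypothetical names chosen for illustration. It only shows the idea that a task passes when every check on the final state passes, regardless of how the agent got there.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional
from urllib.parse import urlparse


@dataclass
class EndState:
    """What an evaluator observes once the agent stops (hypothetical shape)."""
    final_url: str
    answer: Optional[str] = None
    db_rows: dict = field(default_factory=dict)  # assumed summary of database state


Check = Callable[[EndState], bool]


def exact_answer(expected: str) -> Check:
    # Passes only if the agent's answer equals the reference,
    # ignoring case and surrounding whitespace.
    def check(state: EndState) -> bool:
        if state.answer is None:
            return False
        return state.answer.strip().lower() == expected.strip().lower()
    return check


def url_matches(expected_path: str) -> Check:
    # Compares only the path component, ignoring a trailing slash.
    def check(state: EndState) -> bool:
        actual = urlparse(state.final_url).path.rstrip("/")
        return actual == expected_path.rstrip("/")
    return check


def db_state(key: str, expected: object) -> Check:
    # Passes if the observed database summary holds the expected value.
    def check(state: EndState) -> bool:
        return state.db_rows.get(key) == expected
    return check


def score(state: EndState, checks: list) -> float:
    # Binary reward: 1.0 only when every check passes. No trajectory is consulted.
    return 1.0 if all(check(state) for check in checks) else 0.0


# Usage with made-up values for the pull-request style of task.
state = EndState(
    final_url="http://localhost:8023/demo/repo/pulls/1/",
    db_rows={"open_pull_requests": 1},
)
checks = [url_matches("/demo/repo/pulls/1"), db_state("open_pull_requests", 1)]
print(score(state, checks))
```

Two agents that reach the same end state by different click sequences receive the same score under this scheme, which is the property the answer above describes.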

What's VisualWebArena?

A follow-up benchmark that adds visually grounded tasks (e.g., 'click the red button shaped like a star'), which require vision-language models rather than text-only HTML parsing.

Sources

  1. WebArena: A Realistic Web Environment (arXiv 2307.13854) — accessed 2026-04-20
  2. WebArena official site — accessed 2026-04-20