
WebArena: Realistic Web-Agent Benchmark

WebArena, released by researchers at Carnegie Mellon University in 2023, is a self-hostable benchmark environment built around four fully functional websites: an e-commerce store (OneStopShop), a Reddit-style forum, a GitLab instance, and a Magento-based store-administration CMS. It also ships a map service and reference resources such as an offline copy of Wikipedia. Agents receive natural-language instructions like 'create a pull request that fixes the typo in README' and are scored on functional outcome, not trajectory. It is one of the most widely used benchmarks for realistic web tasks.

Protocol facts

Sponsor: Carnegie Mellon University
Status: stable
Spec: https://webarena.dev/
Interop with: AgentBench, OSWorld, browser-use

Frequently asked questions

Why is WebArena self-hosted?

Live websites break, change, and rate-limit. A self-hosted stack gives deterministic, reproducible evaluation — critical for comparing agents fairly over time.

How are tasks scored?

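By functional outcome. Depending on the task, the evaluator compares the agent's final answer against a reference answer, checks the final URL against an expected pattern, or inspects the resulting site state (for example, database or page contents). Trajectory doesn't matter; a task passes only if the end state is correct.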
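
As a rough illustration of outcome-based scoring, the sketch below grades an episode only from what the agent left behind. It is not WebArena's actual evaluator: the `Task`, `Outcome`, and `score` names, the URL pattern, and the `open_requests` state key are all hypothetical, chosen only to mirror the three kinds of check described above.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Outcome:
    """What the agent left behind (hypothetical shape, no trajectory stored)."""
    answer: str = ""                            # final text answer, if any
    final_url: str = ""                         # page the agent ended on
    state: dict = field(default_factory=dict)   # snapshot of relevant site state


@dataclass
class Task:
    """A task with optional checks; only the ones that are set get applied."""
    intent: str
    expected_answer: Optional[str] = None
    url_pattern: Optional[str] = None
    state_check: Optional[Callable[[dict], bool]] = None


def score(task: Task, outcome: Outcome) -> float:
    """Return 1.0 if every configured check passes, otherwise 0.0."""
    checks = []
    if task.expected_answer is not None:
        # Case-insensitive exact match on the final answer.
        expected = task.expected_answer.strip().lower()
        checks.append(outcome.answer.strip().lower() == expected)
    if task.url_pattern is not None:
        # The whole final URL must match the expected pattern.
        checks.append(re.fullmatch(task.url_pattern, outcome.final_url) is not None)
    if task.state_check is not None:
        # Arbitrary predicate over the end state of the site.
        checks.append(task.state_check(outcome.state))
    return 1.0 if checks and all(checks) else 0.0


# Hypothetical task: the URL pattern and state key are assumptions.
task = Task(
    intent="create a pull request that fixes the typo in README",
    url_pattern=r".*/(pulls|merge_requests)/\d+",
    state_check=lambda s: s.get("open_requests", 0) >= 1,
)
done = Outcome(
    final_url="http://localhost:8023/demo/repo/-/merge_requests/1",
    state={"open_requests": 1},
)
not_done = Outcome(final_url="http://localhost:8023/demo/repo", state={})

print(score(task, done))      # 1.0
print(score(task, not_done))  # 0.0
```

Because `Outcome` carries no record of the steps taken, two agents that reach the same end state by different routes receive the same score.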

What's VisualWebArena?

A follow-up benchmark that adds visually grounded tasks (e.g., 'click the red button shaped like a star'), which require vision-language models rather than text-only HTML parsing.

Sources

WebArena: A Realistic Web Environment (arXiv 2307.13854) — accessed 2026-04-20
WebArena official site — accessed 2026-04-20
