OSWorld: Real Operating System Agent Benchmark
OSWorld, from a consortium including HKU, Salesforce, and CMU, provides 369 real computer tasks across Ubuntu, Windows, and macOS that agents must complete by controlling a full desktop — opening apps, editing files, configuring settings. Unlike web-only benchmarks, OSWorld tests cross-application workflows and surfaces the brittleness of current computer-use agents.
Protocol facts
- Sponsor: HKU + Salesforce + CMU
- Status: stable
- Spec: https://os-world.github.io/
- Interop with: Anthropic Computer Use, WebArena, GAIA
Frequently asked questions
How is OSWorld graded?
Each task ships with an execution-based grader that inspects the post-task OS state — file contents, installed apps, clipboard, config files — to verify the goal was actually achieved.
Which agents perform best on OSWorld?
As of 2026, Anthropic's Computer Use (Sonnet 4.6+) and OpenAI's computer-using agents lead, but scores remain well below human performance — the benchmark is far from saturated.
Is OSWorld open source?
Yes — the task suite, VM images, and grading harness are published on GitHub under an Apache-2.0 license.
Sources
- OSWorld (arXiv 2404.07972) — accessed 2026-04-20
- OSWorld project page — accessed 2026-04-20