OSWorld: Real Operating System Agent Benchmark

OSWorld, from a consortium including HKU, Salesforce, and CMU, provides 369 real computer tasks across Ubuntu, Windows, and macOS that agents must complete by controlling a full desktop — opening apps, editing files, configuring settings. Unlike web-only benchmarks, OSWorld tests cross-application workflows and surfaces the brittleness of current computer-use agents.
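The control loop such agents run can be sketched in a few lines: observe the desktop, pick an action, apply it, repeat. The sketch below is illustrative only — `StubDesktopEnv` and its methods are assumptions standing in for a real VM interface, not OSWorld's actual API.

```python
# Toy observe-act loop in the spirit of a computer-use agent driving a desktop.
# StubDesktopEnv is a stand-in (assumption): a real environment would return
# screenshots of a full OS and execute mouse/keyboard actions inside a VM.

class StubDesktopEnv:
    """Toy environment whose 'task' completes after a TYPE action."""
    def __init__(self):
        self.done = False

    def screenshot(self) -> bytes:
        # A real env returns a PNG of the current desktop.
        return b"\x89PNG..."

    def step(self, action: dict) -> None:
        # A real env would dispatch CLICK / TYPE / HOTKEY to the OS.
        if action["type"] == "TYPE":
            self.done = True

def run_agent(env, max_steps: int = 5) -> bool:
    """Observe-act until the environment reports completion or steps run out."""
    for _ in range(max_steps):
        obs = env.screenshot()                       # observe full desktop
        action = {"type": "TYPE", "text": "hello"}   # a policy would choose this
        env.step(action)                             # act on the OS
        if env.done:
            return True
    return False

print(run_agent(StubDesktopEnv()))  # True
```

The point of the loop is that the agent only sees raw OS state (pixels) and only emits low-level input events, which is what makes cross-application workflows hard.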

Protocol facts

Sponsor
HKU + Salesforce + CMU
Status
stable
Spec
https://os-world.github.io/
Interop with
Anthropic Computer Use, WebArena, GAIA

Frequently asked questions

How is OSWorld graded?

Each task ships with an execution-based grader that inspects the post-task OS state — file contents, installed apps, clipboard, config files — to verify the goal was actually achieved.
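A minimal sketch of what "execution-based" means, assuming a simple file-contents task: the checker never looks at the agent's actions, only at the resulting OS state. The function name and the example task are illustrative assumptions, not OSWorld's actual checker API.

```python
import tempfile
from pathlib import Path

def grade_file_contents(path: str, expected: str) -> bool:
    """Execution-based check: inspect the file the task should have produced,
    returning True only if it exists and its stripped contents match."""
    p = Path(path).expanduser()
    return p.is_file() and p.read_text().strip() == expected

# Usage: after the agent stops, the harness runs checkers like this inside
# the VM. Here we simulate a task that should write "hello" to a file.
report = Path(tempfile.mkdtemp()) / "report.txt"
report.write_text("hello\n")
print(grade_file_contents(str(report), "hello"))  # True
```

Because the grader inspects final state rather than action traces, any sequence of clicks and keystrokes that produces the right file passes.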

Which agents perform best on OSWorld?

As of 2026, Anthropic's Computer Use (Sonnet 4.6+) and OpenAI's computer-using agents lead, but scores remain well below human performance — the benchmark is far from saturated.

Is OSWorld open source?

Yes — the task suite, VM images, and grading harness are published on GitHub under an Apache-2.0 license.

Sources

  1. OSWorld (arXiv 2404.07972) — accessed 2026-04-20
  2. OSWorld project page — accessed 2026-04-20