
SWE-bench for Agents: Evaluating Coding Agents

SWE-bench, introduced by Princeton researchers in October 2023, has become the canonical benchmark for autonomous coding agents. An agent is given a real GitHub issue and the repository's codebase, and must produce a patch that (a) resolves the issue, as judged by tests that previously failed and must now pass, and (b) does not break the project's existing test suite. Variants include SWE-bench Verified (500 human-validated tasks), SWE-bench Lite, and SWE-bench Multimodal.
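The pass/fail criterion above can be sketched in a few lines. This is a simplified illustration, not the official harness: the field names loosely follow the public dataset schema (which uses `FAIL_TO_PASS` and `PASS_TO_PASS` test lists), and the instance values shown in comments are made up.

```python
from dataclasses import dataclass

# Simplified sketch of one SWE-bench task instance (illustrative, not the
# exact harness types; field names follow the public dataset schema).
@dataclass
class SWEBenchInstance:
    instance_id: str        # e.g. "django__django-12345" (hypothetical id)
    repo: str               # GitHub repository the issue comes from
    base_commit: str        # commit the agent's patch is applied against
    problem_statement: str  # the GitHub issue text shown to the agent
    fail_to_pass: list[str] # tests that must flip from failing to passing
    pass_to_pass: list[str] # tests that must keep passing (no regressions)

def is_resolved(passed_after: set[str], inst: SWEBenchInstance) -> bool:
    """An instance counts as resolved when every fail-to-pass test now
    passes AND every pass-to-pass test still passes after the patch."""
    return (all(t in passed_after for t in inst.fail_to_pass)
            and all(t in passed_after for t in inst.pass_to_pass))
```

Note that a patch which fixes the issue but breaks an unrelated existing test still scores as unresolved, which is what makes the benchmark a regression check as well as a bug-fix check.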

Protocol facts

Sponsor
Princeton University + OpenAI (Verified)
Status
stable
Spec
https://www.swebench.com/
Interop with
Devin, OpenHands, SWE-agent, Aider

Frequently asked questions

What is SWE-bench Verified?

A 500-task subset curated by OpenAI in 2024 where human annotators confirmed each task is solvable, the tests are reliable, and the issue is clear. It's the preferred leaderboard for serious comparison.

How do agents submit to SWE-bench?

Agents produce a unified diff patch for each instance; the harness applies the patch and runs the project's test suite in Docker. Score = fraction of issues resolved.
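A minimal sketch of that flow, assuming the harness's JSON-lines prediction format (the `instance_id` / `model_name_or_path` / `model_patch` keys follow the harness's documented prediction schema; everything else here, including the helper names, is illustrative):

```python
import json

def write_predictions(path: str, patches: dict[str, str],
                      model: str = "my-agent") -> None:
    """Write one JSON line per instance, each carrying a unified diff.
    `patches` maps instance_id -> diff text; the key names follow the
    SWE-bench prediction format."""
    with open(path, "w") as f:
        for instance_id, diff in patches.items():
            f.write(json.dumps({
                "instance_id": instance_id,
                "model_name_or_path": model,
                "model_patch": diff,  # unified diff applied by the harness
            }) + "\n")

def score(results: dict[str, bool]) -> float:
    """Final score = fraction of issues resolved.
    `results` maps instance_id -> whether the harness marked it resolved."""
    return sum(results.values()) / len(results) if results else 0.0
```

The harness itself then applies each `model_patch` inside a per-instance Docker image and reruns the project's tests; the agent never runs the grading tests directly.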

Is SWE-bench still informative as scores rise?

Top agents now exceed 70% on Verified, so discriminating power drops. SWE-bench Multimodal and harder splits like SWE-bench Pro have been introduced to keep the signal strong.

Sources

  1. SWE-bench (arXiv 2310.06770) — accessed 2026-04-20
  2. SWE-bench official site — accessed 2026-04-20