Curiosity · Concept

SWE-bench

SWE-bench, introduced by Jimenez et al. (2023), moved code evaluation beyond toy functions toward real software engineering: each task gives the model a full repository at a specific commit, an issue description, and a hidden test suite; the model must generate a patch that makes the tests pass. The original benchmark has 2,294 tasks drawn from 12 popular Python projects. SWE-bench Verified, curated by OpenAI in 2024, is a higher-quality 500-task subset; SWE-bench Multimodal and Multilingual variants extend to images and other languages. It has become the de facto leaderboard for agentic coding systems.
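The shape of a task can be sketched as a small data structure. This is an illustrative sketch, not the official harness: the field names below follow the published dataset columns (`instance_id`, `repo`, `base_commit`, `problem_statement`, `patch`, `test_patch`, plus the FAIL_TO_PASS / PASS_TO_PASS test lists), but the example values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SWEBenchInstance:
    """One SWE-bench task, mirroring the main dataset columns (sketch)."""
    instance_id: str        # e.g. "django__django-11099"
    repo: str               # GitHub repo the issue comes from
    base_commit: str        # commit the repo is checked out at
    problem_statement: str  # the issue text shown to the model
    patch: str              # gold patch from the original PR (hidden from the model)
    test_patch: str         # patch adding the evaluation tests (also hidden)
    fail_to_pass: list[str] = field(default_factory=list)  # tests that must flip to passing
    pass_to_pass: list[str] = field(default_factory=list)  # tests that must keep passing

# Hypothetical instance for illustration only:
task = SWEBenchInstance(
    instance_id="example__repo-123",
    repo="example/repo",
    base_commit="abc123",
    problem_statement="TypeError raised when ...",
    patch="diff --git a/pkg/core.py ...",
    test_patch="diff --git a/tests/test_core.py ...",
    fail_to_pass=["tests/test_core.py::test_bugfix"],
    pass_to_pass=["tests/test_core.py::test_existing"],
)
```

The key point is that the model sees the repository and the problem statement, while the gold patch and the evaluation tests stay hidden until scoring.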

Quick reference

Proficiency
Intermediate
Also known as
Software Engineering Bench
(the name abbreviates Software Engineering Benchmark)
Prerequisites
agents, tool calling

Frequently asked questions

What is SWE-bench?

SWE-bench is a benchmark in which an LLM or agent is given a real GitHub issue and the corresponding repository at a specific commit, and must generate a patch that resolves the issue. Success is judged by running tests taken from the pull request that originally fixed the issue; these tests are withheld from the model.
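The resolution criterion can be summarized in a few lines: a patch counts as resolving the issue only if every FAIL_TO_PASS test now passes and every PASS_TO_PASS test still passes (no regressions). A minimal sketch, assuming test results have already been collected into a dict:

```python
def is_resolved(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Return True only if all FAIL_TO_PASS tests now pass AND all
    PASS_TO_PASS tests still pass. `results` maps test id -> passed;
    a test missing from `results` is treated as failed."""
    return (all(results.get(t, False) for t in fail_to_pass)
            and all(results.get(t, False) for t in pass_to_pass))

# A patch that fixes the bug but breaks an existing test does not count:
print(is_resolved({"t_new": True, "t_old": False}, ["t_new"], ["t_old"]))  # False
```

The two test lists are what makes the metric strict: fixing the reported bug is necessary but not sufficient, because the rest of the suite must survive the patch.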

How is it different from HumanEval or MBPP?

HumanEval and MBPP are self-contained one-function problems solved in a few lines. SWE-bench tasks require navigating a large codebase, understanding existing patterns, and editing multiple files. It's a much closer proxy for real engineering work.
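The contrast in scale is easy to see concretely: an entire HumanEval-style task, including its checks, fits in a dozen lines. The function below paraphrases the style of HumanEval's first problem (the model would write only the body under the docstring), whereas a single SWE-bench instance spans a whole repository.

```python
# A HumanEval-style task: one self-contained function, checked by a few asserts.
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer to each
    other than `threshold`."""
    return any(abs(a - b) < threshold
               for i, a in enumerate(numbers)
               for b in numbers[i + 1:])

assert has_close_elements([1.0, 2.0, 3.9], 0.3) is False
assert has_close_elements([1.0, 2.8, 3.0], 0.3) is True
```

A SWE-bench task has no such closed form: the "problem" is an issue report, and the "solution" is a diff against a codebase the model must first explore.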

What is SWE-bench Verified?

A 500-task subset curated by OpenAI in August 2024 after removing underspecified, flaky, or mislabeled items from the original 2,294-task set. It's the common leaderboard today because scores are more trustworthy.

What scores are competitive in 2026?

Frontier agentic systems (Claude, OpenAI, and open-source harnesses like SWE-agent) are in the 60-80% range on Verified. Check the SWE-bench leaderboard for current numbers — it moves fast.

Sources

  1. Jimenez et al. — SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — accessed 2026-04-20
  2. SWE-bench leaderboard — accessed 2026-04-20