Curiosity · Concept

SWE-bench

SWE-bench, introduced by Jimenez et al. (2023), moved code evaluation beyond toy functions toward real software engineering: each task gives the model a full repository at a specific commit, an issue description, and a hidden test suite; the model must generate a patch that makes the tests pass. The original benchmark has 2,294 tasks drawn from 12 popular Python projects. SWE-bench Verified, curated by OpenAI in 2024, is a higher-quality 500-task subset; SWE-bench Multimodal and Multilingual variants extend to images and other languages. It has become the de facto leaderboard for agentic coding systems.
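The shape of a task can be sketched as a small data structure. This is an illustrative sketch, not the official harness: the field names below follow the published dataset columns (`instance_id`, `repo`, `base_commit`, `problem_statement`, `patch`, `test_patch`, plus the FAIL_TO_PASS / PASS_TO_PASS test lists), but the example values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SWEBenchInstance:
    """One SWE-bench task, mirroring the main dataset columns (sketch)."""
    instance_id: str        # e.g. "django__django-11099"
    repo: str               # GitHub repo the issue comes from
    base_commit: str        # commit the repo is checked out at
    problem_statement: str  # the issue text shown to the model
    patch: str              # gold patch from the original PR (hidden from the model)
    test_patch: str         # patch adding the evaluation tests (also hidden)
    fail_to_pass: list[str] = field(default_factory=list)  # tests that must flip to passing
    pass_to_pass: list[str] = field(default_factory=list)  # tests that must keep passing

# Hypothetical instance for illustration only:
task = SWEBenchInstance(
    instance_id="example__repo-123",
    repo="example/repo",
    base_commit="abc123",
    problem_statement="TypeError raised when ...",
    patch="diff --git a/pkg/core.py ...",
    test_patch="diff --git a/tests/test_core.py ...",
    fail_to_pass=["tests/test_core.py::test_bugfix"],
    pass_to_pass=["tests/test_core.py::test_existing"],
)
```

The key point is that the model sees the repository and the problem statement, while the gold patch and the evaluation tests stay hidden until scoring.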

Quick reference

Proficiency
Intermediate
Also known as
Software Engineering Bench
(the name abbreviates Software Engineering Benchmark)
Prerequisites
agents, tool calling

Frequently asked questions

What is SWE-bench?

SWE-bench is a benchmark in which an LLM or agent is given a real GitHub issue and the corresponding repository at a specific commit, and must generate a patch that resolves the issue. Success is judged by running tests taken from the pull request that originally fixed the issue; these tests are withheld from the model.
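The resolution criterion can be summarized in a few lines: a patch counts as resolving the issue only if every FAIL_TO_PASS test now passes and every PASS_TO_PASS test still passes (no regressions). A minimal sketch, assuming test results have already been collected into a dict:

```python
def is_resolved(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Return True only if all FAIL_TO_PASS tests now pass AND all
    PASS_TO_PASS tests still pass. `results` maps test id -> passed;
    a test missing from `results` is treated as failed."""
    return (all(results.get(t, False) for t in fail_to_pass)
            and all(results.get(t, False) for t in pass_to_pass))

# A patch that fixes the bug but breaks an existing test does not count:
print(is_resolved({"t_new": True, "t_old": False}, ["t_new"], ["t_old"]))  # False
```

The two test lists are what makes the metric strict: fixing the reported bug is necessary but not sufficient, because the rest of the suite must survive the patch.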

How is it different from HumanEval or MBPP?

HumanEval and MBPP are self-contained one-function problems solved in a few lines. SWE-bench tasks require navigating a large codebase, understanding existing patterns, and editing multiple files. It's a much closer proxy for real engineering work.
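The contrast in scale is easy to see concretely: an entire HumanEval-style task, including its checks, fits in a dozen lines. The function below paraphrases the style of HumanEval's first problem (the model would write only the body under the docstring), whereas a single SWE-bench instance spans a whole repository.

```python
# A HumanEval-style task: one self-contained function, checked by a few asserts.
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer to each
    other than `threshold`."""
    return any(abs(a - b) < threshold
               for i, a in enumerate(numbers)
               for b in numbers[i + 1:])

assert has_close_elements([1.0, 2.0, 3.9], 0.3) is False
assert has_close_elements([1.0, 2.8, 3.0], 0.3) is True
```

A SWE-bench task has no such closed form: the "problem" is an issue report, and the "solution" is a diff against a codebase the model must first explore.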

What is SWE-bench Verified?

A 500-task subset curated by OpenAI in August 2024 after removing underspecified, flaky, or mislabeled items from the original 2,294-task set. It's the common leaderboard today because scores are more trustworthy.

What scores are competitive in 2026?

Frontier agentic systems (Claude, OpenAI, and open-source harnesses like SWE-agent) are in the 60-80% range on Verified. Check the SWE-bench leaderboard for current numbers — it moves fast.

Sources

  1. Jimenez et al. — SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — accessed 2026-04-20
  2. SWE-bench leaderboard — accessed 2026-04-20