SWE-bench
SWE-bench, introduced by Jimenez et al. (2023), moved code evaluation beyond toy functions toward real software engineering: each task pairs a real GitHub issue with the full repository checked out at a specific commit, and the model must generate a patch that resolves the issue. The patch is graded against a hidden test suite drawn from the pull request that originally fixed the issue. The original benchmark has 2,294 tasks from 12 popular Python projects. SWE-bench Verified, curated by OpenAI in 2024, is a higher-quality 500-task subset; the Multimodal and Multilingual variants extend the format to issues with images and to languages beyond Python. SWE-bench has become the de facto leaderboard for agentic coding systems.
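The dataset is published on the Hugging Face Hub; assuming its documented schema, a single task instance looks roughly like the sketch below. Field names follow the released dataset; all values are illustrative placeholders, not a real instance.

```python
# Hedged sketch of one SWE-bench task instance. Field names follow the
# published dataset schema; the values below are illustrative placeholders.
task = {
    "instance_id": "example__repo-1234",          # "<org>__<repo>-<PR number>"
    "repo": "example/repo",                       # GitHub project the issue comes from
    "base_commit": "<40-char commit sha>",        # repo state before the fix
    "problem_statement": "Bug: widget crashes when ...",   # the issue text
    "patch": "diff --git a/src/widget.py ...",    # gold fix from the PR (held out)
    "test_patch": "diff --git a/tests/ ...",      # tests added by the fixing PR
    "FAIL_TO_PASS": ["test_widget_no_crash"],     # tests the fix must make pass
    "PASS_TO_PASS": ["test_widget_render"],       # tests that must keep passing
}

# The model only sees the repo, commit, and problem statement; the gold
# patch and the test patches stay hidden and are used for grading.
visible = {k: task[k] for k in ("instance_id", "repo", "base_commit", "problem_statement")}
print(sorted(visible))
```

The split between visible fields and held-out grading fields is what makes the benchmark agentic: the model must explore the repository itself rather than pattern-match against provided tests.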
Quick reference
- Proficiency: Intermediate
- Also known as: Software Engineering Bench
- Prerequisites: agents, tool calling
Frequently asked questions
What is SWE-bench?
SWE-bench is a benchmark in which an LLM or agent is given a real GitHub issue and the corresponding repository at a specific commit, and must generate a patch that resolves the issue. Success is measured by a hidden test suite taken from the pull request that originally fixed the issue.
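The grading rule behind "success" can be sketched in a few lines. This is a minimal illustration of the resolution criterion as commonly described (all fail-to-pass tests now pass, all pass-to-pass tests still pass), not the official harness; the function and test names are hypothetical.

```python
# Hedged sketch of SWE-bench's resolution criterion: a patch counts as
# resolved only if every FAIL_TO_PASS test now passes (the bug is fixed)
# and every PASS_TO_PASS test still passes (nothing regressed).
def is_resolved(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    # A test missing from the results (e.g. it errored out) counts as failed.
    return all(results.get(t, False) for t in fail_to_pass + pass_to_pass)

# The fix works and nothing regresses: resolved.
print(is_resolved({"test_fix": True, "test_old": True}, ["test_fix"], ["test_old"]))   # True

# The fix works but an existing test breaks: not resolved.
print(is_resolved({"test_fix": True, "test_old": False}, ["test_fix"], ["test_old"]))  # False
```

Requiring the pass-to-pass set as well is what penalizes patches that "fix" the issue by deleting or breaking existing behavior.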
How is it different from HumanEval or MBPP?
HumanEval and MBPP are self-contained one-function problems solved in a few lines. SWE-bench tasks require navigating a large codebase, understanding existing patterns, and editing multiple files. It's a much closer proxy for real engineering work.
What is SWE-bench Verified?
A 500-task subset curated by OpenAI in August 2024 after removing underspecified, flaky, or mislabeled items from the original 2,294-task set. It's the common leaderboard today because scores are more trustworthy.
What scores are competitive in 2026?
Frontier agentic systems (Claude, OpenAI, and open-source harnesses like SWE-agent) are in the 60-80% range on Verified. Check the SWE-bench leaderboard for current numbers — it moves fast.
Sources
- Jimenez et al. — SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — accessed 2026-04-20
- SWE-bench leaderboard — accessed 2026-04-20