HumanEval+ / EvalPlus
EvalPlus is a research project that extends OpenAI's HumanEval and MBPP into the stricter HumanEval+ and MBPP+ benchmarks by auto-generating hundreds of additional test cases per problem via mutation-based, type-aware fuzzing. The result is a far harder evaluation: many models that score highly on plain HumanEval drop 10-20 pass@1 points on HumanEval+, and it has become a default benchmark for serious code-LLM comparisons.
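To make the idea concrete, here is a minimal sketch of type-aware mutation fuzzing: mutate seed inputs while preserving their types, then check a candidate solution against a trusted reference. The `mutate` and `fuzz` helpers are hypothetical illustrations, not EvalPlus's actual implementation.

```python
import random

def mutate(value, rng):
    """Produce a type-preserving mutant of a seed input."""
    if isinstance(value, bool):          # check bool before int: bool subclasses int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1, rng.randint(-100, 100)])
    if isinstance(value, float):
        return value * rng.choice([0.5, 2.0, -1.0])
    if isinstance(value, str):
        return value + rng.choice([" ", "a", "z"])
    if isinstance(value, list):
        return [mutate(v, rng) for v in value]
    return value

def fuzz(reference, candidate, seed_args, rounds=200, rng=None):
    """Run candidate vs. reference on mutated inputs; return a counterexample or None."""
    rng = rng or random.Random(0)
    for _ in range(rounds):
        args = [mutate(a, rng) for a in rng.choice(seed_args)]
        if candidate(*args) != reference(*args):
            return args
    return None
```

A candidate that merely echoes its input would survive a tiny hand-written suite but is caught within a few mutated rounds when compared against a correct `sorted` reference.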
Framework facts
- Category: evals
- Language: Python
- License: Apache-2.0
- Repository: https://github.com/evalplus/evalplus
Install
pip install --upgrade evalplus
Quickstart
evalplus.evaluate \
--dataset humaneval \
--samples samples.jsonl \
--base-only false
Alternatives
- LiveCodeBench — contamination-resistant
- BigCodeBench
- SWE-bench — repo-scale coding
Frequently asked questions
Why not just use HumanEval?
Original HumanEval averages only about eight test cases per problem, so models can pass by generating code that looks right but has subtle bugs. EvalPlus fuzzes hundreds of additional inputs per problem, closing that gap.
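A toy illustration of the failure mode: a hypothetical digit-sum task where the buggy solution passes a narrow, positives-only suite but breaks on a fuzzed negative input, exactly the kind of case the extra tests add.

```python
def digit_sum(n):
    # Bug: str(-12) contains "-", which int() cannot parse,
    # so negative inputs raise ValueError.
    return sum(int(c) for c in str(n))

# Weak "base" suite: positive inputs only, so the bug goes unnoticed.
base_tests = [(0, 0), (7, 7), (123, 6)]
assert all(digit_sum(n) == want for n, want in base_tests)
```

A fuzzed call like `digit_sum(-12)` immediately exposes the bug.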
Is HumanEval+ still useful in 2026?
For pretraining and small-model research, yes. Frontier coding agents have saturated it — use SWE-bench Verified or LiveCodeBench for higher-ceiling comparisons.
Sources
- EvalPlus GitHub — accessed 2026-04-20
- EvalPlus paper (NeurIPS 2023) — accessed 2026-04-20