HumanEval+ / EvalPlus
EvalPlus is a research project that extends OpenAI's HumanEval and MBPP into the stricter HumanEval+ and MBPP+ benchmarks by auto-generating hundreds of additional test cases per problem via mutation-based, type-aware fuzzing. The result is a far harder evaluation: many models that score highly on plain HumanEval drop 10-20 pass@1 points on HumanEval+, and it has become a default benchmark for serious code-LLM comparisons.
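To make the idea concrete, here is a minimal sketch of type-aware mutation fuzzing: mutate seed inputs while preserving their types, then check a candidate solution against a trusted reference. The `mutate` and `fuzz` helpers are hypothetical illustrations, not EvalPlus's actual implementation.

```python
import random

def mutate(value, rng):
    """Produce a type-preserving mutant of a seed input."""
    if isinstance(value, bool):          # check bool before int: bool subclasses int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1, rng.randint(-100, 100)])
    if isinstance(value, float):
        return value * rng.choice([0.5, 2.0, -1.0])
    if isinstance(value, str):
        return value + rng.choice([" ", "a", "z"])
    if isinstance(value, list):
        return [mutate(v, rng) for v in value]
    return value

def fuzz(reference, candidate, seed_args, rounds=200, rng=None):
    """Run candidate vs. reference on mutated inputs; return a counterexample or None."""
    rng = rng or random.Random(0)
    for _ in range(rounds):
        args = [mutate(a, rng) for a in rng.choice(seed_args)]
        if candidate(*args) != reference(*args):
            return args
    return None
```

A candidate that merely echoes its input would survive a tiny hand-written suite but is caught within a few mutated rounds when compared against a correct `sorted` reference.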
Framework facts
- Category: evals
- Language: Python
- License: Apache-2.0
- Repository: https://github.com/evalplus/evalplus
Install
pip install --upgrade evalplus
Quickstart
evalplus.evaluate \
--dataset humaneval \
--samples samples.jsonl \
--base-only false
Alternatives
- LiveCodeBench — contamination-resistant
- BigCodeBench
- SWE-bench — repo-scale coding
Frequently asked questions
Why not just use HumanEval?
Original HumanEval averages only about eight test cases per problem, so models can pass by generating code that looks right but has subtle bugs. EvalPlus fuzzes hundreds of additional inputs per problem, closing that gap.
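A toy illustration of the failure mode: a hypothetical digit-sum task where the buggy solution passes a narrow, positives-only suite but breaks on a fuzzed negative input, exactly the kind of case the extra tests add.

```python
def digit_sum(n):
    # Bug: str(-12) contains "-", which int() cannot parse,
    # so negative inputs raise ValueError.
    return sum(int(c) for c in str(n))

# Weak "base" suite: positive inputs only, so the bug goes unnoticed.
base_tests = [(0, 0), (7, 7), (123, 6)]
assert all(digit_sum(n) == want for n, want in base_tests)
```

A fuzzed call like `digit_sum(-12)` immediately exposes the bug.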
Is HumanEval+ still useful in 2026?
For pretraining and small-model research, yes. Frontier coding agents have saturated it — use SWE-bench Verified or LiveCodeBench for higher-ceiling comparisons.
Sources
- EvalPlus GitHub — accessed 2026-04-20
- EvalPlus paper (NeurIPS 2023) — accessed 2026-04-20