MLE-Bench — Machine Learning Engineering Benchmark
MLE-Bench puts an agent in front of 75 real Kaggle competitions and scores it the same way Kaggle does — by leaderboard rank. It's one of the most demanding agent benchmarks because it requires long-horizon planning, code execution, error recovery, and genuine ML judgment, not just retrieval.
Protocol facts
- Sponsor: OpenAI
- Status: stable
- Spec: https://github.com/openai/mle-bench
- Interop with: OpenAI Assistants, Code Interpreter, Docker sandboxes
Frequently asked questions
What makes MLE-Bench hard?
Each competition requires hours of iterative work — understand the problem, explore the data, try multiple models, debug errors, and beat a strong human baseline. It exposes agents' weaknesses in long-horizon reasoning and persistence.
How are scores reported?
MLE-Bench reports medal rate: the percentage of competitions in which the agent's submission would earn a bronze, silver, or gold medal under the real Kaggle leaderboard thresholds.
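The aggregate metric is simple to compute once each competition has been graded against its leaderboard thresholds. A minimal sketch, assuming per-competition outcomes are available as medal labels (the dict shape and values here are illustrative, not the benchmark's actual output format):

```python
# Hypothetical per-competition outcomes: the medal the agent's best
# submission would earn on the real Kaggle leaderboard, or None.
results = {
    "comp_a": "gold",
    "comp_b": None,      # no medal
    "comp_c": "bronze",
}

def medal_rate(results):
    """Fraction of competitions earning any medal (bronze, silver, or gold)."""
    medals = sum(1 for m in results.values() if m in {"bronze", "silver", "gold"})
    return medals / len(results)

print(f"{medal_rate(results):.1%}")  # prints 66.7%
```

Because a medal requires clearing a rank threshold set by thousands of human competitors, even a small medal-rate improvement reflects a real capability gain rather than metric noise.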
Which agents do best?
Agents with strong code execution, retrieval of past solutions, and long-context planning (GPT-5, Claude, frontier open-weight models) score highest — but medal rates in the 20–30% range remain common as of 2026.
Sources
- MLE-Bench GitHub — accessed 2026-04-20
- MLE-Bench paper — accessed 2026-04-20