MLE-Bench — Machine Learning Engineering Benchmark

MLE-Bench puts an agent in front of 75 real Kaggle competitions and scores it the way Kaggle scores human competitors: by leaderboard position. It is one of the most demanding agent benchmarks because it requires long-horizon planning, code execution, error recovery, and genuine ML judgment, not just retrieval.

Protocol facts

Sponsor
OpenAI
Status
stable
Spec
https://github.com/openai/mle-bench
Interop with
OpenAI Assistants, Code Interpreter, Docker sandboxes

Frequently asked questions

What makes MLE-Bench hard?

Each competition requires hours of iterative work: understanding the problem, exploring the data, trying multiple models, debugging errors, and beating a strong human baseline. This exposes agents' weaknesses in long-horizon reasoning and persistence.
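The iterate-evaluate-debug loop described above can be sketched as follows. This is a minimal illustration, not MLE-Bench's harness; the candidate names, scores, and the `train_candidate` helper are hypothetical stand-ins.

```python
# Sketch of the iterate-and-keep-best loop an MLE-Bench-style agent runs.
# Candidate names and their validation scores are hypothetical.

def train_candidate(name: str) -> float:
    """Stand-in for training a model and returning its validation score."""
    hypothetical_scores = {"baseline": 0.71, "gbm": 0.78, "ensemble": 0.81}
    return hypothetical_scores[name]

def solve(candidates: list[str], human_baseline: float) -> tuple[str, float]:
    """Try candidates in order, keep the best, stop once the baseline is beaten."""
    best_name, best_score = "", float("-inf")
    for name in candidates:
        try:
            score = train_candidate(name)
        except Exception:
            continue  # error recovery: one failed run should not end the attempt
        if score > best_score:
            best_name, best_score = name, score
        if best_score > human_baseline:
            break  # strong enough to submit
    return best_name, best_score

print(solve(["baseline", "gbm", "ensemble"], human_baseline=0.80))
# ('ensemble', 0.81)
```

The loop's two key behaviors match the FAQ answer: failed attempts are caught rather than fatal, and iteration continues until the human baseline is beaten or the candidates run out.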

How are scores reported?

MLE-Bench reports a medal rate: the percentage of competitions in which the agent's submission would earn a bronze, silver, or gold medal under the real Kaggle leaderboard thresholds.
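A rough sketch of that metric, under simplifying assumptions: each competition contributes a pass/fail based on whether the agent's score reaches that competition's bronze threshold. The scores below are made up, and real Kaggle thresholds depend on team count and metric direction (some metrics are lower-is-better).

```python
# Simplified medal-rate computation. Each pair is
# (agent_score, bronze_threshold) for one competition, higher-is-better.
# All numbers here are hypothetical.

def medal_rate(results: list[tuple[float, float]]) -> float:
    """Fraction of competitions where the agent's score reaches the medal threshold."""
    medals = sum(1 for score, threshold in results if score >= threshold)
    return medals / len(results)

runs = [(0.91, 0.88), (0.74, 0.80), (0.66, 0.65), (0.52, 0.60)]
print(medal_rate(runs))  # 0.5  (2 of 4 competitions medaled)
```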

Which agents do best?

Agents with strong code execution, retrieval of past solutions, and long-context planning (GPT-5, Claude, frontier open-weight models) score highest, but medal rates in the 20–30% range remain common as of 2026.

Sources

  1. MLE-Bench GitHub repository (accessed 2026-04-20)
  2. MLE-Bench paper (accessed 2026-04-20)