Curiosity · Concept

MMLU (Massive Multitask Language Understanding)

Released by Hendrycks et al. in 2020, MMLU became the dominant 'general knowledge' benchmark of the GPT-3 era and remains a headline number in most model announcements. Subjects span the humanities, STEM, social sciences, and specialized professional areas (US Medical Licensing Exam topics, law, accounting). Because questions are multiple-choice, scoring is simple and reproducible, but the benchmark has drawn criticism for labeling errors, saturation at the top (frontier models now exceed 90%), and data contamination. A harder successor, MMLU-Pro (2024), expands each question from four to ten answer options, adds more reasoning-heavy items, and removes noisy ones.

Quick reference

Proficiency
Beginner
Also known as
Massive Multitask Language Understanding
Prerequisites
LLM basics

Frequently asked questions

What is MMLU?

MMLU is a multiple-choice benchmark covering 57 subjects from elementary math to professional law and medicine, designed to test an LLM's broad academic and professional knowledge. Its test split contains about 14,000 four-option questions (close to 16,000 across all splits).
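The four-option format can be sketched in a few lines. This is a minimal illustration of how an MMLU-style item is typically rendered into a prompt; the question text and the exact prompt wording here are invented for illustration, not taken from the benchmark itself.

```python
# Sketch: rendering one MMLU-style four-option item as a prompt.
# The question below is an invented example, not an actual MMLU item.

CHOICE_LABELS = ["A", "B", "C", "D"]

def format_mmlu_prompt(question: str, choices: list[str], subject: str) -> str:
    """Render a single four-option question in a common MMLU-style layout."""
    lines = [
        f"The following is a multiple choice question about {subject}.",
        "",
        question,
    ]
    for label, choice in zip(CHOICE_LABELS, choices):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

prompt = format_mmlu_prompt(
    "Which gas makes up most of Earth's atmosphere?",  # invented item
    ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"],
    "high school geography",
)
print(prompt)
```

The model is then scored on whether its continuation picks the correct letter (here, B). Few-shot evaluation simply prepends several solved examples from the same subject in this layout.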

How is it scored?

Accuracy: the percentage of questions the model answers correctly. It is typically reported zero-shot or 5-shot (five in-context examples per subject). The headline number is the unweighted (macro) average over all 57 subjects.
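The macro average described above can be computed in a few lines. This is a sketch, and the per-question results below are made-up data, not real benchmark output:

```python
# Sketch of MMLU scoring: compute accuracy per subject, then take the
# unweighted (macro) average across subjects. Results data is invented.
from collections import defaultdict

def macro_accuracy(results):
    """results: iterable of (subject, is_correct) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, ok in results:
        total[subject] += 1
        correct[subject] += int(ok)
    per_subject = {s: correct[s] / total[s] for s in total}
    macro = sum(per_subject.values()) / len(per_subject)
    return macro, per_subject

results = [
    ("anatomy", True), ("anatomy", False),          # 50% on anatomy
    ("formal_logic", True), ("formal_logic", True), # 100% on formal_logic
]
macro, per_subject = macro_accuracy(results)
print(macro)  # 0.75
```

Note that the macro average weights every subject equally regardless of its question count; a micro average over all test questions would instead weight large subjects more heavily, so the two can differ slightly.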

Why do people distrust high MMLU scores now?

Data contamination (benchmark questions leak into training corpora), label errors in a few percent of items, and saturation at the top: when the best models all score above 90%, the benchmark barely discriminates between them. MMLU-Pro cleans up noisy items and raises the difficulty.

What should I use alongside MMLU?

Task-specific benchmarks (HumanEval for code, GSM8K / MATH for math, SWE-bench for software engineering), preference-based comparisons like Chatbot Arena, and — most importantly — your own evals on your actual product traffic.

Sources

  1. Hendrycks et al. — Measuring Massive Multitask Language Understanding — accessed 2026-04-20
  2. Wang et al. — MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark — accessed 2026-04-20