Capability · Framework — evals

Stanford HELM

HELM (Holistic Evaluation of Language Models) established the 'leaderboard-but-with-ethics-dimensions' template: every model run is scored across many scenarios and many metrics — not just accuracy but also calibration, robustness, fairness, toxicity, and efficiency. The HELM Lite, HELM Instruct, and HELM Classic leaderboards report scores for hundreds of models from public and private labs.

Framework facts

Category
evals
Language
Python
License
Apache-2.0
Repository
https://github.com/stanford-crfm/helm

Install

pip install crfm-helm

Quickstart

# run a small local eval (10 instances per scenario), then browse the results
helm-run --run-entries mmlu:subject=abstract_algebra,model=openai/gpt-4o-mini \
         --suite my-suite --max-eval-instances 10
helm-summarize --suite my-suite
helm-server --suite my-suite   # browse results at http://localhost:8000

Alternatives

  • OpenCompass — comparable large-scale suite led by Shanghai AI Lab
  • lm-eval-harness — EleutherAI's lighter-weight de facto standard
  • BIG-bench / BBH — focused suites of hard reasoning tasks
  • MLCommons AI Benchmarks — industry-consortium benchmarks

Frequently asked questions

What does 'holistic' mean here?

It means reporting a full matrix of metrics per scenario — not just accuracy — so you can see, e.g., that a model is accurate but poorly calibrated, or robust but slow.
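The idea can be sketched in a few lines. This is a toy illustration only, not the HELM API: the predictions, labels, and the `calibration_gap` metric below are made-up examples of how a per-scenario metric matrix can expose a model that is accurate yet overconfident.

```python
# Toy sketch (not HELM code): a "holistic" report is a scenario x metric
# matrix, so one model can score well on one axis and poorly on another.
def accuracy(preds, labels):
    # preds are (predicted_label, confidence) pairs
    return sum(p == y for (p, _), y in zip(preds, labels)) / len(labels)

def avg_confidence(preds):
    return sum(c for _, c in preds) / len(preds)

# Made-up predictions for one scenario
preds = [("A", 0.99), ("B", 0.98), ("A", 0.97), ("C", 0.99)]
labels = ["A", "B", "A", "B"]

acc = accuracy(preds, labels)    # 0.75: reasonably accurate
conf = avg_confidence(preds)     # ~0.98: but badly overconfident
report = {"mmlu": {"accuracy": acc, "calibration_gap": conf - acc}}
print(report)
```

A real HELM report has many more rows (scenarios) and columns (metrics), but the reading is the same: each cell is one metric on one scenario, and no single number summarizes the model.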

How expensive is running HELM?

A full HELM Classic run on a frontier model can cost thousands of dollars in API credits. HELM Lite is the affordable subset most teams use, and `helm-run` lets you scope a run further with `--max-eval-instances`.
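Cost scales roughly linearly with the number of eval instances, which is why `--max-eval-instances` is the main cost knob. A back-of-envelope sketch, where the instance counts, tokens per instance, and per-token price are all assumptions for illustration, not HELM's or any provider's actual numbers:

```python
# Rough cost model (all constants below are assumptions, not real prices):
# total cost ≈ instances x tokens per instance x price per token.
def est_cost_usd(instances, tokens_per_instance, usd_per_1m_tokens):
    return instances * tokens_per_instance * usd_per_1m_tokens / 1_000_000

# Hypothetical large run vs. a run scoped with --max-eval-instances 10
full = est_cost_usd(instances=10_000, tokens_per_instance=2_000, usd_per_1m_tokens=5.0)
scoped = est_cost_usd(instances=10, tokens_per_instance=2_000, usd_per_1m_tokens=5.0)
print(f"large run: ${full:.2f}, scoped run: ${scoped:.4f}")
```

Plugging in your own provider's pricing and a realistic token count per instance gives a quick sanity check before launching a suite.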

Sources

  1. HELM website — accessed 2026-04-20
  2. HELM GitHub — accessed 2026-04-20