Capability · Framework — evals

Stanford HELM

HELM (Holistic Evaluation of Language Models) established the 'leaderboard-but-with-ethics-dimensions' template: every model run is scored across many scenarios and many metrics — not just accuracy but also calibration, robustness, fairness, toxicity, and efficiency. The HELM Lite, HELM Instruct, and HELM Classic leaderboards report scores for hundreds of models from public and private labs.

Framework facts

Category
evals
Language
Python
License
Apache-2.0
Repository
https://github.com/stanford-crfm/helm

Install

pip install crfm-helm

Quickstart

# run a small local eval (10 instances per scenario), then browse the results
helm-run --run-entries mmlu:subject=abstract_algebra,model=openai/gpt-4o-mini \
         --suite my-suite --max-eval-instances 10
helm-summarize --suite my-suite
helm-server --suite my-suite   # browse results at http://localhost:8000

Alternatives

  • OpenCompass — comparable large-scale suite led by Shanghai AI Lab
  • lm-eval-harness — EleutherAI's lighter-weight de facto standard
  • BIG-bench / BBH — focused suites of hard reasoning tasks
  • MLCommons AI Benchmarks — industry-consortium benchmarks

Frequently asked questions

What does 'holistic' mean here?

It means reporting a full matrix of metrics per scenario — not just accuracy — so you can see, e.g., that a model is accurate but poorly calibrated, or robust but slow.
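The idea can be sketched in a few lines. This is a toy illustration only, not the HELM API: the predictions, labels, and the `calibration_gap` metric below are made-up examples of how a per-scenario metric matrix can expose a model that is accurate yet overconfident.

```python
# Toy sketch (not HELM code): a "holistic" report is a scenario x metric
# matrix, so one model can score well on one axis and poorly on another.
def accuracy(preds, labels):
    # preds are (predicted_label, confidence) pairs
    return sum(p == y for (p, _), y in zip(preds, labels)) / len(labels)

def avg_confidence(preds):
    return sum(c for _, c in preds) / len(preds)

# Made-up predictions for one scenario
preds = [("A", 0.99), ("B", 0.98), ("A", 0.97), ("C", 0.99)]
labels = ["A", "B", "A", "B"]

acc = accuracy(preds, labels)    # 0.75: reasonably accurate
conf = avg_confidence(preds)     # ~0.98: but badly overconfident
report = {"mmlu": {"accuracy": acc, "calibration_gap": conf - acc}}
print(report)
```

A real HELM report has many more rows (scenarios) and columns (metrics), but the reading is the same: each cell is one metric on one scenario, and no single number summarizes the model.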

How expensive is running HELM?

A full HELM Classic run on a frontier model can cost thousands of dollars in API credits. HELM Lite is the affordable subset most teams use, and `helm-run` lets you scope a run further with `--max-eval-instances`.
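Cost scales roughly linearly with the number of eval instances, which is why `--max-eval-instances` is the main cost knob. A back-of-envelope sketch, where the instance counts, tokens per instance, and per-token price are all assumptions for illustration, not HELM's or any provider's actual numbers:

```python
# Rough cost model (all constants below are assumptions, not real prices):
# total cost ≈ instances x tokens per instance x price per token.
def est_cost_usd(instances, tokens_per_instance, usd_per_1m_tokens):
    return instances * tokens_per_instance * usd_per_1m_tokens / 1_000_000

# Hypothetical large run vs. a run scoped with --max-eval-instances 10
full = est_cost_usd(instances=10_000, tokens_per_instance=2_000, usd_per_1m_tokens=5.0)
scoped = est_cost_usd(instances=10, tokens_per_instance=2_000, usd_per_1m_tokens=5.0)
print(f"large run: ${full:.2f}, scoped run: ${scoped:.4f}")
```

Plugging in your own provider's pricing and a realistic token count per instance gives a quick sanity check before launching a suite.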

Sources

  1. HELM website — accessed 2026-04-20
  2. HELM GitHub — accessed 2026-04-20