Stanford HELM
HELM (Holistic Evaluation of Language Models) established the template of a leaderboard that reports ethics-adjacent dimensions alongside raw capability. Every model run is scored across many scenarios and many metrics: not just accuracy but also calibration, robustness, fairness, toxicity, and efficiency. The HELM Lite, HELM Instruct, and HELM Classic leaderboards report results for hundreds of models from public and private labs.
Framework facts
- Category
- evals
- Language
- Python
- License
- Apache-2.0
- Repository
- https://github.com/stanford-crfm/helm
Install
pip install crfm-helm
Quickstart
# run a small local eval: 10 instances of one MMLU subject
helm-run --run-entries mmlu:subject=abstract_algebra,model=openai/gpt-4o-mini \
--suite my-suite --max-eval-instances 10
helm-summarize --suite my-suite
helm-server --suite my-suite # browse results at http://localhost:8000
Alternatives
- OpenCompass — large Chinese-led equivalent
- lm-eval-harness — lighter-weight EleutherAI standard
- BigBench / BBH — specific reasoning tasks
- MLCommons AI Benchmarks — industry consortium
Frequently asked questions
What does 'holistic' mean here?
It means reporting a full matrix of metrics per scenario — not just accuracy — so you can see, e.g., that a model is accurate but poorly calibrated, or robust but slow.
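The metric-matrix idea is easy to picture in code. Below is a minimal sketch, with made-up scores (not real HELM numbers or HELM's API), of the per-scenario, per-metric table that a holistic report produces:

```python
# Illustrative sketch only: the scenarios, metric names, and scores below are
# invented to show the shape of a holistic report, not actual HELM output.
results = {
    "mmlu":         {"accuracy": 0.82, "calibration_error": 0.21, "robustness": 0.74},
    "narrative_qa": {"accuracy": 0.68, "calibration_error": 0.09, "robustness": 0.71},
}
metrics = ["accuracy", "calibration_error", "robustness"]

# One row per scenario puts the trade-offs side by side: the model that tops
# the accuracy column can still trail on calibration, exactly the pattern a
# single-number leaderboard hides.
header = "scenario".ljust(14) + "".join(m.rjust(20) for m in metrics)
print(header)
for scenario, scores in results.items():
    print(scenario.ljust(14) + "".join(f"{scores[m]:20.2f}" for m in metrics))
```

Reading across a row (one scenario, all metrics) shows a model's trade-offs; reading down a column (one metric, all scenarios) shows where a strength or weakness is consistent.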
How expensive is running HELM?
A full HELM Classic run on a frontier model costs thousands of dollars in API credits. HELM Lite is the affordable subset most teams use; `helm-run` lets you scope it further with `--max-eval-instances`.
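A back-of-envelope calculation shows why scoping matters. The sketch below uses assumed token counts and an assumed blended API price (placeholders, not published rates for any provider):

```python
# Rough cost estimate for a scoped eval run. Every constant here is an
# illustrative assumption, not a published HELM figure or API price.
instances_per_scenario = 10      # e.g. helm-run --max-eval-instances 10
num_scenarios = 5                # number of run entries in the suite (assumed)
tokens_per_instance = 2_000      # prompt + completion tokens (assumed)
price_per_million_tokens = 5.00  # USD, assumed blended input/output rate

total_tokens = instances_per_scenario * num_scenarios * tokens_per_instance
cost_usd = total_tokens / 1_000_000 * price_per_million_tokens
print(f"~{total_tokens:,} tokens, roughly ${cost_usd:.2f}")
```

Cost scales linearly with the instance count, so raising `instances_per_scenario` from 10 to the thousands a full run uses multiplies the bill proportionally; that is why `--max-eval-instances` is the first knob to turn before committing to a full suite.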
Sources
- HELM website — accessed 2026-04-20
- HELM GitHub — accessed 2026-04-20