EleutherAI lm-evaluation-harness
The lm-evaluation-harness is the most widely used academic evaluation framework for LLMs — the engine behind Hugging Face's Open LLM Leaderboard and the benchmark tool cited in many pretraining papers. It exposes hundreds of tasks through a uniform interface, supports Hugging Face Transformers, vLLM, OpenAI-compatible APIs, and Anthropic models, and produces CSV/JSON outputs that are directly comparable across runs.
Framework facts
- Category: evals
- Language: Python
- License: MIT
- Repository: https://github.com/EleutherAI/lm-evaluation-harness
Install

```shell
pip install lm-eval
```

Quickstart

```shell
lm_eval --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tasks mmlu,gsm8k,hellaswag \
  --device cuda --batch_size 8
```

Alternatives
- OpenCompass — Shanghai AI Lab
- HELM — Stanford CRFM
- OpenAI Evals
Frequently asked questions
Is this the framework behind Open LLM Leaderboard?
Yes — Hugging Face runs the leaderboard on a pinned version of lm-evaluation-harness, which is why the framework is the reference for comparing pretraining runs.
Can I evaluate API models like Claude or GPT-5?
Yes — use `--model openai-chat-completions`, `--model anthropic-chat`, or the generic `--model local-chat-completions` for any OpenAI-compatible endpoint.
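For a self-hosted OpenAI-compatible server (for example one started with vLLM), the generic backend is pointed at the endpoint through `model_args`. A sketch — the URL and model name are placeholders, and some servers ignore the API key even though the client requires one to be set:

```shell
export OPENAI_API_KEY=dummy  # placeholder; required by the client even if the server ignores it
lm_eval --model local-chat-completions \
  --model_args model=my-model,base_url=http://localhost:8000/v1/chat/completions \
  --tasks gsm8k \
  --apply_chat_template
```

`--apply_chat_template` matters for chat endpoints: without it, instruction-tuned models are prompted as raw completion models and scores can drop sharply.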
Sources
- lm-evaluation-harness GitHub — accessed 2026-04-20
- Open LLM Leaderboard — accessed 2026-04-20