EleutherAI lm-evaluation-harness

The lm-evaluation-harness is the most widely used academic evaluation framework for LLMs: the engine behind Hugging Face's Open LLM Leaderboard and the standard evaluation tool in most pretraining papers. It exposes hundreds of tasks through a uniform interface; supports backends including Hugging Face Transformers, vLLM, OpenAI-compatible APIs, and Anthropic models; and writes JSON results that are directly comparable across runs.

Framework facts

Category: evals
Language: Python
License: MIT
Repository: https://github.com/EleutherAI/lm-evaluation-harness

Install

pip install lm-eval

Quickstart

lm_eval --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tasks mmlu,gsm8k,hellaswag \
  --device cuda --batch_size 8
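
When pointed at an output path, a run like the one above writes a JSON results file. Here is a minimal sketch of pulling the headline numbers out of one, assuming the harness's usual layout (a top-level `results` mapping from task name to metric keys such as `acc,none` or `exact_match,strict-match`; exact keys vary by task and harness version, and the numbers below are made up for illustration):

```python
import json

# Hypothetical results payload in the shape lm-evaluation-harness
# typically emits: {"results": {task: {metric_key: value, ...}}}.
results_blob = json.dumps({
    "results": {
        "gsm8k": {"exact_match,strict-match": 0.742, "alias": "gsm8k"},
        "hellaswag": {"acc,none": 0.801, "acc_norm,none": 0.823, "alias": "hellaswag"},
    }
})

def headline_metrics(raw: str) -> dict:
    """Extract numeric metrics per task, dropping non-metric fields like 'alias'."""
    data = json.loads(raw)
    return {
        task: {k: v for k, v in metrics.items() if isinstance(v, (int, float))}
        for task, metrics in data["results"].items()
    }

for task, metrics in headline_metrics(results_blob).items():
    print(task, metrics)
```

The same loop works across any number of runs, which is what makes the uniform output format useful for regression tracking.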

Alternatives

  • OpenCompass — Shanghai AI Lab
  • HELM — Stanford CRFM
  • OpenAI Evals

Frequently asked questions

Is this the framework behind Open LLM Leaderboard?

Yes — Hugging Face runs the leaderboard on a pinned version of lm-evaluation-harness, which is why the framework is the reference for comparing pretraining runs.
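
Because runs share an output schema, comparing two checkpoints reduces to diffing their `results` sections. A sketch with made-up numbers (real files come from the harness's output directory):

```python
# Hypothetical "results" sections from two evaluation runs.
run_a = {"mmlu": {"acc,none": 0.652}, "gsm8k": {"exact_match,strict-match": 0.74}}
run_b = {"mmlu": {"acc,none": 0.631}, "gsm8k": {"exact_match,strict-match": 0.77}}

def diff_runs(a: dict, b: dict) -> dict:
    """Per-task, per-metric delta (b - a) for metrics present in both runs."""
    deltas = {}
    for task in a.keys() & b.keys():
        shared = a[task].keys() & b[task].keys()
        deltas[task] = {m: round(b[task][m] - a[task][m], 6) for m in shared}
    return deltas

print(diff_runs(run_a, run_b))
```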

Can I evaluate API models like Claude or GPT-5?

Yes — use `--model openai-chat-completions`, `--model anthropic-chat`, or the generic `--model local-chat-completions` for any OpenAI-compatible endpoint.
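
With `local-chat-completions`, the harness sends OpenAI-style chat-completions requests to whatever endpoint you configure via `--model_args` (commonly a `base_url` and served model name; check your harness version's docs for the exact argument names). A rough sketch of the minimal request body such an endpoint must accept; the model name and prompt here are placeholders, not values the harness mandates:

```python
import json

# Minimal OpenAI-compatible chat-completions request body.
# "my-local-model" is a placeholder served model name (assumption).
request_body = {
    "model": "my-local-model",
    "messages": [
        {"role": "user", "content": "Question: 2 + 2 = ?\nAnswer:"}
    ],
    "temperature": 0.0,   # greedy decoding for reproducible evals
    "max_tokens": 256,
}

payload = json.dumps(request_body)
print(payload)
```

Any server that handles this shape (vLLM's OpenAI server, llama.cpp's server, a custom proxy) can be evaluated without code changes.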

Sources

  1. lm-evaluation-harness GitHub — accessed 2026-04-20
  2. Open LLM Leaderboard — accessed 2026-04-20