OpenAI Evals
OpenAI Evals gave the community a common shape for evals: a YAML registry entry that names a dataset and a grading class, plus a small runner that calls the OpenAI API or a local model. It has since been surpassed in scope by HELM and OpenCompass, but it remains the reference toolkit for authoring a new eval cheaply.
Framework facts
- Category
- evals
- Language
- Python
- License
- MIT
- Repository
- https://github.com/openai/evals
Install
git clone https://github.com/openai/evals.git
cd evals && pip install -e .
Quickstart
oaieval gpt-4o-mini test-match
# or create a new eval:
# evals/registry/evals/my-eval.yaml points to a .jsonl and a match grader
oaieval gpt-4o-mini my-eval
Alternatives
- HELM — Stanford's broader evaluator
- OpenCompass — Chinese-led comprehensive suite
- lm-eval-harness — EleutherAI standard
- Inspect AI — newer eval framework with rich grading
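To make the quickstart concrete, a registry entry of the kind it describes might look like the following sketch. The eval name, id, and file paths are illustrative; `Match` refers to the exact-match grader shipped in the repo's `elsuite`:

```yaml
# evals/registry/evals/my-eval.yaml — illustrative registry entry
my-eval:
  id: my-eval.dev.v0

my-eval.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: my-eval/samples.jsonl
```

Each line of the referenced `.jsonl` pairs a chat-format input with an ideal answer, along the lines of `{"input": [{"role": "user", "content": "2+2="}], "ideal": "4"}`.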
Frequently asked questions
Is this the same as the OpenAI Evals dashboard in the API product?
Related but not identical. The open-source repo is a code-first CLI for authoring and running evals locally. The product dashboard uses a superset of the same conventions but is managed, with built-in grading and traces.
Does it work with non-OpenAI models?
Yes, via the `completion_fns` plug-in system. You can wire it to Anthropic, local Hugging Face, vLLM, or any HTTP endpoint.
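A minimal sketch of what that plug-in shape looks like, assuming the protocol described above: a completion function is a callable that takes a prompt and returns an object exposing `get_completions()`. The class names and the echo backend below are hypothetical stand-ins; a real plug-in would call Anthropic, a local Hugging Face model, vLLM, or any HTTP endpoint inside `__call__`.

```python
class SimpleCompletionResult:
    """Wraps raw model output in the shape the eval runner reads back."""

    def __init__(self, text: str):
        self.text = text

    def get_completions(self) -> list[str]:
        return [self.text]


class EchoCompletionFn:
    """Hypothetical backend that echoes the prompt. Swap the body of
    __call__ for a request to your own model server or SDK."""

    def __call__(self, prompt: str, **kwargs) -> SimpleCompletionResult:
        return SimpleCompletionResult(f"echo: {prompt}")


fn = EchoCompletionFn()
result = fn("2+2=")
print(result.get_completions())  # → ['echo: 2+2=']
```

Once wired to a real backend, such a class is pointed at from the registry so `oaieval` can select it by name instead of an OpenAI model.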
Sources
- OpenAI Evals GitHub — accessed 2026-04-20
- OpenAI evals product — accessed 2026-04-20