Capability · Framework — evals

OpenAI Evals

OpenAI Evals gave the community a common shape for evals: a YAML file that names a dataset and a grading template, plus a small runner that calls the OpenAI API or a local model. It has since been surpassed in scope by HELM and OpenCompass, but it remains the reference toolkit for authoring a new eval cheaply.

Framework facts

Category
evals
Language
Python
License
MIT
Repository
https://github.com/openai/evals

Install

git clone https://github.com/openai/evals.git
cd evals && pip install -e .

Quickstart

oaieval gpt-4o-mini test-match
# or create a new eval:
# evals/registry/evals/my-eval.yaml points to a .jsonl and a match grader
oaieval gpt-4o-mini my-eval
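A new eval is declared with a registry YAML entry. The following is a minimal sketch, assuming the built-in exact-match grader; the eval name and samples path are illustrative, not real files in the repo:

```yaml
# evals/registry/evals/my-eval.yaml — hypothetical registry entry.
# `evals.elsuite.basic.match:Match` is the built-in exact-match eval class;
# `my-eval` and the samples path are placeholder names.
my-eval:
  id: my-eval.dev.v0
  metrics: [accuracy]
my-eval.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: my-eval/samples.jsonl
```

Each line of the referenced `.jsonl` is one sample: a chat-style `input` plus the expected `ideal` answer, e.g. `{"input": [{"role": "user", "content": "2+2="}], "ideal": "4"}`.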

Alternatives

  • HELM — Stanford's broader evaluator
  • OpenCompass — Chinese-led comprehensive suite
  • lm-eval-harness — EleutherAI standard
  • Inspect AI — newer eval framework with rich grading

Frequently asked questions

Is this the same as the OpenAI Evals dashboard in the API product?

Related but not identical. The open-source repo is a code-first CLI for authoring and running evals locally. The product dashboard uses a superset of the same conventions but is a managed service, with built-in grading and traces.

Does it work with non-OpenAI models?

Yes, via the `completion_fns` plug-in system. You can wire it to Anthropic, a local Hugging Face model, vLLM, or any HTTP endpoint.
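The shape of that plug-in is simple: a completion function is a callable that takes a prompt and returns a result object exposing `get_completions()`. Below is a minimal sketch of that protocol; `EchoCompletionFn` is a hypothetical stand-in for a real backend client, not part of the library:

```python
# Sketch of a custom completion function following the shape of the
# evals completion-fn protocol: callable in, result object out.
# EchoCompletionFn is a placeholder for a real backend (Anthropic,
# vLLM, an HTTP endpoint, ...); it just echoes the prompt.

class EchoCompletionResult:
    def __init__(self, text: str):
        self.text = text

    def get_completions(self) -> list[str]:
        # The runner reads the model's answers from this list.
        return [self.text]


class EchoCompletionFn:
    def __call__(self, prompt, **kwargs) -> EchoCompletionResult:
        # A real implementation would send `prompt` to your model's API
        # and wrap the reply; here we echo it back for illustration.
        return EchoCompletionResult(str(prompt))
```

To use one in practice, you register it under a name in a completion-fn registry YAML so `oaieval <your-fn-name> my-eval` can resolve it.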

Sources

  1. OpenAI Evals GitHub — accessed 2026-04-20
  2. OpenAI evals product — accessed 2026-04-20