OpenAI Evals for Agent Workflows

OpenAI Evals began as an open-source framework for grading model outputs, then matured over 2024-2025 into a hosted Evals product inside the OpenAI platform. It supports dataset-based evaluations, LLM-as-judge graders, custom Python graders, and trajectory evals for agentic runs, and is now a first-class part of the Assistants and Agents API workflow.
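To make the "custom grader" idea concrete, here is a minimal sketch in the spirit of the framework's match-style graders. The names (`Sample`, `grade_sample`, `accuracy`) are illustrative, not the library's actual API:

```python
# Hypothetical sketch of a custom grader: compare a model's completion
# against an "ideal" reference answer, then aggregate a pass rate.
# Names here are illustrative, not OpenAI Evals' real API surface.
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str        # what the model was asked
    ideal: str         # the reference answer
    completion: str    # what the model actually produced

def grade_sample(sample: Sample) -> bool:
    """Exact-match grading after light normalization."""
    return sample.completion.strip().lower() == sample.ideal.strip().lower()

def accuracy(samples: list[Sample]) -> float:
    """Aggregate pass rate across a graded dataset."""
    if not samples:
        return 0.0
    return sum(grade_sample(s) for s in samples) / len(samples)
```

An LLM-as-judge grader replaces the exact-match comparison with a second model call that scores the completion; the aggregation logic stays the same.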

Protocol facts

Sponsor
OpenAI
Status
stable
Spec
https://platform.openai.com/docs/guides/evals
Interop with
OpenAI Agents SDK, Assistants API, Pydantic

Frequently asked questions

What's the difference between the OSS Evals repo and the hosted Evals product?

The OSS repo (github.com/openai/evals) is a lightweight Python framework; the hosted product adds a UI, result storage, scheduled runs, and integration with the Responses API. Many teams start with the OSS repo, then move to the hosted product for production.

How do trajectory evals differ from single-turn evals?

Trajectory evals score a whole agent run — the sequence of tool calls, intermediate messages, and final output — rather than just the final string. This is essential for debugging multi-step agent failures.
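A simplified sketch of what a trajectory grader checks (this models the idea, not the hosted product's actual schema): a run is the ordered list of tool calls plus the final output, and each aspect is scored separately so a failure points at the failing step.

```python
# Illustrative trajectory eval: score the whole agent run, not just the
# final string. Types and field names here are assumptions for the sketch.
from dataclasses import dataclass

@dataclass
class Trajectory:
    tool_calls: list[str]       # tool names in the order invoked
    final_output: str

@dataclass
class TrajectoryCheck:
    required_tools: list[str]   # tools that must appear, in order
    expected_answer: str

def grade_trajectory(run: Trajectory, check: TrajectoryCheck) -> dict:
    """Return per-aspect scores so multi-step failures are debuggable."""
    # Required tools must appear as an in-order subsequence of actual calls.
    it = iter(run.tool_calls)
    tools_ok = all(tool in it for tool in check.required_tools)
    answer_ok = run.final_output.strip() == check.expected_answer.strip()
    return {"tools_in_order": tools_ok,
            "final_answer": answer_ok,
            "passed": tools_ok and answer_ok}
```

With per-aspect results, a run that called the right tools but produced a wrong answer is distinguishable from one that skipped a tool entirely.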

Does OpenAI Evals work with non-OpenAI models?

The OSS framework is model-agnostic (you supply the completion function). The hosted product primarily targets OpenAI models but can grade outputs from any source.
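The model-agnostic idea can be sketched as follows: the harness only needs a completion function (prompt in, text out), so any model plugs in behind that callable. This is a simplified stand-in for the framework, not its real interface:

```python
# Sketch of a model-agnostic eval harness: the only contract is a
# "completion function" mapping a prompt string to an output string.
# The harness below is a stand-in, not the evals library itself.
from typing import Callable

CompletionFn = Callable[[str], str]

def run_eval(completion_fn: CompletionFn,
             dataset: list[tuple[str, str]]) -> float:
    """Run every (prompt, ideal) pair through the model; report accuracy."""
    passed = 0
    for prompt, ideal in dataset:
        output = completion_fn(prompt)
        passed += output.strip() == ideal.strip()
    return passed / len(dataset) if dataset else 0.0

def echo_model(prompt: str) -> str:
    # Trivial stand-in model, used only to show the plug-in point; a real
    # completion_fn would wrap an OpenAI client or any other model API.
    return prompt.split("=")[-1]
```

Because the contract is just a callable, swapping in a non-OpenAI model means wrapping its client in a function of the same shape.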

Sources

  1. OpenAI Evals guide — accessed 2026-04-20
  2. OpenAI Evals GitHub — accessed 2026-04-20