OpenAI Evals for Agent Workflows
OpenAI Evals started as an open-source framework for grading model outputs, then matured in 2024-2025 into a hosted Evals product inside the OpenAI platform. It supports dataset-based evaluations, LLM-as-judge graders, custom Python graders, and trajectory evals for agentic runs, and is now a first-class part of the Assistants and Agents API workflow.
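The core loop of a dataset-based eval with a custom Python grader can be sketched in a few lines. This is an illustrative, framework-free sketch: the names `Sample`, `run_eval`, and `exact_match_grader` are hypothetical, not the Evals API.

```python
# Hypothetical sketch of a dataset-based eval with a custom Python grader.
# Names (Sample, run_eval, exact_match_grader) are illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    prompt: str
    ideal: str  # reference answer for this prompt

def exact_match_grader(output: str, sample: Sample) -> float:
    # Custom grader: 1.0 if the model output matches the reference exactly.
    return 1.0 if output.strip() == sample.ideal.strip() else 0.0

def run_eval(completion_fn: Callable[[str], str],
             dataset: list[Sample],
             grader: Callable[[str, Sample], float]) -> float:
    # Grade every sample and report the mean score.
    scores = [grader(completion_fn(s.prompt), s) for s in dataset]
    return sum(scores) / len(scores)

# Stub completion function standing in for a real model call.
dataset = [Sample("2+2=", "4"), Sample("Capital of France?", "Paris")]
accuracy = run_eval(lambda p: {"2+2=": "4"}.get(p, "Paris"),
                    dataset, exact_match_grader)
print(accuracy)  # 1.0
```

An LLM-as-judge grader slots into the same shape: replace `exact_match_grader` with a function that asks a judge model to score the output.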
Protocol facts
- Sponsor: OpenAI
- Status: stable
- Spec: https://platform.openai.com/docs/guides/evals
- Interop with: OpenAI Agents SDK, Assistants API, Pydantic
Frequently asked questions
What's the difference between the OSS Evals repo and the hosted Evals product?
The OSS repo (github.com/openai/evals) is a lightweight Python framework; the hosted product adds a UI, storage, scheduled runs, and integration with the Responses API. Many teams start with the OSS repo, then move to the hosted product for production.
How do trajectory evals differ from single-turn evals?
Trajectory evals score a whole agent run — the sequence of tool calls, intermediate messages, and final output — rather than just the final string. This is essential for debugging multi-step agent failures.
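A trajectory grader can be sketched as a check over the whole run rather than the final string alone. This is an assumed shape, not the product's API: `Trajectory` and `grade_trajectory` are hypothetical names.

```python
# Illustrative trajectory eval: score the full agent run (ordered tool calls
# plus final output), not just the final string. Names are hypothetical.
from dataclasses import dataclass

@dataclass
class Trajectory:
    tool_calls: list[str]   # tool names in the order the agent invoked them
    final_output: str

def grade_trajectory(run: Trajectory,
                     expected_tools: list[str],
                     expected_answer: str) -> dict:
    # Partial credit: right tools in the right order, and right answer.
    tools_ok = run.tool_calls == expected_tools
    answer_ok = expected_answer in run.final_output
    return {"tools_ok": tools_ok,
            "answer_ok": answer_ok,
            "score": 0.5 * tools_ok + 0.5 * answer_ok}

run = Trajectory(tool_calls=["search", "calculator"],
                 final_output="The total is 42.")
print(grade_trajectory(run, ["search", "calculator"], "42"))
# {'tools_ok': True, 'answer_ok': True, 'score': 1.0}
```

Splitting the score this way is what makes multi-step failures debuggable: a run that calls the wrong tools but stumbles onto the right answer is flagged differently from one that fails outright.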
Does OpenAI Evals work with non-OpenAI models?
The OSS framework is model-agnostic (you supply the completion function). The hosted product primarily targets OpenAI models but can grade outputs from any source.
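The model-agnostic pattern comes down to the harness accepting any completion function, so the same grading code can score an OpenAI model, a local model, or a cached log of outputs. A minimal sketch, with illustrative names:

```python
# Minimal sketch of the model-agnostic pattern: the grader takes any
# completion function. `grade_includes` is a hypothetical name.
from typing import Callable

CompletionFn = Callable[[str], str]

def grade_includes(completion_fn: CompletionFn, prompt: str,
                   must_include: str) -> bool:
    # "Includes" grader: pass if the reference string appears in the output.
    return must_include in completion_fn(prompt)

# Swap in any backend: an API call, a local model, or a fixture like this.
fixture: CompletionFn = lambda prompt: "Paris is the capital of France."
print(grade_includes(fixture, "Capital of France?", "Paris"))  # True
```

Because the backend is just a callable, grading outputs from a non-OpenAI source means wrapping that source in the same signature, with no change to the eval itself.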
Sources
- OpenAI Evals guide — accessed 2026-04-20
- OpenAI Evals GitHub — accessed 2026-04-20