tau-bench — Tool-Augmented Agent Benchmark
tau-bench (τ-bench) evaluates agents in realistic customer-support scenarios across retail and airline domains: a user asks for help, and the agent must consult policy documents, call APIs (book flights, issue refunds), and converse naturally while staying within business rules. Open-sourced by Sierra in 2024, it has become a standard benchmark for enterprise conversational agents.
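The episode structure above can be sketched as a loop in which the agent alternates between messaging the user and calling domain tools. The tool names, data model, and error handling below are illustrative assumptions, not the real tau-bench API:

```python
# Toy "airline" tool layer of the kind a tau-bench-style agent would call.
# Booking, DB, and issue_refund are hypothetical stand-ins for the benchmark's
# real domain tools.
from dataclasses import dataclass

@dataclass
class Booking:
    booking_id: str
    price: float
    refunded: bool = False

DB = {"BK1": Booking("BK1", 199.0)}

def issue_refund(booking_id: str) -> str:
    """A tool call the agent can make; returns a result string for the transcript."""
    booking = DB[booking_id]
    if booking.refunded:
        return "error: already refunded"
    booking.refunded = True
    return f"refunded ${booking.price:.2f} for {booking_id}"

print(issue_refund("BK1"))  # first call mutates state and succeeds
print(issue_refund("BK1"))  # second call is rejected by the tool layer
```

The point of the tool layer is that state changes persist across turns, so the grader can check the final database state, not just the conversation.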
Protocol facts
- Sponsor: Sierra
- Status: stable
- Spec: https://github.com/sierra-research/tau-bench
- Interop with: LangChain, OpenAI Assistants, Anthropic tools
Frequently asked questions
Why does tau-bench simulate customers?
Because real conversational agents must handle ambiguous, multi-turn dialogue. tau-bench uses an LLM to simulate the user, making each test a realistic negotiation rather than a fixed script.
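The simulated-user idea can be sketched as a dialogue loop in which a stand-in for the user LLM pursues a hidden goal across turns. Everything here is a hypothetical stub (including the stop token and the toy agent); tau-bench drives this role with a real LLM:

```python
# Stubbed user simulator: replies depend on a hidden goal and the agent's last
# message, so the agent faces multi-turn dialogue rather than a fixed script.
def simulated_user(goal: str, agent_msg: str, turn: int) -> str:
    """Stand-in for an LLM call. Emits '###STOP###' once the goal is met."""
    if "refund issued" in agent_msg.lower():
        return "###STOP###"
    if turn == 0:
        return "Hi, I have a problem with my flight."  # deliberately vague opener
    return f"Actually, what I need is: {goal}"         # goal surfaces under probing

def run_dialogue(goal: str, max_turns: int = 4) -> list[str]:
    transcript = []
    agent_msg = "How can I help you today?"
    for turn in range(max_turns):
        user_msg = simulated_user(goal, agent_msg, turn)
        transcript.append(user_msg)
        if user_msg == "###STOP###":
            break
        # Toy agent: resolves only once the user states the goal explicitly.
        agent_msg = "Refund issued." if "refund" in user_msg else "Could you clarify?"
    return transcript

history = run_dialogue("a refund for booking BK1")
```

Because the simulator's first message is ambiguous, the agent must ask a clarifying question before it can act, which is exactly the behavior a fixed script cannot test.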
What does 'policy compliance' mean here?
Each domain has a policy document (e.g., 'refunds only within 30 days of purchase'). The agent must follow policy even when the simulated user pressures it — deviations are scored as failures.
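A rule like the 30-day refund window reduces to a deterministic check the grader can apply to the agent's actions; an out-of-window refund counts as a failure even if the simulated user insisted. This is an illustrative sketch, with hypothetical names, of that style of rule:

```python
# One policy rule from a hypothetical policy document, as a checkable predicate.
from datetime import date

REFUND_WINDOW_DAYS = 30  # assumed value, in the spirit of the quoted policy

def refund_allowed(purchase_date: date, today: date) -> bool:
    """True iff a refund would comply with the 30-day-window rule."""
    return (today - purchase_date).days <= REFUND_WINDOW_DAYS

print(refund_allowed(date(2026, 4, 1), date(2026, 4, 20)))  # True: 19 days
print(refund_allowed(date(2026, 1, 1), date(2026, 4, 20)))  # False: 109 days
```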
How do frontier models score?
As of early 2026, top frontier models complete around 50–65% of tau-bench airline tasks end-to-end — still well below human support agents, which is why the benchmark remains widely cited.
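End-to-end scores are usually reported with the pass^k metric from the tau-bench paper: with n trials per task and c successes, the probability that k randomly chosen trials all succeed is C(c, k) / C(n, k), averaged over tasks. A minimal reimplementation sketch of that estimator:

```python
# pass^k estimator: rewards consistency, since one flaky success out of many
# trials contributes little once k > 1.
from math import comb

def pass_hat_k(trials: list[tuple[int, int]], k: int) -> float:
    """Each tuple is (c, n): successes c out of n trials for one task."""
    vals = [comb(c, k) / comb(n, k) for c, n in trials if n >= k]
    return sum(vals) / len(vals)

# Two tasks, 4 trials each: one solved 4/4, one solved 2/4.
print(pass_hat_k([(4, 4), (2, 4)], k=1))  # 0.75, the plain per-trial pass rate
print(pass_hat_k([(4, 4), (2, 4)], k=2))  # lower: the flaky task drags it down
```

The gap between pass^1 and pass^k at larger k is one reason headline completion rates understate how far agents are from dependable support work.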
Sources
- tau-bench GitHub — accessed 2026-04-20
- tau-bench paper — accessed 2026-04-20