tau-bench — Tool-Augmented Agent Benchmark

tau-bench (τ-bench) evaluates agents in realistic customer-support scenarios: a user asks for help, and the agent must consult policy documents, call APIs (to book flights or issue refunds), and converse naturally while staying within business rules. Open-sourced by Sierra in 2024, it has become one of the most widely cited benchmarks for enterprise conversational agents.

Protocol facts

Sponsor
Sierra
Status
stable
Spec
https://github.com/sierra-research/tau-bench
Interop with
LangChain, OpenAI Assistants, Anthropic tools

Frequently asked questions

Why does tau-bench simulate customers?

Because real conversational agents must handle ambiguous, multi-turn dialogue. tau-bench uses an LLM to simulate the user, making each test a realistic negotiation rather than a fixed script.
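The interaction loop can be sketched as follows. Everything here is a hypothetical stand-in: in tau-bench itself an LLM plays the user, and the stop sentinel and function names below are illustrative, not the benchmark's actual API.

```python
# Sketch of a simulated-user conversation loop (all names hypothetical).
# A scripted stub stands in for the LLM user simulator.

def scripted_user(history):
    """Return the next user utterance based on how many the user has sent."""
    script = [
        "I want a refund for my flight.",
        "It was booked ten days ago.",
        "###STOP###",  # sentinel: the simulated user ends the conversation
    ]
    sent = sum(1 for role, _ in history if role == "user")
    return script[sent]

def toy_agent(history):
    """Trivial agent placeholder; a real agent would call tools here."""
    return "Understood, let me check the policy."

def run_episode(user, agent, max_turns=10):
    """Alternate user and agent turns until the user stops."""
    history = []
    for _ in range(max_turns):
        utterance = user(history)
        history.append(("user", utterance))
        if utterance == "###STOP###":
            break
        history.append(("agent", agent(history)))
    return history
```

Because the user is generated rather than scripted at evaluation time, two runs of the same task can unfold differently, which is exactly the multi-turn ambiguity the benchmark is designed to probe.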

What does 'policy compliance' mean here?

Each domain has a policy document (e.g., 'refunds only within 30 days of purchase'). The agent must follow policy even when the simulated user pressures it — deviations are scored as failures.
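A rule like the 30-day refund window reduces to a simple check the agent must enforce regardless of user pressure. This is an illustrative sketch, not tau-bench's actual policy code:

```python
# Hypothetical policy check: refunds only within 30 days of purchase.
from datetime import date, timedelta

REFUND_WINDOW_DAYS = 30

def refund_allowed(purchase_date: date, today: date) -> bool:
    """Return True if the purchase is still inside the refund window."""
    return (today - purchase_date) <= timedelta(days=REFUND_WINDOW_DAYS)

print(refund_allowed(date(2026, 3, 1), date(2026, 3, 20)))  # True
print(refund_allowed(date(2026, 1, 1), date(2026, 3, 20)))  # False
```

The scoring treats the check as binding: an agent that grants the second refund to appease an insistent simulated user fails the task, even if the conversation itself reads as helpful.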

How do frontier models score?

As of early 2026, top frontier models complete around 50–65% of tau-bench airline tasks end-to-end — still well below human support agents, which is why the benchmark remains widely cited.
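Beyond single-attempt completion rates, the tau-bench paper also reports a stricter pass^k metric: the probability that k independent attempts at the same task all succeed, which punishes inconsistency. A minimal sketch of the usual combinatorial estimator, given n trials per task with c observed successes:

```python
# Estimate pass^k from n trials of a task, c of which succeeded:
# the chance that k trials drawn without replacement are all successes.
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """C(c, k) / C(n, k); 0 when there are fewer than k successes."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

print(pass_hat_k(8, 6, 1))  # 0.75
print(pass_hat_k(8, 6, 4))  # ≈ 0.214
```

The drop from k=1 to k=4 in this toy example mirrors the published pattern: models that look passable on a single attempt are far less reliable when required to succeed repeatedly.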

Sources

  1. tau-bench GitHub — accessed 2026-04-20
  2. tau-bench paper — accessed 2026-04-20