Agent Cost and Token Budget Patterns
Because agents loop, their token consumption is unbounded by default. A single runaway run has cost teams thousands of dollars. Production agents adopt explicit budgets: a cap on total tokens, a cap per step, prompt caching, context pruning, routing cheap sub-tasks to smaller models, and hard stops when budgets are exceeded. Without these, the first cost signal arrives from the finance team after the fact.
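The per-step and total caps above can be sketched as a small tracker that the agent loop charges after each model call. A minimal sketch; the class name, limits, and error messages are assumptions, not a standard API:

```python
# Per-run token budget tracker: enforces a per-step cap and a total cap,
# hard-stopping the run (via an exception) when either is exceeded.
class TokenBudget:
    def __init__(self, max_total: int, max_per_step: int):
        self.max_total = max_total
        self.max_per_step = max_per_step
        self.used = 0

    def charge(self, step_tokens: int) -> None:
        """Record one step's token usage; raise to hard-stop on overrun."""
        if step_tokens > self.max_per_step:
            raise RuntimeError(
                f"step used {step_tokens} tokens, per-step cap is {self.max_per_step}")
        self.used += step_tokens
        if self.used > self.max_total:
            raise RuntimeError(
                f"run used {self.used} tokens, total budget is {self.max_total}")

budget = TokenBudget(max_total=100_000, max_per_step=8_000)
budget.charge(5_000)  # within both caps; budget.used is now 5000
```

Raising an exception, rather than returning a flag, guarantees the loop cannot accidentally continue past a hard stop.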
Protocol facts
- Sponsor: open community
- Status: stable
- Interop with: Anthropic prompt caching, OpenAI batch API, Helicone, LangSmith
Frequently asked questions
What's the single biggest lever?
Prompt caching. For agent loops that reuse the same system prompt and tool definitions across many turns, caching cuts input-token cost substantially on supported providers: on Anthropic, cache reads are priced at roughly a tenth of the base input rate (about a 90% saving on the cached prefix), while OpenAI and Google offer their own cached-input discounts.
How should budget enforcement work?
Two-tier: a soft warning (e.g., 50% of budget) that nudges the agent to summarize and conclude, and a hard stop at 100% that terminates the run and surfaces a human-readable checkpoint.
What about routing easy steps to cheaper models?
Three-tier routing — WASM/local for trivial transforms, Haiku/mini for simple steps, Sonnet/Opus for reasoning — can reduce average cost per task by an order of magnitude without meaningful quality loss.
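A routing table like this can be a plain lookup keyed on a step classification. A sketch under the assumption that steps are already labeled; the tier names and model identifiers are placeholders:

```python
# Three-tier model router: trivial transforms never hit an LLM, simple
# steps go to a small model, and only hard steps pay for a frontier model.
def route(step_kind: str) -> str:
    tiers = {
        "transform": "local",          # deterministic code, no LLM call
        "simple": "small-model",       # e.g. a Haiku/mini-class model
        "reasoning": "frontier-model", # e.g. a Sonnet/Opus-class model
    }
    # Unknown step kinds default to the most capable tier: misrouting an
    # easy step up costs money; misrouting a hard step down costs quality.
    return tiers.get(step_kind, "frontier-model")
```

The defaulting rule encodes the asymmetry that makes routing safe: the failure mode of over-routing is expense, while the failure mode of under-routing is a wrong answer.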
Sources
- Anthropic — prompt caching — accessed 2026-04-20
- OpenAI — batch API — accessed 2026-04-20