Agent Cost and Token Budget Patterns
Because agents loop, their token consumption is unbounded by default. A single runaway run has cost teams thousands of dollars. Production agents adopt explicit budgets: a cap on total tokens, a cap per step, prompt caching, context pruning, routing cheap sub-tasks to smaller models, and hard stops when budgets are exceeded. Without these, the first cost signal arrives from the finance team after the fact.
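The per-step and total caps above can be sketched as a small tracker that the agent loop charges after each model call. A minimal sketch; the class name, limits, and error messages are assumptions, not a standard API:

```python
# Per-run token budget tracker: enforces a per-step cap and a total cap,
# hard-stopping the run (via an exception) when either is exceeded.
class TokenBudget:
    def __init__(self, max_total: int, max_per_step: int):
        self.max_total = max_total
        self.max_per_step = max_per_step
        self.used = 0

    def charge(self, step_tokens: int) -> None:
        """Record one step's token usage; raise to hard-stop on overrun."""
        if step_tokens > self.max_per_step:
            raise RuntimeError(
                f"step used {step_tokens} tokens, per-step cap is {self.max_per_step}")
        self.used += step_tokens
        if self.used > self.max_total:
            raise RuntimeError(
                f"run used {self.used} tokens, total budget is {self.max_total}")

budget = TokenBudget(max_total=100_000, max_per_step=8_000)
budget.charge(5_000)  # within both caps; budget.used is now 5000
```

Raising an exception, rather than returning a flag, guarantees the loop cannot accidentally continue past a hard stop.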
Protocol facts
- Sponsor: open community
- Status: stable
- Interop with: Anthropic prompt caching, OpenAI batch API, Helicone, LangSmith
Frequently asked questions
What's the single biggest lever?
Prompt caching. For agent loops that reuse the same system prompt and tool definitions across many turns, caching cuts input-token cost substantially on supported providers: on Anthropic, cache reads are priced at roughly a tenth of the base input rate (about a 90% saving on the cached prefix), while OpenAI and Google offer their own cached-input discounts.
How should budget enforcement work?
Two-tier: a soft warning (e.g., 50% of budget) that nudges the agent to summarize and conclude, and a hard stop at 100% that terminates the run and surfaces a human-readable checkpoint.
What about routing easy steps to cheaper models?
Three-tier routing — WASM/local for trivial transforms, Haiku/mini for simple steps, Sonnet/Opus for reasoning — can reduce average cost per task by an order of magnitude without meaningful quality loss.
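A routing table like this can be a plain lookup keyed on a step classification. A sketch under the assumption that steps are already labeled; the tier names and model identifiers are placeholders:

```python
# Three-tier model router: trivial transforms never hit an LLM, simple
# steps go to a small model, and only hard steps pay for a frontier model.
def route(step_kind: str) -> str:
    tiers = {
        "transform": "local",          # deterministic code, no LLM call
        "simple": "small-model",       # e.g. a Haiku/mini-class model
        "reasoning": "frontier-model", # e.g. a Sonnet/Opus-class model
    }
    # Unknown step kinds default to the most capable tier: misrouting an
    # easy step up costs money; misrouting a hard step down costs quality.
    return tiers.get(step_kind, "frontier-model")
```

The defaulting rule encodes the asymmetry that makes routing safe: the failure mode of over-routing is expense, while the failure mode of under-routing is a wrong answer.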
Sources
- Anthropic — prompt caching — accessed 2026-04-20
- OpenAI — batch API — accessed 2026-04-20