
Agent Retry-with-Backoff Pattern

LLM agent tool calls fail constantly: rate limits, timeouts, 5xx responses from upstream APIs, flaky scraping targets. The retry-with-backoff pattern handles this gracefully: catch the error, classify it as transient vs. permanent, wait with exponential backoff and jitter, then try again up to a budget. Skipping this pattern is one of the most common reasons agent prototypes break in production.
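The catch-classify-wait-retry loop can be sketched in a few lines of Python. This is a minimal illustration, not a production library; the names `call_with_retry` and `TransientError` are invented for this sketch, and the injectable `sleep` parameter is just a convenience for testing:

```python
import random
import time

class TransientError(Exception):
    """An error worth retrying: rate limit, timeout, 5xx upstream."""

def call_with_retry(fn, max_attempts=4, base_delay=0.5, max_delay=30.0,
                    sleep=time.sleep):
    """Run fn(), retrying transient failures with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the error to the caller
            # Exponential backoff: base * 2^attempt, capped, with +/-25% jitter.
            delay = min(base_delay * (2 ** attempt), max_delay)
            delay *= random.uniform(0.75, 1.25)
            sleep(delay)
```

Permanent errors (anything that is not a `TransientError` here) propagate immediately on the first attempt, which is the desired behavior: retrying them only wastes budget.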

Protocol facts

Sponsor
Distributed systems community
Status
stable
Interop with
tenacity, LangChain retry, AWS SDK retry policies

Frequently asked questions

What counts as a transient error?

HTTP 429 (rate limit), 502/503/504, network timeouts, and some 500s. Permanent errors like 400 (bad request) or 401 (unauthorized) should not be retried — they'll fail identically.
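One hedged way to encode that classification is a simple status-code lookup; the sets below follow the lists in this answer, and `is_retryable` is an illustrative name, not a standard API:

```python
# Transient: the same request may succeed later, so retrying is worthwhile.
RETRYABLE = {429, 500, 502, 503, 504}
# Permanent: the request itself is wrong; retrying will fail identically.
PERMANENT = {400, 401, 403, 404, 422}

def is_retryable(status_code: int) -> bool:
    """Classify an HTTP status as transient (retry) vs. permanent (fail fast)."""
    return status_code in RETRYABLE
```

Network-level timeouts carry no status code, so in practice they are classified by exception type (e.g., a socket timeout) and treated as transient alongside 429/5xx.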

Why add jitter?

Without jitter, N concurrent agents that hit the same rate limit retry in lockstep, creating thundering-herd waves. Random jitter (e.g., ±25%) spreads retries across time.
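The delay calculation itself is small enough to show inline. This sketch uses the +/-25% multiplicative jitter mentioned above; the function name and defaults are illustrative:

```python
import random

def backoff_with_jitter(attempt, base=0.5, cap=30.0, spread=0.25):
    """Exponential delay (base * 2^attempt, capped) with +/-spread jitter."""
    delay = min(base * (2 ** attempt), cap)
    # Multiply by a random factor in [1 - spread, 1 + spread] so that
    # concurrent agents that failed together do not retry together.
    return delay * random.uniform(1 - spread, 1 + spread)
```

The AWS write-up cited below also describes "full jitter" (a uniform draw from 0 up to the capped delay), which decorrelates retries even more aggressively at the cost of sometimes retrying almost immediately.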

How many retries?

3–5 is typical for user-facing work. For background agents you can go higher, but always cap the total wait time so a hung upstream doesn't permanently stall the agent.
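Capping total wait time can be layered on top of the attempt count. One way to do it, sketched here with invented names (`retry_with_deadline`) and injectable `sleep`/`clock` for testability, is to shrink each backoff so the cumulative sleep never exceeds the budget:

```python
import time

def retry_with_deadline(fn, max_attempts=10, base_delay=1.0,
                        max_total_wait=60.0, sleep=time.sleep,
                        clock=time.monotonic):
    """Retry with exponential backoff, but never sleep past a total-wait budget."""
    start = clock()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            waited = clock() - start
            # Clamp the next backoff to whatever budget remains.
            delay = min(base_delay * (2 ** attempt), max_total_wait - waited)
            if attempt == max_attempts - 1 or delay <= 0:
                raise  # out of attempts, or out of time budget
            sleep(delay)
```

With `max_attempts=10` and `max_total_wait=60.0`, a permanently hung upstream stalls the agent for at most about a minute instead of the several minutes that ten uncapped exponential backoffs would take.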

Sources

  1. AWS — exponential backoff and jitter — accessed 2026-04-20
  2. tenacity library — accessed 2026-04-20