Agent Retry-with-Backoff Pattern
LLM agent tool calls fail constantly — rate limits, timeouts, 5xx from upstream APIs, flaky scraping targets. The retry-with-backoff pattern handles this gracefully: catch the error, classify it as transient vs. permanent, wait with exponential backoff and jitter, then try again up to a budget. Skipping this pattern is one of the biggest reasons agent prototypes break in production.
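The loop above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the names `call_with_retries` and `TransientError` are hypothetical, and the base delay, cap, and ±25% jitter spread are assumed example values.

```python
import random
import time

class TransientError(Exception):
    """Raised for errors worth retrying (rate limits, timeouts, 5xx)."""

def call_with_retries(fn, max_attempts=4, base=0.5, cap=8.0, sleep=time.sleep):
    """Retry fn() on transient errors, waiting with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the error to the agent
            delay = min(cap, base * 2 ** attempt)   # exponential backoff, capped
            delay *= random.uniform(0.75, 1.25)     # ±25% jitter
            sleep(delay)
```

Permanent errors (anything that is not a `TransientError` here) propagate immediately, which is the point of the classification step: retrying them only burns budget.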
Protocol facts
- Sponsor: Distributed systems community
- Status: stable
- Interop with: tenacity, LangChain retry, AWS SDK retry policies
Frequently asked questions
What counts as a transient error?
HTTP 429 (rate limit), 502/503/504, network timeouts, and some 500s. Permanent errors like 400 (bad request) or 401 (unauthorized) should not be retried — they'll fail identically.
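That classification can be captured in a small predicate. A sketch, assuming status-code-based classification; the function name `is_transient` and the choice to always retry 500 are illustrative, since the answer above notes only *some* 500s are worth retrying.

```python
def is_transient(status_code: int) -> bool:
    """Return True for HTTP errors likely to succeed on a later attempt."""
    if status_code == 429:
        return True                 # rate limited: back off, then retry
    if status_code in (502, 503, 504):
        return True                 # gateway/upstream trouble: usually transient
    if status_code == 500:
        return True                 # ambiguous; assumption: retry a bounded number of times
    return False                    # 400, 401, 403, etc. will fail identically
```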
Why add jitter?
Without jitter, N concurrent agents that hit the same rate limit retry in lockstep, creating thundering-herd waves. Random jitter (e.g., ±25%) spreads retries across time.
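A quick sketch of the difference (function names are illustrative): without jitter, every agent computes the identical delay for a given attempt, so they all wake up together; with jitter, the delays fan out.

```python
import random

def backoff_no_jitter(attempt: int, base: float = 1.0) -> float:
    """Deterministic: every agent on attempt N waits exactly the same time."""
    return base * 2 ** attempt

def backoff_with_jitter(attempt: int, base: float = 1.0, spread: float = 0.25) -> float:
    """±25% jitter: each agent draws a slightly different delay, spreading retries."""
    return base * 2 ** attempt * random.uniform(1 - spread, 1 + spread)
```

For attempt 2, the no-jitter version always returns 4.0 seconds, while the jittered version lands anywhere in [3.0, 5.0].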
How many retries?
3–5 is typical for user-facing work. For background agents you can go higher, but always cap the total wait time so a hung upstream doesn't permanently stall the agent.
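Capping total wait time can be sketched by tracking a deadline alongside the attempt count. The name `retry_with_deadline` and the injectable `clock`/`sleep` parameters are illustrative choices for testability, not from any particular library; tenacity offers similar combined stop conditions.

```python
import random
import time

def retry_with_deadline(fn, max_attempts=10, max_total_wait=30.0,
                        base=0.5, cap=8.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Retry fn() up to max_attempts, but also cap cumulative backoff time
    so a hung upstream cannot stall the agent indefinitely."""
    start = clock()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempt budget exhausted
            delay = min(cap, base * 2 ** attempt) * random.uniform(0.75, 1.25)
            if clock() - start + delay > max_total_wait:
                raise  # wait budget exhausted: give up rather than stall
            sleep(delay)
```

Whichever limit trips first ends the loop, so a background agent with a generous attempt count still has a hard ceiling on elapsed time.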
Sources
- AWS — exponential backoff and jitter — accessed 2026-04-20
- tenacity library — accessed 2026-04-20