
Agent Retry-with-Backoff Pattern

LLM agent tool calls fail constantly: rate limits, timeouts, 5xx responses from upstream APIs, flaky scraping targets. The retry-with-backoff pattern handles this gracefully: catch the error, classify it as transient vs. permanent, wait with exponential backoff and jitter, then try again up to a budget. Skipping this pattern is one of the most common reasons agent prototypes break in production.
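The catch-classify-wait-retry loop can be sketched in a few lines of Python. This is a minimal illustration, not a production library; the names `call_with_retry` and `TransientError` are invented for this sketch, and the injectable `sleep` parameter is just a convenience for testing:

```python
import random
import time

class TransientError(Exception):
    """An error worth retrying: rate limit, timeout, 5xx upstream."""

def call_with_retry(fn, max_attempts=4, base_delay=0.5, max_delay=30.0,
                    sleep=time.sleep):
    """Run fn(), retrying transient failures with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the error to the caller
            # Exponential backoff: base * 2^attempt, capped, with +/-25% jitter.
            delay = min(base_delay * (2 ** attempt), max_delay)
            delay *= random.uniform(0.75, 1.25)
            sleep(delay)
```

Permanent errors (anything that is not a `TransientError` here) propagate immediately on the first attempt, which is the desired behavior: retrying them only wastes budget.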

Protocol facts

Sponsor
Distributed systems community
Status
stable
Interop with
tenacity, LangChain retry, AWS SDK retry policies

Frequently asked questions

What counts as a transient error?

HTTP 429 (rate limit), 502/503/504, network timeouts, and some 500s. Permanent errors like 400 (bad request) or 401 (unauthorized) should not be retried — they'll fail identically.
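One hedged way to encode that classification is a simple status-code lookup; the sets below follow the lists in this answer, and `is_retryable` is an illustrative name, not a standard API:

```python
# Transient: the same request may succeed later, so retrying is worthwhile.
RETRYABLE = {429, 500, 502, 503, 504}
# Permanent: the request itself is wrong; retrying will fail identically.
PERMANENT = {400, 401, 403, 404, 422}

def is_retryable(status_code: int) -> bool:
    """Classify an HTTP status as transient (retry) vs. permanent (fail fast)."""
    return status_code in RETRYABLE
```

Network-level timeouts carry no status code, so in practice they are classified by exception type (e.g., a socket timeout) and treated as transient alongside 429/5xx.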

Why add jitter?

Without jitter, N concurrent agents that hit the same rate limit retry in lockstep, creating thundering-herd waves. Random jitter (e.g., ±25%) spreads retries across time.
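The delay calculation itself is small enough to show inline. This sketch uses the +/-25% multiplicative jitter mentioned above; the function name and defaults are illustrative:

```python
import random

def backoff_with_jitter(attempt, base=0.5, cap=30.0, spread=0.25):
    """Exponential delay (base * 2^attempt, capped) with +/-spread jitter."""
    delay = min(base * (2 ** attempt), cap)
    # Multiply by a random factor in [1 - spread, 1 + spread] so that
    # concurrent agents that failed together do not retry together.
    return delay * random.uniform(1 - spread, 1 + spread)
```

The AWS write-up cited below also describes "full jitter" (a uniform draw from 0 up to the capped delay), which decorrelates retries even more aggressively at the cost of sometimes retrying almost immediately.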

How many retries?

3–5 is typical for user-facing work. For background agents you can go higher, but always cap the total wait time so a hung upstream doesn't permanently stall the agent.
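Capping total wait time can be layered on top of the attempt count. One way to do it, sketched here with invented names (`retry_with_deadline`) and injectable `sleep`/`clock` for testability, is to shrink each backoff so the cumulative sleep never exceeds the budget:

```python
import time

def retry_with_deadline(fn, max_attempts=10, base_delay=1.0,
                        max_total_wait=60.0, sleep=time.sleep,
                        clock=time.monotonic):
    """Retry with exponential backoff, but never sleep past a total-wait budget."""
    start = clock()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            waited = clock() - start
            # Clamp the next backoff to whatever budget remains.
            delay = min(base_delay * (2 ** attempt), max_total_wait - waited)
            if attempt == max_attempts - 1 or delay <= 0:
                raise  # out of attempts, or out of time budget
            sleep(delay)
```

With `max_attempts=10` and `max_total_wait=60.0`, a permanently hung upstream stalls the agent for at most about a minute instead of the several minutes that ten uncapped exponential backoffs would take.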

Sources

  1. AWS — exponential backoff and jitter — accessed 2026-04-20
  2. tenacity library — accessed 2026-04-20