Creativity · Agent Protocol
Agent State and Checkpointing
A long-running agent that crashes halfway through a task and starts over has cost the user real money. The production pattern is explicit state: persist plan, step history, intermediate tool outputs, and memory at every step; resume from the latest checkpoint on failure; and expose checkpoints to users for inspection, branching, and time-travel. LangGraph, Claude Agent SDK, Temporal, and Inngest all provide primitives for this.
Protocol facts
- Sponsor
- open community
- Status
- stable
- Interop with
- LangGraph, Temporal, Inngest, Claude Agent SDK
Frequently asked questions
What goes in a checkpoint?
At minimum: plan state, last completed step index, accumulated messages, tool-call results, and scratchpad memory. Some systems also snapshot sandbox filesystem state for true resume-anywhere semantics.
Why not just re-run from scratch?
Re-running wastes tokens, is non-deterministic, and can't tolerate human mid-run input. Checkpointing is what makes hours-long agent runs practical.
Is durable execution (Temporal, Inngest) overkill?
For a toy demo, yes. For any agent that does work the user would mind losing — emails sent, rows inserted, money moved — durable execution is the right substrate and LangGraph-style in-process checkpointing is the lighter cousin.
Sources
- LangGraph — persistence — accessed 2026-04-20
- Temporal for AI agents — accessed 2026-04-20