Creativity · Agent Protocol

Agent State and Checkpointing

A long-running agent that crashes halfway through a task and starts over has cost the user real money. The production pattern is explicit state: persist plan, step history, intermediate tool outputs, and memory at every step; resume from the latest checkpoint on failure; and expose checkpoints to users for inspection, branching, and time-travel. LangGraph, Claude Agent SDK, Temporal, and Inngest all provide primitives for this.

Protocol facts

Sponsor: open community
Status: stable
Interop with: LangGraph, Temporal, Inngest, Claude Agent SDK

Frequently asked questions

What goes in a checkpoint?

At minimum: plan state, last completed step index, accumulated messages, tool-call results, and scratchpad memory. Some systems also snapshot sandbox filesystem state for true resume-anywhere semantics.

Why not just re-run from scratch?

Re-running wastes tokens, is non-deterministic, and can't tolerate human mid-run input. Checkpointing is what makes hours-long agent runs practical.

Is durable execution (Temporal, Inngest) overkill?

For a toy demo, yes. For any agent that does work the user would mind losing — emails sent, rows inserted, money moved — durable execution is the right substrate and LangGraph-style in-process checkpointing is the lighter cousin.

Sources

LangGraph — persistence — accessed 2026-04-20
Temporal for AI agents — accessed 2026-04-20

Protocol facts

Frequently asked questions

Sources

Related