Agent State and Checkpointing

A long-running agent that crashes halfway through a task and starts over has cost the user real money. The production pattern is explicit state: persist plan, step history, intermediate tool outputs, and memory at every step; resume from the latest checkpoint on failure; and expose checkpoints to users for inspection, branching, and time-travel. LangGraph, Claude Agent SDK, Temporal, and Inngest all provide primitives for this.
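The persist-then-resume loop described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the API of LangGraph, Temporal, Inngest, or the Claude Agent SDK: the names `save_checkpoint`, `load_checkpoint`, and `run_agent`, the JSON file layout, and the `execute_step` callback are all hypothetical.

```python
import json
import os
from pathlib import Path
from typing import Callable


def save_checkpoint(path: Path, state: dict) -> None:
    # Write to a temporary file first, then swap it into place, so a crash
    # mid-write never leaves a half-written checkpoint behind.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def load_checkpoint(path: Path, plan: list[str]) -> dict:
    # Resume from the latest checkpoint if one exists, else start fresh.
    if path.exists():
        return json.loads(path.read_text())
    return {"plan": plan, "next_step": 0, "history": []}


def run_agent(path: Path, plan: list[str],
              execute_step: Callable[[str], str]) -> dict:
    state = load_checkpoint(path, plan)
    while state["next_step"] < len(state["plan"]):
        step = state["plan"][state["next_step"]]
        output = execute_step(step)  # may raise; state on disk is untouched
        state["history"].append({"step": step, "output": output})
        state["next_step"] += 1
        save_checkpoint(path, state)  # persist after every completed step
    return state
```

If `execute_step` raises partway through, calling `run_agent` again with the same path picks up at the first step that had not completed, and the steps already recorded in `history` are not executed again.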

Protocol facts

Sponsor: open community
Status: stable
Interop with: LangGraph, Temporal, Inngest, Claude Agent SDK

Frequently asked questions

What goes in a checkpoint?

At minimum: plan state, last completed step index, accumulated messages, tool-call results, and scratchpad memory. Some systems also snapshot sandbox filesystem state for true resume-anywhere semantics.
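As a rough illustration, the minimum fields listed above could be modeled as a small serializable record. The class name `Checkpoint`, its field names, and the choice of JSON are assumptions made for this sketch; real frameworks define their own schemas, and sandbox filesystem snapshots are left out here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class Checkpoint:
    plan: list[str]                       # plan state
    last_completed_step: int = -1         # -1 means nothing has finished yet
    messages: list[dict[str, Any]] = field(default_factory=list)
    # Tool-call results keyed by step index as a string, because JSON
    # object keys are always strings.
    tool_results: dict[str, Any] = field(default_factory=dict)
    scratchpad: dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "Checkpoint":
        return cls(**json.loads(raw))
```

Keeping every field JSON-serializable means the same record can be written to a file, a database row, or an object store without changing the agent code.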

Why not just re-run from scratch?

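Re-running from scratch pays again for every step that already completed, may take a different path the second time because model outputs are non-deterministic, and discards any input a human supplied mid-run. Checkpointing is what makes hours-long agent runs practical.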

Is durable execution (Temporal, Inngest) overkill?

For a toy demo, yes. For any agent that does work the user would mind losing (emails sent, rows inserted, money moved), durable execution is the right substrate. LangGraph-style in-process checkpointing is the lighter cousin.
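One reason side effects raise the stakes is that a resumed run must not perform them a second time. The sketch below shows the general idea of recording each effect's result under a key and returning the recorded result on replay. It is an in-memory illustration only: `EffectLog`, `run_once`, and the key format are hypothetical names, not the Temporal or Inngest API, and a real system would persist the log durably and handle a crash between performing the effect and recording it.

```python
from typing import Any, Callable


class EffectLog:
    """Remembers the result of each side effect by key (in memory only)."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    def run_once(self, key: str, effect: Callable[[], Any]) -> Any:
        # On replay, return the recorded result instead of repeating the effect.
        if key in self._results:
            return self._results[key]
        result = effect()
        self._results[key] = result
        return result
```

A call such as `log.run_once("email:welcome:user-42", send_welcome_email)`, where `send_welcome_email` is a placeholder for the real action, would send the email on the first execution and skip it if the same step is replayed after a resume.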

Sources

  1. LangGraph — persistence — accessed 2026-04-20
  2. Temporal for AI agents — accessed 2026-04-20