Curiosity · Concept
Prompt Caching
Prompt caching saves the attention state (KV cache) computed from a long, stable prefix — system prompt, tool definitions, few-shot examples, a large document — and reuses it across requests. When the same prefix appears again, the model skips prefill for those tokens. Anthropic, OpenAI, and Google all offer this, typically cutting the cost of cached input tokens by roughly 2-10x depending on provider and time-to-first-token by 30-80% on long prompts.
Quick reference
- Proficiency: Beginner
- Also known as: prefix caching, KV cache reuse, context caching
- Prerequisites: LLM API basics, Context window
Frequently asked questions
What is prompt caching?
A feature where the provider stores the KV-cache of a stable prompt prefix and reuses it across requests. Your next call with the same prefix skips the prefill computation for those tokens and is billed at a steep discount, with lower latency.
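The mechanics can be sketched in a few lines. This is a toy model, not a real inference engine: the "KV state" is a placeholder string, where a real server would store per-layer key/value tensors for each cached token.

```python
# Toy sketch of prefix caching. A cache maps token-prefixes to their
# computed "KV state"; prefill only runs over the uncached suffix.

_cache: dict[tuple, str] = {}

def prefill(tokens: list[str]) -> tuple[str, int]:
    """Return (kv_state, tokens_actually_computed), reusing any cached prefix."""
    # Find the longest cached prefix of `tokens`.
    best = 0
    for n in range(len(tokens), 0, -1):
        if tuple(tokens[:n]) in _cache:
            best = n
            break
    state = _cache.get(tuple(tokens[:best]), "")
    # "Compute" attention state only for the uncached suffix.
    for i in range(best, len(tokens)):
        state += tokens[i]  # stand-in for running the transformer layers
        _cache[tuple(tokens[:i + 1])] = state
    return state, len(tokens) - best

_, cold = prefill(["<sys>", "doc1", "doc2", "user_a"])   # all 4 tokens prefilled
_, warm = prefill(["<sys>", "doc1", "doc2", "user_b"])   # only the last token
```

The second request shares a three-token prefix with the first, so only its final token is prefilled — this is the computation (and billing) the provider skips.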
When should I use prompt caching?
Whenever you repeatedly send the same long prefix — a big system prompt, tool definitions, a knowledge-base document, codebase context, or long few-shot examples — followed by varying user messages. It's a near-free win for chatbots with rich system prompts or RAG-style workflows that reuse context.
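A minimal sketch of how such a request might be structured, assuming Anthropic's Messages API shape (system content blocks with a `cache_control` breakpoint); the model name and prompt strings are placeholders, and no request is actually sent:

```python
# Stable content (system prompt, knowledge doc) goes first and is marked
# cacheable; the varying user message goes last so it never touches the prefix.

SYSTEM_PROMPT = "You are a support agent for Acme."  # stable across requests
KNOWLEDGE_DOC = "<large product manual here>"        # stable, large

def build_request(user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # example model name
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": SYSTEM_PROMPT},
            # Breakpoint: everything up to and including this block is cached.
            {"type": "text", "text": KNOWLEDGE_DOC,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Varying content last: it never invalidates the cached prefix.
        "messages": [{"role": "user", "content": user_message}],
    }

a = build_request("How do I reset my password?")
b = build_request("What are your hours?")
```

Because the `system` blocks are byte-identical across calls, the second request reads the prefix from cache and only the short user turn is prefilled at full price.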
How is it priced?
Anthropic charges ~1.25x normal input for the first write and ~0.1x for reads. OpenAI discounts cached input tokens by about 50%. Google's Gemini has a separate context caching API with hourly storage + cheap reads. Exact numbers change — check provider docs, but the economics favor heavy reuse.
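A back-of-envelope calculation using the Anthropic-style multipliers quoted above (1.25x for the first cache write, 0.1x for reads); the token counts and $/Mtok rate are illustrative:

```python
# Input-token cost for N requests sharing a cached prefix vs. no caching.

def input_cost(prefix_tokens, suffix_tokens, requests, rate_per_mtok,
               write_mult=1.25, read_mult=0.10):
    """Total input cost with prompt caching: one write, then cheap reads."""
    per_tok = rate_per_mtok / 1_000_000
    first = (prefix_tokens * write_mult + suffix_tokens) * per_tok
    rest = (requests - 1) * (prefix_tokens * read_mult + suffix_tokens) * per_tok
    return first + rest

def input_cost_uncached(prefix_tokens, suffix_tokens, requests, rate_per_mtok):
    per_tok = rate_per_mtok / 1_000_000
    return requests * (prefix_tokens + suffix_tokens) * per_tok

# 50k-token prefix, 200-token user turns, 100 requests, $3 / Mtok input.
cached = input_cost(50_000, 200, 100, 3.0)        # ≈ $1.73
uncached = input_cost_uncached(50_000, 200, 100, 3.0)  # ≈ $15.06
```

With heavy reuse the write premium amortizes almost immediately; here caching cuts input spend by roughly 8-9x.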
What invalidates a cache?
Any change to the prefix — even a single-token edit — invalidates everything after the change, because subsequent KV values depend on the prefix. Put static content first (system prompt, docs, tools) and varying content (user messages) last to maximize hit rate.
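The reusable span is simply the longest common prefix between the new prompt and the cached one, which a toy example makes concrete:

```python
# Why a one-token edit matters: only tokens before the first difference
# can be reused, because each KV entry depends on everything before it.

def cached_hit_length(cached: list[str], new: list[str]) -> int:
    """Number of leading tokens shared between the cached and new prompt."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

cached = ["sys", "toolA", "toolB", "doc", "user: hi"]
# Varying content last: only the final token differs -> 4 of 5 reused.
tail_edit = cached_hit_length(cached, ["sys", "toolA", "toolB", "doc", "user: bye"])
# Editing the system prompt (token 0) invalidates everything after it.
head_edit = cached_hit_length(cached, ["SYS v2", "toolA", "toolB", "doc", "user: hi"])
```

This is why the ordering advice above matters: an edit at position 0 drops the hit length to zero, while the same-sized edit at the tail preserves nearly the whole cache.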
Sources
- Anthropic — Prompt caching — accessed 2026-04-20
- OpenAI — Prompt caching — accessed 2026-04-20
- Google — Context caching (Gemini API) — accessed 2026-04-20