
Few-Shot Prompting vs Fine-Tuning

Few-shot prompting and fine-tuning both customize an LLM to your task, but they operate at opposite ends of the training/inference spectrum. Few-shot prompting puts a handful of demonstrations (typically 3-10, sometimes up to 20) directly in the prompt and lets the model generalize in context. Fine-tuning updates model weights by training on thousands of examples. The choice shapes your whole pipeline: cost, latency, iteration speed, and what kinds of improvement are possible.
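Mechanically, a few-shot prompt is just demonstrations concatenated ahead of the new input. A minimal sketch, with a hypothetical sentiment task and format (adapt the template to your own task):

```python
# Minimal sketch: assembling a few-shot prompt from demonstrations.
# Task, labels, and layout here are hypothetical placeholders.

def build_few_shot_prompt(examples, query, instruction):
    """Concatenate an instruction, worked demonstrations, and the new query."""
    parts = [instruction, ""]
    for inp, out in examples:
        parts.append(f"Input: {inp}")
        parts.append(f"Output: {out}")
        parts.append("")
    # The model completes the final, unanswered "Output:" line.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)

examples = [
    ("The delivery was late again", "negative"),
    ("Support resolved my issue in minutes", "positive"),
]
prompt = build_few_shot_prompt(
    examples,
    "Great product, clunky app",
    "Classify the sentiment of each input as positive or negative.",
)
print(prompt)
```

The whole string is sent as the prompt; nothing about the model changes between requests, which is exactly why iteration is a matter of editing text and rerunning.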

Side-by-side

| Criterion | Few-Shot Prompting | Fine-Tuning |
|---|---|---|
| Setup time | Minutes — write prompts | Hours to days — data + training + eval |
| Training data required | ~5-20 examples | ~500-10,000+ high-quality examples |
| Per-request cost | Higher — examples sit in every prompt | Lower per request (after training) |
| Iteration speed | Minutes — edit prompt and rerun | Hours — retrain, re-evaluate, redeploy |
| Works with closed API models | Yes — any model | Only where the vendor allows it (OpenAI, Bedrock, Vertex) |
| Latency impact | Slight — longer prompt means longer time to first token | None vs. the base model |
| Style / tone conformance | Good with careful examples | Excellent — deeper imprint on style |
| Handles distribution shift | Edit examples and rerun | Retrain on new data |
| Best fit | Most tasks, especially with frontier models | High-volume production, specific style/format, narrow domain |
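The per-request cost row is the one that flips decisions at scale, and it reduces to simple break-even arithmetic. A sketch with entirely illustrative numbers (token counts, prices, and training cost are placeholders, not any vendor's rates):

```python
# Illustrative break-even arithmetic: few-shot carries example tokens in
# every request; fine-tuning pays a one-time cost but drops that overhead.

def breakeven_requests(example_tokens, price_per_mtok, training_cost):
    """Number of requests at which the one-time training cost equals the
    cumulative cost of shipping few-shot examples in every prompt."""
    per_request_overhead = example_tokens / 1_000_000 * price_per_mtok
    return training_cost / per_request_overhead

# e.g. a 1,500-token example block, $3 per million input tokens,
# $500 total for data prep + training:
n = breakeven_requests(1_500, 3.00, 500.0)
print(round(n))  # 111111
```

Below the break-even point, the flexibility of few-shot is effectively free; above it, fine-tuning starts paying for itself on every call.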

Verdict

In 2026, few-shot prompting should be the default — frontier models (Claude Opus 4.7, GPT-5, Gemini 2.5 Pro) are so strong at in-context learning that fine-tuning rarely beats a well-crafted few-shot prompt. The exceptions are high-volume production, where per-request cost dominates, and narrow style/format tasks, where you need the model to reliably produce a specific output pattern. Fine-tuning also shines for smaller open-weights models — fine-tuning a 7B model on your task can beat few-shot prompting a frontier model for cost-sensitive workloads. Rule of thumb: start with few-shot, measure, and fine-tune only when few-shot plateaus and volume justifies it.

When to choose each

Choose Few-Shot Prompting if…

  • You're early in development and need to iterate fast.
  • Your task benefits from frontier-model reasoning.
  • You have fewer than a few hundred examples available.
  • You don't want to manage training data and ops.

Choose Fine-Tuning if…

  • You have 1000+ labeled examples and a stable task definition.
  • High request volume makes per-request cost matter — omitting in-prompt examples saves tokens on every call.
  • You need the model to absolutely conform to a specific format/style.
  • You're working with a small open-weights model that benefits from task adaptation.
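If you go the fine-tuning route, most of the work is curating training data. A sketch of writing it out in the chat-style JSONL format several fine-tuning APIs accept — the field names below follow OpenAI's published chat format, but check your vendor's documentation, and the extraction task itself is hypothetical:

```python
import json

# Sketch: serializing fine-tuning examples as chat-style JSONL.
# Each line is one training example: system + user + assistant turns.

records = [
    {"messages": [
        {"role": "system", "content": "Extract the invoice total as JSON."},
        {"role": "user", "content": "Invoice #1042 ... Total due: $318.40"},
        {"role": "assistant", "content": '{"total": 318.40}'},
    ]},
    {"messages": [
        {"role": "system", "content": "Extract the invoice total as JSON."},
        {"role": "user", "content": "Amount payable: EUR 92.00 (net)"},
        {"role": "assistant", "content": '{"total": 92.00}'},
    ]},
]

with open("train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

Note how every assistant turn demonstrates the exact output format you want imprinted — this is where fine-tuning's "deeper imprint on style" comes from.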

Frequently asked questions

Isn't RAG better than both?

RAG solves a different problem — it provides external knowledge. Few-shot and fine-tuning shape how the model behaves. In production you often use all three: RAG for facts, few-shot for task structure, fine-tuning for format conformance.

How many examples for few-shot?

Typically 3-10. More is sometimes better but hits diminishing returns quickly with strong base models. What matters more than count is diversity — cover edge cases and failure modes, not just canonical cases.
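That diversity-first selection can be sketched as a greedy pick over a candidate pool — the `category` tags below are hypothetical annotations you would maintain on your own examples:

```python
# Sketch: trim a candidate pool to k few-shot examples, preferring
# coverage of distinct categories over near-duplicates.

def pick_diverse(candidates, k):
    """Greedily take one example per unseen category, then fill up to k."""
    chosen, seen = [], set()
    for ex in candidates:
        if ex["category"] not in seen:
            chosen.append(ex)
            seen.add(ex["category"])
        if len(chosen) == k:
            return chosen
    for ex in candidates:  # pool smaller than k categories: top up
        if len(chosen) == k:
            break
        if ex not in chosen:
            chosen.append(ex)
    return chosen

pool = [
    {"text": "refund request", "category": "billing"},
    {"text": "invoice question", "category": "billing"},
    {"text": "app crashes on login", "category": "bug"},
    {"text": "feature idea", "category": "request"},
]
print([e["category"] for e in pick_diverse(pool, 3)])  # ['billing', 'bug', 'request']
```

Even this crude heuristic beats hand-picking three near-identical canonical examples.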

Does prompt caching change the economics?

Yes. If your few-shot examples are stable, prompt caching (Anthropic, Google, OpenAI) dramatically reduces per-request cost — sometimes 90% off on the cached portion. This tips many decisions toward few-shot over fine-tuning.
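The caching effect on per-request cost is easy to model. A sketch with illustrative numbers — the 90% discount, token counts, and price are placeholders, since vendors price cache reads differently:

```python
# Illustrative arithmetic: cost of a request when the stable few-shot
# example block is served from the prompt cache at a discount.

def request_cost(cached_tokens, fresh_tokens, price_per_mtok, cache_discount):
    """Total input cost: discounted cached prefix + full-price fresh query."""
    cached = cached_tokens / 1_000_000 * price_per_mtok * (1 - cache_discount)
    fresh = fresh_tokens / 1_000_000 * price_per_mtok
    return cached + fresh

# 1,500-token cached example block, 200-token fresh query,
# $3 per million input tokens, 90% discount on cache reads:
with_cache = request_cost(1_500, 200, 3.00, 0.90)
without = request_cost(1_500, 200, 3.00, 0.00)
print(f"${with_cache:.6f} vs ${without:.6f}")  # $0.001050 vs $0.005100
```

With the cache, the example block costs roughly as much as the query itself — which is why caching moves the break-even point for fine-tuning far out.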
