Few-Shot Prompting vs Fine-Tuning
Few-shot prompting and fine-tuning both customize an LLM to your task, but they do it at opposite ends of the training/inference spectrum. Few-shot prompting puts a handful of demonstrations directly in the prompt and lets the model generalize from them in context. Fine-tuning actually updates model weights on thousands of examples. The choice shapes your whole pipeline: cost, latency, iteration speed, and what kinds of improvement are possible.
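The structural difference is easy to see in code. This is a minimal sketch with a hypothetical ticket-triage task (the field names and labels are invented for illustration): few-shot demonstrations travel inside every prompt, while the same examples for fine-tuning become a JSONL training file consumed once at training time.

```python
import json

# Hypothetical task: label support tickets as "bug" or "question".
examples = [
    {"input": "App crashes when I tap export", "label": "bug"},
    {"input": "How do I change my billing email?", "label": "question"},
]

# Few-shot: demonstrations are concatenated into every request's prompt.
def build_few_shot_prompt(examples, query):
    demos = "\n".join(
        f"Ticket: {e['input']}\nLabel: {e['label']}" for e in examples
    )
    return f"{demos}\nTicket: {query}\nLabel:"

# Fine-tuning: the same examples become training records (one JSON object
# per line), used once to update weights and absent from inference prompts.
def build_training_file(examples):
    return "\n".join(
        json.dumps({"messages": [
            {"role": "user", "content": e["input"]},
            {"role": "assistant", "content": e["label"]},
        ]})
        for e in examples
    )

prompt = build_few_shot_prompt(examples, "Search returns no results")
jsonl = build_training_file(examples)
```

Note the asymmetry: editing `examples` changes few-shot behavior on the next request, while changing the training file means another training run.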
Side-by-side
| Criterion | Few-Shot Prompting | Fine-Tuning |
|---|---|---|
| Setup time | Minutes — write prompts | Hours to days — data + training + eval |
| Training data required | ~5-20 examples | ~500-10,000+ high-quality examples |
| Per-request cost | Higher — examples sit in every prompt | Lower per request (after training) |
| Iteration speed | Minutes — edit prompt and rerun | Hours — retrain, reevaluate, redeploy |
| Works with closed API models | Yes — any model | Only where vendor allows (OpenAI, Bedrock, Vertex) |
| Latency impact | Slight — longer prompt means longer time-to-first-token (TTFT) | No extra latency vs base model |
| Style / tone conformance | Good with careful examples | Excellent — deeper imprint on style |
| Handles distribution shift | Edit examples and rerun | Retrain on new data |
| Best fit | Most tasks, especially with frontier models | High-volume production, specific style/format, narrow domain |
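The per-request cost row can be made concrete with a break-even calculation. This sketch uses made-up numbers — every figure below is an assumption, not a quoted price — and ignores any per-token premium a vendor may charge for serving a fine-tuned model:

```python
# How many requests before a one-off fine-tuning cost beats paying for
# few-shot example tokens on every request? All numbers are assumptions.
example_tokens = 1500          # few-shot demos included in every request
price_per_mtok = 3.00          # assumed input price, $ per million tokens
fine_tune_fixed_cost = 500.00  # assumed one-off training + eval cost, $

extra_cost_per_request = example_tokens / 1_000_000 * price_per_mtok
break_even_requests = fine_tune_fixed_cost / extra_cost_per_request
```

Under these assumptions the break-even sits around a hundred thousand requests, which is why the table points few-shot at "most tasks" and fine-tuning at high-volume production.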
Verdict
In 2026, few-shot prompting should be the default — frontier models (Claude Opus 4.7, GPT-5, Gemini 2.5 Pro) are so strong at in-context learning that fine-tuning rarely beats a well-crafted few-shot prompt. The exception is high-volume production where per-request cost dominates, or narrow style/format tasks where the model must reliably produce a specific output pattern. Fine-tuning also shines for smaller open-weights models: fine-tuning a 7B model on your task can beat few-shot prompting a frontier model for cost-sensitive workloads. Rule of thumb: start with few-shot, measure, and fine-tune only when few-shot plateaus and volume justifies it.
When to choose each
Choose Few-Shot Prompting if…
- You're early in development and need to iterate fast.
- Your task benefits from frontier-model reasoning.
- You have fewer than a few hundred examples available.
- You don't want to manage training data and ops.
Choose Fine-Tuning if…
- You have 1000+ labeled examples and a stable task definition.
- Request volume is high enough that per-request cost dominates, and dropping in-prompt examples saves tokens on every call.
- You need the model to absolutely conform to a specific format/style.
- You're working with a small open-weights model that benefits from task adaptation.
Frequently asked questions
Isn't RAG better than both?
RAG solves a different problem — it provides external knowledge. Few-shot and fine-tuning shape how the model behaves. In production you often use all three: RAG for facts, few-shot for task structure, fine-tuning for format conformance.
How many examples for few-shot?
Typically 3-10. More is sometimes better but hits diminishing returns quickly with strong base models. What matters more than count is diversity — cover edge cases and failure modes, not just canonical cases.
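One way to operationalize "diversity over count" is to select examples round-robin across failure-mode categories before taking duplicates of any one category. This is a minimal sketch with hypothetical field names (`category`, `text`), not a prescribed API:

```python
from collections import OrderedDict

def pick_diverse(pool, k):
    """Pick up to k examples, covering every category before repeating one."""
    by_cat = OrderedDict()
    for ex in pool:
        by_cat.setdefault(ex["category"], []).append(ex)
    picked = []
    # Round-robin across categories until we have k examples or run out.
    while len(picked) < k and any(by_cat.values()):
        for items in by_cat.values():
            if items and len(picked) < k:
                picked.append(items.pop(0))
    return picked

pool = [
    {"category": "canonical", "text": "simple case A"},
    {"category": "canonical", "text": "simple case B"},
    {"category": "edge", "text": "empty input"},
    {"category": "failure", "text": "adversarial phrasing"},
]
chosen = pick_diverse(pool, 3)
```

With `k=3` this yields one canonical, one edge, and one failure example rather than three near-duplicates of the canonical case.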
Does prompt caching change the economics?
Yes. If your few-shot examples are stable, prompt caching (Anthropic, Google, OpenAI) dramatically reduces per-request cost — sometimes 90% off on the cached portion. This tips many decisions toward few-shot over fine-tuning.
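The effect on the economics is simple arithmetic. This sketch uses the 90% cached-read discount mentioned above; the token counts and price are assumptions for illustration:

```python
# Per-request cost with and without prompt caching. The stable few-shot
# block is billed at a discount on cache hits; only the query is full price.
cached_tokens = 1500    # stable few-shot examples, served from cache
fresh_tokens = 200      # the actual per-request query
price_per_mtok = 3.00   # assumed input price, $ per million tokens
cache_discount = 0.90   # cached portion billed at 10% of the input price

def request_cost(cached, fresh, price, discount):
    per_tok = price / 1_000_000
    return cached * per_tok * (1 - discount) + fresh * per_tok

with_cache = request_cost(cached_tokens, fresh_tokens, price_per_mtok, cache_discount)
without_cache = request_cost(cached_tokens, fresh_tokens, price_per_mtok, 0.0)
```

Under these assumptions caching cuts the per-request cost to roughly a fifth, which substantially shrinks fine-tuning's main cost advantage.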
Sources
- OpenAI — Prompt engineering guide — accessed 2026-04-20
- Anthropic — Fine-tuning overview — accessed 2026-04-20