Closed API vs Self-Hosted LLM
The choice between a closed-API LLM (GPT-5, Claude Opus 4.7, Gemini 2.5 Pro) and a self-hosted open-weight model (Llama 3.3, Qwen 3, DeepSeek V3) is a trade-off for a specific workload, not an ideological stance. Closed APIs give frontier quality with zero ops; self-hosting gives data control, predictable cost at scale, and deep customisation. Serious teams use both: frontier models for the hard calls, open-weight models for volume.
Side-by-side
| Criterion | Closed API LLM | Self-Hosted LLM |
|---|---|---|
| Peak capability (as of 2026-04) | Highest — frontier labs | Close on many tasks, behind on frontier reasoning |
| Ops burden | Zero — call an API | GPU provisioning, inference stack, monitoring |
| Data control | Data leaves your environment (enterprise tiers mitigate) | Full — data never leaves |
| Cost at low volume | Cheapest | Expensive — GPU minimums |
| Cost at high volume | Per-token cost adds up | Predictable, can be much cheaper per request |
| Customisation (fine-tune, LoRA) | Limited to provider offerings | Full access to weights |
| Latency control | Subject to provider | Fully in your control (co-locate with app) |
| Compliance / sovereignty | Depends on provider region / SOC2 / DPAs | Fully under your control |
Verdict
Choose closed APIs when you need frontier-level quality, your volume is moderate, and you can accept standard enterprise data terms. Choose self-hosted when data sovereignty is non-negotiable, your volume is high enough to amortise GPU costs, or you need to fine-tune on proprietary data. The most resilient production stacks route per request: closed-API for high-value or hard reasoning, self-hosted for bulk and privacy-sensitive paths. Framing this as a binary usually costs you either quality or money.
When to choose each
Choose Closed API LLM if…
- You need frontier reasoning or coding quality.
- Your volume is low-to-medium and per-token pricing is fine.
- You don't have GPU ops expertise.
- Enterprise contracts with your provider satisfy compliance.
Choose Self-Hosted LLM if…
- Data cannot leave your environment (regulated industry, sovereignty).
- Your volume justifies a dedicated inference stack.
- You want to fine-tune or customise deeply.
- You need latency co-located with your app.
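The two checklists above can be folded into one decision sketch. The criteria come directly from the bullets; the order of the checks (hard constraints first) is an assumption about how most teams weigh them, and the function name is illustrative:

```python
def recommend(data_must_stay: bool,
              needs_frontier_quality: bool,
              high_volume: bool,
              needs_deep_finetune: bool,
              has_gpu_ops_team: bool) -> str:
    """Fold the two checklists into one call, hard constraints first."""
    if data_must_stay:
        return "self-hosted"        # sovereignty overrides everything else
    if needs_deep_finetune:
        return "self-hosted"        # deep customisation needs weight access
    if needs_frontier_quality and not high_volume:
        return "closed-api"         # pay per token for peak quality
    if high_volume and has_gpu_ops_team:
        return "self-hosted"        # volume amortises the GPU stack
    return "closed-api"             # default: the zero-ops path
```

A regulated-industry workload (`data_must_stay=True`) lands on self-hosted no matter what else is true, which mirrors the "non-negotiable" wording in the verdict.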
Frequently asked questions
At what volume does self-hosting become cheaper?
This is highly workload-dependent, but self-hosting an open-weight 70B model typically starts to beat mid-tier closed APIs somewhere around 50-200M tokens per day, provided you keep the GPUs well utilised. For smaller models the break-even volume is lower.
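The break-even arithmetic is simple enough to sketch. All prices below are illustrative assumptions, not quotes from any provider or cloud; plug in your own rates:

```python
# Rough break-even: metered closed-API cost vs a fixed-cost GPU pool.
# Prices are placeholder assumptions for illustration only.

def api_cost_per_day(tokens_per_day: float, usd_per_million_tokens: float) -> float:
    """Daily spend on a metered closed API."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens

def self_host_cost_per_day(gpus: int, usd_per_gpu_hour: float) -> float:
    """Daily spend on a dedicated GPU pool (bills 24h regardless of load)."""
    return gpus * usd_per_gpu_hour * 24

# Assumed: 100M tokens/day, $5.00 per million tokens blended API price,
# 8 GPUs at $2.50/GPU-hour serving a 70B model.
api = api_cost_per_day(100_000_000, 5.00)       # 500.0 USD/day
hosted = self_host_cost_per_day(8, 2.50)        # 480.0 USD/day

# Volume at which the fixed pool pays for itself at that API price:
break_even_tokens = self_host_cost_per_day(8, 2.50) / 5.00 * 1_000_000
# -> 96M tokens/day, inside the 50-200M range quoted above
```

Note the asymmetry: the GPU pool costs the same at 10% utilisation as at 90%, which is why the "if you keep the GPUs well utilised" caveat carries most of the weight.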
Can I get privacy from closed APIs?
Enterprise tiers from OpenAI, Anthropic, and Google offer no-training commitments, limited data retention, and private deployment options (Azure OpenAI, Bedrock, Vertex AI). For most commercial use this is sufficient; for classified or strict data-residency workloads, self-hosting is the only option.
What's the fastest path to hybrid?
Put a model router behind a single OpenAI-compatible client and point it at both the closed provider and a self-hosted endpoint (vLLM, SGLang, or a hosted option like Together or Fireworks). Route by request type or confidence.
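A minimal version of that router is just a lookup from request traits to an endpoint. vLLM's server speaks the OpenAI chat API, so both routes use the same client shape; the URLs and model names below are placeholders, not real deployments:

```python
# Minimal per-request router: hard/high-value requests go to a closed API,
# bulk or privacy-sensitive requests stay on a self-hosted OpenAI-compatible
# endpoint (e.g. a vLLM server). URLs and model names are placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    base_url: str
    model: str

CLOSED = Route("https://api.openai.com/v1", "frontier-model")       # placeholder
SELF_HOSTED = Route("http://localhost:8000/v1", "llama-3.3-70b")    # vLLM default port

def pick_route(task: str, contains_pii: bool) -> Route:
    """Route by request type: privacy-sensitive and bulk work stays in-house,
    hard reasoning goes to the frontier API."""
    if contains_pii:
        return SELF_HOSTED          # data must not leave the environment
    if task in {"hard_reasoning", "complex_codegen"}:
        return CLOSED               # pay per token for peak quality
    return SELF_HOSTED              # default: cheap bulk path

# With an OpenAI-compatible SDK, either route is just a different client:
#   client = openai.OpenAI(base_url=route.base_url, api_key=...)
#   client.chat.completions.create(model=route.model, messages=[...])
```

Routing on a confidence score from a cheap first pass (escalate to `CLOSED` when the self-hosted model is unsure) follows the same shape with one extra field.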
Sources
- Anthropic — Enterprise privacy — accessed 2026-04-20
- vLLM — accessed 2026-04-20