Curiosity · AI Model
Prompt Guard 2
Prompt Guard 2 is Meta's updated prompt-injection classifier, released alongside the Llama 4 Guard suite. It is a small (tens-of-millions of parameters) encoder model that flags user inputs containing jailbreak patterns or indirect injections embedded in retrieved documents, before they reach your main LLM.
Model specs
- Vendor
- Meta
- Family
- Llama Guard
- Released
- 2025-04
- Context window
- 512 tokens
- Modalities
- text
Strengths
- Very small and fast — cheap to run on every request
- Open weights allow inspection and fine-tuning
- Improved coverage of indirect/data-borne injections over v1
Limitations
- Cannot catch novel attacks not represented in training data
- English-biased — lower performance on some languages
- Is a classifier, not a replacement for system-prompt hardening
Use cases
- Pre-filtering untrusted inputs in RAG and agent pipelines
- Flagging indirect injections in retrieved web content
- Safety telemetry in enterprise LLM platforms
- Education on prompt-injection defence
Benchmarks
| Benchmark | Score | As of |
|---|---|---|
| Prompt-injection detection F1 | high across multiple public benchmarks | 2025-04 |
Frequently asked questions
What is Prompt Guard 2?
Prompt Guard 2 is Meta's open-weights classifier that detects prompt-injection and jailbreak patterns in text inputs, designed to run as a lightweight safety sidecar in front of any LLM.
Does Prompt Guard 2 replace a system prompt?
No — it complements system-prompt hardening. Prompt Guard 2 flags suspicious inputs, but robust LLM deployments still need strict system prompts, tool-use scoping, and output filtering.
How is Prompt Guard 2 different from Llama Guard?
Llama Guard classifies content against a broad hazards taxonomy. Prompt Guard 2 specialises in detecting injection and jailbreak attempts in the user or document side of the input.
Sources
- Hugging Face — meta-llama/Prompt-Guard-2 — accessed 2026-04-20
- Meta — Llama Guard suite — accessed 2026-04-20