Curiosity · AI Model

Prompt Guard 2

Prompt Guard 2 is Meta's updated prompt-injection classifier, released alongside the Llama 4 Guard suite. It is a small (tens-of-millions of parameters) encoder model that flags user inputs containing jailbreak patterns or indirect injections embedded in retrieved documents, before they reach your main LLM.

Model specs

Vendor: Meta
Family: Llama Guard
Released: 2025-04
Context window: 512 tokens
Modalities: text

Strengths

Very small and fast — cheap to run on every request
Open weights allow inspection and fine-tuning
Improved coverage of indirect/data-borne injections over v1

Limitations

Cannot catch novel attacks not represented in training data
English-biased — lower performance on some languages
Is a classifier, not a replacement for system-prompt hardening

Use cases

Pre-filtering untrusted inputs in RAG and agent pipelines
Flagging indirect injections in retrieved web content
Safety telemetry in enterprise LLM platforms
Education on prompt-injection defence

Benchmarks

Benchmark	Score	As of
Prompt-injection detection F1	high across multiple public benchmarks	2025-04

Frequently asked questions

What is Prompt Guard 2?

Prompt Guard 2 is Meta's open-weights classifier that detects prompt-injection and jailbreak patterns in text inputs, designed to run as a lightweight safety sidecar in front of any LLM.

Does Prompt Guard 2 replace a system prompt?

No — it complements system-prompt hardening. Prompt Guard 2 flags suspicious inputs, but robust LLM deployments still need strict system prompts, tool-use scoping, and output filtering.

How is Prompt Guard 2 different from Llama Guard?

Llama Guard classifies content against a broad hazards taxonomy. Prompt Guard 2 specialises in detecting injection and jailbreak attempts in the user or document side of the input.

Sources

Hugging Face — meta-llama/Prompt-Guard-2 — accessed 2026-04-20
Meta — Llama Guard suite — accessed 2026-04-20