Curiosity · AI Model

Prompt Guard 2

Prompt Guard 2 is Meta's updated prompt-injection classifier, released alongside the Llama 4 Guard suite. It is a small (tens-of-millions of parameters) encoder model that flags user inputs containing jailbreak patterns or indirect injections embedded in retrieved documents, before they reach your main LLM.

Model specs

Vendor
Meta
Family
Llama Guard
Released
2025-04
Context window
512 tokens
Modalities
text

Strengths

  • Very small and fast — cheap to run on every request
  • Open weights allow inspection and fine-tuning
  • Improved coverage of indirect/data-borne injections over v1

Limitations

  • Cannot catch novel attacks not represented in training data
  • English-biased — lower performance on some languages
  • Is a classifier, not a replacement for system-prompt hardening

Use cases

  • Pre-filtering untrusted inputs in RAG and agent pipelines
  • Flagging indirect injections in retrieved web content
  • Safety telemetry in enterprise LLM platforms
  • Education on prompt-injection defence

Benchmarks

BenchmarkScoreAs of
Prompt-injection detection F1high across multiple public benchmarks2025-04

Frequently asked questions

What is Prompt Guard 2?

Prompt Guard 2 is Meta's open-weights classifier that detects prompt-injection and jailbreak patterns in text inputs, designed to run as a lightweight safety sidecar in front of any LLM.

Does Prompt Guard 2 replace a system prompt?

No — it complements system-prompt hardening. Prompt Guard 2 flags suspicious inputs, but robust LLM deployments still need strict system prompts, tool-use scoping, and output filtering.

How is Prompt Guard 2 different from Llama Guard?

Llama Guard classifies content against a broad hazards taxonomy. Prompt Guard 2 specialises in detecting injection and jailbreak attempts in the user or document side of the input.

Sources

  1. Hugging Face — meta-llama/Prompt-Guard-2 — accessed 2026-04-20
  2. Meta — Llama Guard suite — accessed 2026-04-20