Curiosity · Concept
Guardrails (LLM Safety Layers)
A raw LLM is willing to do almost anything its prompt tells it to, which is a bad property for production. Guardrails sit on both sides of the model: input guardrails screen for prompt injection, jailbreaks, PII, or off-topic requests, and output guardrails check generated text against schema, policy, toxicity, and factuality constraints. Implementations range from simple regex and JSON-schema validators to dedicated models (NVIDIA NeMo Guardrails, Guardrails AI, Llama Guard, OpenAI Moderation). Defense in depth matters — one filter is never enough.
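The two-sided layering described above can be sketched in a few lines. This is an illustrative toy, not a production filter: the function names, the regex patterns, and the banned-term list are all hypothetical stand-ins for the dedicated classifiers (Llama Guard, NeMo Guardrails, etc.) a real deployment would call.

```python
import re

# Hypothetical injection patterns; real systems use trained classifiers,
# not a short regex list. Shown only to make the pipeline shape concrete.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |previous )?instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
]

def input_guardrail(prompt: str) -> bool:
    """Return True if the prompt passes the input screen."""
    return not any(p.search(prompt) for p in INJECTION_PATTERNS)

def output_guardrail(text: str, banned_terms=("password", "ssn")) -> bool:
    """Return True if the generated text passes the output screen."""
    lowered = text.lower()
    return not any(term in lowered for term in banned_terms)

def guarded_generate(prompt: str, model) -> str:
    """Wrap a model callable with checks on both sides."""
    if not input_guardrail(prompt):
        return "Request blocked by input guardrail."
    response = model(prompt)
    if not output_guardrail(response):
        return "Response blocked by output guardrail."
    return response
```

Note that the output check runs even when the input check passed: that is the defense-in-depth point, since an injection that slips past the input screen can still be caught before its payload reaches the user.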
Quick reference
- Proficiency
- Intermediate
- Also known as
- LLM safety layers, content safety filters
- Prerequisites
- prompt injection, structured output
Frequently asked questions
What are LLM guardrails?
Guardrails are input and output validation layers around an LLM that block unsafe, off-topic, or malformed generations. Input guardrails screen the user prompt; output guardrails screen the model's response before it reaches users or downstream systems.
What should guardrails actually check?
Prompt injection and jailbreak attempts, PII in input or output, topical drift (staying on the assistant's scope), schema compliance for structured responses, toxicity and policy violations, and factual grounding against retrieval sources for RAG.
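Two of the checks listed above, schema compliance and PII screening, can be done with the standard library alone. A minimal sketch, with hypothetical function names and a deliberately simple email regex standing in for a fuller PII detector:

```python
import json
import re

# Simplistic email pattern; real PII detection covers phone numbers,
# addresses, national IDs, and uses NER models, not one regex.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def check_schema(raw: str, required: dict) -> list:
    """Validate that `raw` is JSON carrying the required keys and types.
    Returns a list of violations; an empty list means the check passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    problems = []
    for key, typ in required.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], typ):
            problems.append(f"wrong type for {key}")
    return problems

def check_pii(text: str) -> list:
    """Flag email addresses appearing in input or output text."""
    return [f"email found: {m}" for m in EMAIL_RE.findall(text)]
```

Returning a list of violations rather than a boolean is a common design choice: it lets the caller log every failure, pick a severity, and decide between blocking and retrying with a repair prompt.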
Guardrails vs RLHF/constitutional alignment?
RLHF and constitutional training shape the model's defaults at training time. Guardrails are runtime defenses that catch what slips through. You want both — training alone won't cover every policy, and runtime checks alone can't fix a model that happily outputs harmful content.
How do I avoid over-blocking?
Measure the false-positive rate on realistic traffic, distinguish soft-fail (the model answers with a warning) from hard-fail (the response is blocked outright), tune thresholds per severity, and route edge cases to human review. Guardrails that refuse too aggressively destroy the product experience.
Sources
- NVIDIA — NeMo Guardrails documentation — accessed 2026-04-20
- Meta — Llama Guard 3 — accessed 2026-04-20