Curiosity · Concept
AI Safety Red-Teaming
Red-teaming borrows its name from military war-gaming: a 'red team' attacks while the 'blue team' defends. For frontier LLMs, red-teaming covers jailbreaks, prompt-injection attacks, biased or discriminatory outputs, dangerous-capability probes (bio, cyber, weapons), and agentic misuse. Anthropic, OpenAI, DeepMind, and NIST all run structured red-team programs, and automated red-teaming — where one LLM attacks another — has become an essential scaling mechanism alongside human testers.
Quick reference
- Proficiency
- Intermediate
- Also known as
- adversarial red-teaming, red team
- Prerequisites
- Prompt injection, Guardrails
Frequently asked questions
What is AI red-teaming?
Red-teaming is structured adversarial testing of an AI system. Testers (human, automated, or both) try to provoke harmful, unsafe, deceptive, or policy-violating outputs so the developer can fix weaknesses before public release.
What kinds of attacks do red-teamers try?
Jailbreaks (bypassing safety training), prompt injection via retrieved content or tools, requests for dangerous information (bio, chemical, cyber), bias and fairness probes, agentic misuse (planning harmful actions), and sycophancy / manipulation attacks.
Who red-teams frontier models?
Internal teams at labs like Anthropic, OpenAI, and Google DeepMind; external organisations such as NIST's AISI and the UK AISI; invited subject-matter experts (biologists, security researchers); and automated systems like Anthropic's automated red-teaming for attack generation at scale.
How does red-teaming relate to evaluations and guardrails?
Red-teaming surfaces failure modes. Evaluations measure how often those failures happen and track progress across model versions. Guardrails (classifiers, filters, rules) block known failure modes at runtime. The three are complementary parts of a safety pipeline.
Sources
- Perez et al. — Red Teaming Language Models with Language Models — accessed 2026-04-20
- Ganguli et al. — Red Teaming Language Models to Reduce Harms — accessed 2026-04-20