Contribution · Application — Software Engineering
AI Incident Response and On-Call Copilot
An on-call page at 3am is one of the hardest cognitive tasks an engineer faces: a stack of alerts, 20 dashboards, a runbook buried in Notion, a Slack channel filling with stakeholders. LLM copilots (PagerDuty AIOps, Rootly AI, Shoreline, Traceroot, custom Claude / GPT-5 deployments) correlate signals, query telemetry, summarize Slack threads, draft status updates, and suggest mitigations. The best designs keep destructive actions (restart, rollback, kill) behind explicit human confirmation: the AI proposes, humans dispose.
Application facts
- Domain: Software Engineering
- Subdomain: Site Reliability Engineering
- Example stack: Claude Sonnet 4.6 for reasoning and status-update drafting · LangGraph agent with tools for Grafana / Prometheus / Loki queries · pgvector RAG over runbooks and postmortem archives · PagerDuty / Opsgenie API for alert context · Slack Bolt SDK for status updates and human confirmation UI
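The "AI proposes, humans dispose" split can be enforced at the tool-dispatch layer. The sketch below is illustrative: the tool names, `run_tool` stub, and `awaiting_confirmation` protocol are assumptions for this example, not a real LangGraph or PagerDuty API.

```python
# Sketch: gate destructive tools behind human confirmation at dispatch time.
# Tool names and the confirmation protocol are illustrative assumptions.

READ_ONLY_TOOLS = {"query_prometheus", "search_runbooks", "fetch_alert_context"}
DESTRUCTIVE_TOOLS = {"restart_service", "rollback_deploy", "kill_pod"}

def run_tool(name: str, args: dict) -> dict:
    # Stub for the real executor (Grafana / PagerDuty / kubectl calls in production).
    return {"status": "executed", "tool": name}

def dispatch_tool_call(name: str, args: dict, confirmed: bool = False) -> dict:
    """Run read-only tools immediately; destructive ones require human sign-off."""
    if name in READ_ONLY_TOOLS:
        return run_tool(name, args)
    if name in DESTRUCTIVE_TOOLS:
        if not confirmed:
            # Surface a confirmation request (e.g. a Slack prompt) instead of acting.
            return {"status": "awaiting_confirmation", "tool": name, "args": args}
        return run_tool(name, args)
    raise ValueError(f"unknown tool: {name}")
```

Keeping the allow-lists explicit, rather than letting the model self-classify its actions, means a prompt-injected "restart the database" still stalls at the gate.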
Data & infrastructure needs
- Observability telemetry — metrics, logs, traces
- Alert history with resolution outcomes
- Runbook library (with version tracking)
- Service dependency graph / topology
- Post-incident review archives for precedent retrieval
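Precedent retrieval over runbooks and postmortems reduces to nearest-neighbor search on embeddings; in production pgvector's cosine-distance operator (`<=>`) does this server-side. A toy in-memory version, with made-up document names and 2-D vectors standing in for real embeddings:

```python
# Toy sketch of runbook retrieval by embedding similarity. In production
# this is a pgvector query; vectors and titles here are illustrative.
import math

def cosine_sim(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list, docs: list, k: int = 2) -> list:
    """docs: list of (title, embedding). Return titles of the k closest runbooks."""
    ranked = sorted(docs, key=lambda d: cosine_sim(query_vec, d[1]), reverse=True)
    return [title for title, _ in ranked[:k]]
```

Storing resolution outcomes alongside each runbook lets the copilot rank precedents by whether the retrieved fix actually worked last time, not just by textual similarity.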
Risks & considerations
- Hallucinated runbook steps worsening active incidents
- Prompt injection via adversarial log entries
- Secrets leakage if telemetry hits third-party APIs
- Automation bias under cognitive load
- Destructive actions if tool-use permissions too broad
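The secrets-leakage risk is usually mitigated by scrubbing telemetry before it crosses the trust boundary to a third-party model API. A minimal sketch; the patterns below are illustrative examples only, and real deployments use dedicated scanners with much larger rulesets:

```python
# Sketch: redact likely secrets from log lines before sending them to a
# third-party API. Patterns are illustrative, not a complete ruleset.
import re

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                 # AWS access key ID
    re.compile(r"(?i)bearer\s+[a-z0-9._-]{20,}"),    # HTTP bearer tokens
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]

def scrub(line: str) -> str:
    """Replace anything matching a secret pattern with a redaction marker."""
    for pat in SECRET_PATTERNS:
        line = pat.sub("[REDACTED]", line)
    return line
```

Running the scrubber in the telemetry-fetching tool itself, rather than trusting the agent to redact, keeps the guarantee independent of model behavior.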
Frequently asked questions
Should AI autonomously fix incidents?
No — especially not for multi-tenant or customer-facing systems. AI can recommend known runbook steps, but execution of destructive or customer-impacting actions should require human confirmation. Google SRE and AWS Well-Architected guidance both stress human-in-the-loop for blast-radius operations.
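In a Slack-based workflow, the confirmation step is typically a Block Kit message with approve/reject buttons. A sketch of building that payload; the action IDs and helper name are assumptions for illustration, not part of the Bolt SDK:

```python
# Sketch: Slack Block Kit payload asking a human to approve or reject a
# proposed mitigation. Action IDs ("approve_mitigation" etc.) are illustrative.
def confirmation_blocks(action_desc: str, incident_id: str) -> list:
    return [
        {"type": "section",
         "text": {"type": "mrkdwn",
                  "text": f"*Proposed mitigation for {incident_id}:*\n{action_desc}"}},
        {"type": "actions",
         "elements": [
             {"type": "button", "action_id": "approve_mitigation",
              "style": "primary",
              "text": {"type": "plain_text", "text": "Approve"}},
             {"type": "button", "action_id": "reject_mitigation",
              "style": "danger",
              "text": {"type": "plain_text", "text": "Reject"}},
         ]},
    ]
```

The button click handler, not the model, is what flips the `confirmed` flag on the pending action, so approval always traces back to a named human.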
Which model works best for incident response?
Claude Sonnet 4.6 and GPT-5 are common; Claude Opus 4.7 (1M context) excels at postmortem drafting from long Slack threads. For cost, Haiku 4.5 and GPT-5-mini handle routine correlation. PagerDuty's native AI uses a blend of smaller models for real-time correlation.
What are the risks?
Hallucinated runbook steps that worsen incidents; prompt injection via attacker-controlled log content; secrets leakage when logs forwarded to third-party APIs contain keys; and automation bias, where a tired on-call engineer trusts a wrong suggestion. Mitigations: ground suggestions in retrieved runbook content only, scan telemetry for secrets before it leaves the trust boundary, and require mandatory confirmation gates on actions.
Sources
- Google SRE Book — Incident Response — accessed 2026-04-20
- PagerDuty — Incident Response Ops Guide — accessed 2026-04-20
- NIST SP 800-61 — Computer Security Incident Handling — accessed 2026-04-20