Contribution · Application — Software Engineering

AI Incident Response and On-Call Copilot

An on-call page at 3am is the hardest cognitive task an engineer faces: stack of alerts, 20 dashboards, runbook buried in Notion, Slack filling with stakeholders. LLM copilots (PagerDuty AIOps, Rootly AI, Shoreline, Traceroot, custom Claude / GPT-5 deployments) correlate signals, query telemetry, summarize Slack threads, draft status updates, and suggest mitigations. Best designs keep destructive actions (restart, rollback, kill) behind explicit human confirmation — AI proposes, humans dispose.

Application facts

Domain
Software Engineering
Subdomain
Site Reliability Engineering
Example stack
Claude Sonnet 4.6 for reasoning and status-update drafting · LangGraph agent with tools for Grafana / Prometheus / Loki queries · pgvector RAG over runbooks and postmortem archives · PagerDuty / Opsgenie API for alert context · Slack bolt SDK for status updates and human confirmation UI

Data & infrastructure needs

  • Observability telemetry — metrics, logs, traces
  • Alert history with resolution outcomes
  • Runbook library (with version tracking)
  • Service dependency graph / topology
  • Post-incident review archives for precedent retrieval

Risks & considerations

  • Hallucinated runbook steps worsening active incidents
  • Prompt injection via adversarial log entries
  • Secrets leakage if telemetry hits third-party APIs
  • Automation bias under cognitive load
  • Destructive actions if tool-use permissions too broad

Frequently asked questions

Should AI autonomously fix incidents?

No — especially not for multi-tenant or customer-facing systems. AI can recommend known runbook steps, but execution of destructive or customer-impacting actions should require human confirmation. Google SRE and AWS Well-Architected guidance both stress human-in-the-loop for blast-radius operations.

Which model works best for incident response?

Claude Sonnet 4.6 and GPT-5 are common; Claude Opus 4.7 (1M context) excels at postmortem drafting from long Slack threads. For cost, Haiku 4.5 and GPT-5-mini handle routine correlation. PagerDuty's native AI uses a blend of smaller models for real-time correlation.

What are the risks?

Hallucinated runbook steps that worsen incidents, prompt injection via log content from attackers, secrets leakage if logs sent to third-party APIs contain keys, and automation bias — tired on-calls trusting wrong suggestions. Mitigation: runbook RAG only, secrets scanning on telemetry, mandatory confirmation gates.

Sources

  1. Google SRE Book — Incident Response — accessed 2026-04-20
  2. PagerDuty — Incident Response Ops Guide — accessed 2026-04-20
  3. NIST SP 800-61 — Computer Security Incident Handling — accessed 2026-04-20