AI for Observability and Root Cause Analysis
Incidents wake engineers at 3 AM to spelunk through logs, traces, and metrics across a dozen services. LLMs with tool access to observability systems can summarize the incident, correlate signals, propose probable root causes, and even draft the postmortem. The pattern is agentic: the LLM calls out to Datadog, Grafana, and the deployment system, narrating its reasoning as it goes. The human on-call engineer confirms, acts, and owns the call.
Application facts
- Domain: Software
- Subdomain: SRE
- Example stack: Claude Opus 4.7 with tool use for multi-source correlation · Datadog, Grafana, Honeycomb, New Relic via API tools · LangGraph agent with budget + timeout · Vector search over past incident postmortems · PagerDuty / Opsgenie integration
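The "agent with budget + timeout" item in the example stack can be illustrated with a minimal sketch. This is plain Python, not LangGraph, and every name in it (`Budget`, `investigate`, `fake_planner`, the two tools) is hypothetical. The planner function stands in for the LLM call, and the tools return canned strings rather than querying a real observability API.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Budget:
    max_steps: int        # hard cap on tool calls
    max_seconds: float    # wall-clock limit for the whole investigation
    started: float = field(default_factory=time.monotonic)
    steps: int = 0

    def exhausted(self) -> bool:
        elapsed = time.monotonic() - self.started
        return self.steps >= self.max_steps or elapsed >= self.max_seconds


def investigate(plan_next: Callable[[List[dict]], dict],
                tools: Dict[str, Callable[..., str]],
                budget: Budget) -> dict:
    """Run tool calls until the planner finishes or the budget runs out."""
    history: List[dict] = []
    while not budget.exhausted():
        action = plan_next(history)  # stand-in for the LLM deciding the next step
        if action["type"] == "finish":
            return {"status": "done",
                    "hypothesis": action["hypothesis"],
                    "history": history}
        tool = tools.get(action["tool"])
        if tool is None:
            result = f"unknown tool: {action['tool']}"
        else:
            result = tool(**action.get("args", {}))
        history.append({"tool": action["tool"], "result": result})
        budget.steps += 1
    # Out of budget: hand the evidence gathered so far to the on-call engineer.
    return {"status": "budget_exhausted", "hypothesis": None, "history": history}


# Hypothetical stubs so the sketch runs without any external service.
def query_metrics(service: str) -> str:
    return f"{service}: 5xx rate elevated"


def recent_deploys(service: str) -> str:
    return f"{service}: deploy shortly before alert"


def fake_planner(history: List[dict]) -> dict:
    if len(history) == 0:
        return {"type": "tool", "tool": "query_metrics", "args": {"service": "checkout"}}
    if len(history) == 1:
        return {"type": "tool", "tool": "recent_deploys", "args": {"service": "checkout"}}
    return {"type": "finish", "hypothesis": "error spike follows latest checkout deploy"}


TOOLS = {"query_metrics": query_metrics, "recent_deploys": recent_deploys}
outcome = investigate(fake_planner, TOOLS, Budget(max_steps=5, max_seconds=30.0))
print(outcome["status"], "-", outcome["hypothesis"])
```

When the budget runs out, the loop returns the partial history instead of a guess, so the engineer still gets the evidence collected so far.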
Data & infrastructure needs
- Structured logs with consistent tagging
- Distributed traces (OpenTelemetry)
- Metrics and SLO definitions
- Deployment events and feature-flag history
- Historical postmortem library
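One reason deployment events and feature-flag history are on this list: a common first correlation is "what changed shortly before the alert?". The sketch below shows that time-window join. It assumes change events are already normalized into one record type; `ChangeEvent`, `suspects`, the 30-minute lookback, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class ChangeEvent:
    service: str
    kind: str       # "deploy" or "flag"
    at: datetime
    detail: str


def suspects(events: List[ChangeEvent], alert_at: datetime,
             lookback: timedelta = timedelta(minutes=30)) -> List[ChangeEvent]:
    """Return changes that landed in the lookback window before the alert, newest first."""
    window_start = alert_at - lookback
    in_window = [e for e in events if window_start <= e.at <= alert_at]
    return sorted(in_window, key=lambda e: e.at, reverse=True)


# Illustrative data only.
alert_at = datetime(2026, 1, 1, 3, 0)
events = [
    ChangeEvent("search", "deploy", datetime(2026, 1, 1, 1, 0), "routine release"),
    ChangeEvent("checkout", "deploy", datetime(2026, 1, 1, 2, 48), "payment client bump"),
    ChangeEvent("checkout", "flag", datetime(2026, 1, 1, 2, 55), "new_retry_policy on"),
]

for e in suspects(events, alert_at):
    print(e.at.isoformat(), e.service, e.kind, e.detail)
```

A recent change is only a candidate, not a root cause; the traces and metrics still have to support it.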
Risks & considerations
- Plausible but wrong root causes leading engineers astray
- Sensitive data in logs exposed to cloud LLMs
- Prompt injection via user-supplied log content
- Over-reliance degrading engineer skill
- Cost — agentic loops can rack up LLM bills fast
Frequently asked questions
Is AI for SRE root cause analysis safe?
As a diagnostic copilot, yes: it can substantially shorten mean time to resolution (MTTR) on routine incidents. Never let it auto-rollback or auto-failover without human confirmation. Scrub logs for PII and secrets before they hit a cloud LLM, or self-host.
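Scrubbing can start as a simple redaction pass applied to every log line before it enters a prompt. The sketch below is illustrative only: the patterns are hypothetical, cover just a few obvious shapes (emails, bearer tokens, `key=value` secrets, IPv4 addresses), and would miss plenty in real logs, so treat it as a starting point rather than a complete filter.

```python
import re

# Hypothetical patterns; real deployments need a list tuned to their own log formats.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("BEARER", re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]+")),
    ("SECRET", re.compile(r"(?i)\b(password|api_key|token|secret)=\S+")),
    ("IPV4", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def redact(line: str) -> str:
    """Replace each matched span with a labelled placeholder."""
    for label, pattern in PATTERNS:
        line = pattern.sub(f"[REDACTED_{label}]", line)
    return line


sample = 'user=alice@example.com password=hunter2 from 10.0.0.12 auth="Bearer abc.def.ghi"'
print(redact(sample))
```

Labelled placeholders keep the line readable for diagnosis: the model can still see that a credential or an address was present without seeing its value.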
Which LLM is best for observability?
Claude Opus 4.7 with extended thinking and tool use is excellent for multi-step diagnosis. GPT-5 with code-interpreter works well for metric analysis. To control cost, route simple incidents to a smaller model such as Sonnet or Haiku and escalate complex ones to the larger model.
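The route-and-escalate idea can be sketched as a small rule function. The tiers here are abstract labels ("small", "medium", "large") rather than real model identifiers, and the thresholds, the `Incident` fields, and the function names are hypothetical; a real router would be tuned against its own incident history.

```python
from dataclasses import dataclass

TIERS = ["small", "medium", "large"]  # abstract labels; map to real models in config


@dataclass
class Incident:
    severity: int                # 1 = most severe
    services_affected: int
    matches_known_runbook: bool


def pick_tier(incident: Incident) -> str:
    """Choose the cheapest tier that plausibly fits the incident."""
    if (incident.matches_known_runbook
            and incident.severity >= 3
            and incident.services_affected <= 1):
        return "small"
    if incident.severity >= 2 and incident.services_affected <= 3:
        return "medium"
    return "large"


def escalate(tier: str) -> str:
    """Move one tier up, e.g. when the cheaper model gives no confident hypothesis."""
    idx = TIERS.index(tier)
    return TIERS[min(idx + 1, len(TIERS) - 1)]


routine = Incident(severity=4, services_affected=1, matches_known_runbook=True)
outage = Incident(severity=1, services_affected=6, matches_known_runbook=False)
print(pick_tier(routine), pick_tier(outage), escalate(pick_tier(routine)))
```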
What are the regulatory concerns?
DPDPA and GDPR apply because logs often contain PII: self-host the LLM or put strong data processing agreements (DPAs) in place. In financial services, SOX and RBI requirements cover change management, so any rollback the AI proposes needs sign-off from the human change manager. In healthcare, HIPAA applies to log content.
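The human sign-off requirement can be enforced in code rather than by convention. A minimal sketch, assuming the agent emits proposed actions as structured records: anything in a destructive set is held until a named approver is attached. `ProposedAction`, `DESTRUCTIVE`, and `execute` are hypothetical names, and the runner is a stub that only records what it was asked to do.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical set of action kinds that must never run unattended.
DESTRUCTIVE = {"rollback", "failover", "restart"}


@dataclass
class ProposedAction:
    kind: str
    target: str
    rationale: str


def execute(action: ProposedAction,
            run: Callable[[ProposedAction], None],
            approved_by: Optional[str] = None) -> str:
    """Run the action only if it is non-destructive or a human has approved it."""
    if action.kind in DESTRUCTIVE and not approved_by:
        return "pending_approval"
    run(action)
    return "executed"


executed: List[str] = []


def stub_runner(action: ProposedAction) -> None:
    # Stand-in for the real deployment system call.
    executed.append(f"{action.kind}:{action.target}")


rollback = ProposedAction("rollback", "checkout", "errors began after latest deploy")
print(execute(rollback, stub_runner))                        # held for a human
print(execute(rollback, stub_runner, approved_by="oncall"))  # runs once approved
```

Recording the approver alongside the action also gives the audit trail that change-management reviews typically expect.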
Sources
- OpenTelemetry — accessed 2026-04-20
- Google SRE Book — accessed 2026-04-20