Contribution · Application — Software

AI for Observability and Root Cause Analysis

Incidents wake engineers at 3 AM to spelunk through logs, traces, and metrics across a dozen services. LLMs with tool access to observability systems can summarize the incident, correlate signals, propose probable root causes, and even draft the postmortem draft. The pattern is agentic: the LLM calls out to Datadog, Grafana, and the deployment system, narrating its reasoning. The human on-call engineer confirms, acts, and owns the call.

Application facts

Domain
Software
Subdomain
SRE
Example stack
Claude Opus 4.7 with tool use for multi-source correlation · Datadog, Grafana, Honeycomb, New Relic via API tools · LangGraph agent with budget + timeout · Vector search over past incident postmortems · PagerDuty / Opsgenie integration

Data & infrastructure needs

  • Structured logs with consistent tagging
  • Distributed traces (OpenTelemetry)
  • Metrics and SLO definitions
  • Deployment events and feature-flag history
  • Historical postmortem library

Risks & considerations

  • Plausible but wrong root causes leading engineers astray
  • Sensitive data in logs exposed to cloud LLMs
  • Prompt injection via user-supplied log content
  • Over-reliance degrading engineer skill
  • Cost — agentic loops can rack up LLM bills fast

Frequently asked questions

Is AI for SRE root cause analysis safe?

As a diagnostic copilot, yes — it dramatically cuts MTTR on routine incidents. Never let it auto-rollback or auto-failover without human confirmation. Scrub logs for PII and secrets before they hit a cloud LLM, or self-host.

What LLM is best for observability?

Claude Opus 4.7 with extended thinking and tool use is excellent for multi-step diagnosis. GPT-5 with code-interpreter works well for metric analysis. For cost, route simple incidents to Sonnet/Haiku and escalate complex ones.

Regulatory concerns?

DPDPA/GDPR — logs often contain PII. Self-host the LLM or use strong DPAs. Financial services: SOX/RBI on change management; any rollback the AI proposes needs the human change-manager. Healthcare: HIPAA on log content.

Sources

  1. OpenTelemetry — accessed 2026-04-20
  2. Google SRE Book — accessed 2026-04-20