Contribution · Application — Software

AI for Observability and Root Cause Analysis

Incidents wake engineers at 3 AM to spelunk through logs, traces, and metrics across a dozen services. LLMs with tool access to observability systems can summarize the incident, correlate signals, propose probable root causes, and even draft the postmortem draft. The pattern is agentic: the LLM calls out to Datadog, Grafana, and the deployment system, narrating its reasoning. The human on-call engineer confirms, acts, and owns the call.

Application facts

Domain: Software
Subdomain: SRE
Example stack: Claude Opus 4.7 with tool use for multi-source correlation · Datadog, Grafana, Honeycomb, New Relic via API tools · LangGraph agent with budget + timeout · Vector search over past incident postmortems · PagerDuty / Opsgenie integration

Data & infrastructure needs

Structured logs with consistent tagging
Distributed traces (OpenTelemetry)
Metrics and SLO definitions
Deployment events and feature-flag history
Historical postmortem library

Risks & considerations

Plausible but wrong root causes leading engineers astray
Sensitive data in logs exposed to cloud LLMs
Prompt injection via user-supplied log content
Over-reliance degrading engineer skill
Cost — agentic loops can rack up LLM bills fast

Frequently asked questions

Is AI for SRE root cause analysis safe?

As a diagnostic copilot, yes — it dramatically cuts MTTR on routine incidents. Never let it auto-rollback or auto-failover without human confirmation. Scrub logs for PII and secrets before they hit a cloud LLM, or self-host.

What LLM is best for observability?

Claude Opus 4.7 with extended thinking and tool use is excellent for multi-step diagnosis. GPT-5 with code-interpreter works well for metric analysis. For cost, route simple incidents to Sonnet/Haiku and escalate complex ones.

Regulatory concerns?

DPDPA/GDPR — logs often contain PII. Self-host the LLM or use strong DPAs. Financial services: SOX/RBI on change management; any rollback the AI proposes needs the human change-manager. Healthcare: HIPAA on log content.

Sources

OpenTelemetry — accessed 2026-04-20
Google SRE Book — accessed 2026-04-20