Why Traditional RCA Fails at 3 AM
Traditional incident response requires the on-call engineer to manually correlate information from several systems: service logs, infrastructure metrics, recent deploys, error-tracking events, and historical context from previous incidents. Under pressure, at odd hours, with incomplete context, humans make poor correlations. The average MTTI (mean time to investigate) for complex incidents is 45–90 minutes, and most of that time goes to gathering data rather than analyzing it.
The information needed to diagnose most incidents is already there. Your logs recorded the anomaly. Your deploy system recorded the change that caused it. Your error tracker captured the stack trace. The problem is not data availability — it's data correlation at speed. AI excels at exactly this: ingesting large volumes of heterogeneous data and producing structured, reasoned analysis in seconds.
AI root cause analysis is not magic. It is a disciplined engineering pipeline that gathers the right context, structures it correctly, and submits it to an LLM with carefully engineered prompts. The quality of the output depends directly on the quality of the context and the prompt.
The Context Pipeline: What the LLM Actually Receives
An LLM analyzing a production incident receives a structured context window containing: (1) the anomaly detection event with its Z-score and metric values, (2) 50–200 relevant log lines from the affected service, time-windowed around the anomaly, (3) recent deploy events within the previous 24 hours, including git commit messages and diff summaries, (4) correlated errors from other services that spiked simultaneously, and (5) historical context — similar incidents from the past 30 days with their resolutions.
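The time-windowing of logs and the 24-hour deploy lookback described above can be sketched roughly as follows. This is a minimal illustration, not ObservabilityOS's actual code: the `LogLine` and `DeployEvent` shapes, the function names, and the 15-minute default window are all assumptions made for the example.

```typescript
// Hypothetical record shapes for illustration; not ObservabilityOS's actual schema.
interface LogLine {
  timestamp: string; // ISO 8601
  service: string;
  message: string;
}

interface DeployEvent {
  timestamp: string; // ISO 8601
  service: string;
  version: string;
  commitMessage: string;
}

const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;

// Keep log lines from the affected service within +/- windowMinutes of the
// anomaly, in chronological order, capped at the maxLines most recent lines.
function selectLogs(
  logs: LogLine[],
  service: string,
  anomalyAt: string,
  windowMinutes = 15, // assumed window size, not taken from the article
  maxLines = 200, // upper end of the 50-200 range mentioned in the text
): LogLine[] {
  const anomalyMs = Date.parse(anomalyAt);
  return logs
    .filter(
      (l) =>
        l.service === service &&
        Math.abs(Date.parse(l.timestamp) - anomalyMs) <= windowMinutes * MINUTE_MS,
    )
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp))
    .slice(-maxLines);
}

// Keep deploys that landed in the lookback period before the anomaly, newest first.
function selectDeploys(
  deploys: DeployEvent[],
  anomalyAt: string,
  lookbackHours = 24,
): DeployEvent[] {
  const anomalyMs = Date.parse(anomalyAt);
  return deploys
    .filter((d) => {
      const deployMs = Date.parse(d.timestamp);
      return deployMs <= anomalyMs && anomalyMs - deployMs <= lookbackHours * HOUR_MS;
    })
    .sort((a, b) => Date.parse(b.timestamp) - Date.parse(a.timestamp));
}
```

Sorting deploys newest first is a design choice for this sketch: the most recent change is usually the first suspect, so it should reach the model before older ones.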
The context must be assembled in priority order. LLMs have fixed context windows, and the most relevant information must appear early. Token budget allocation for a typical incident context: 20% for system prompt and instructions, 30% for logs, 20% for deploy/commit context, 15% for correlated service data, 15% for historical precedents.
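As a rough sketch of the budget split above, the percentages can be turned into per-section token limits, with each section then filled in priority order until its limit is reached. The section names, the helper functions, and the four-characters-per-token estimate are assumptions for illustration; a real pipeline would count tokens with the target model's own tokenizer.

```typescript
// Percent shares from the text above; the section names are illustrative.
const BUDGET_PERCENT = {
  instructions: 20,
  logs: 30,
  deploys: 20,
  correlated: 15,
  history: 15,
} as const;

type Section = keyof typeof BUDGET_PERCENT;

// Split a total token budget into per-section budgets.
// Integer percentages avoid floating-point rounding surprises.
function allocateBudget(totalTokens: number): Record<Section, number> {
  const budgets = {} as Record<Section, number>;
  for (const section of Object.keys(BUDGET_PERCENT) as Section[]) {
    budgets[section] = Math.floor((totalTokens * BUDGET_PERCENT[section]) / 100);
  }
  return budgets;
}

// Crude estimate (about 4 characters per token). This is an assumption;
// use the model's tokenizer for real counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep items in the order given (highest priority first) and stop at the
// first item that would exceed the section's budget.
function fitToBudget(items: string[], budget: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const item of items) {
    const cost = estimateTokens(item);
    if (used + cost > budget) break;
    kept.push(item);
    used += cost;
  }
  return kept;
}
```

For example, `allocateBudget(8000).logs` yields a 2,400-token limit for logs, and `fitToBudget(sortedLogLines, 2400)` would then trim the log section to fit, where `sortedLogLines` is a hypothetical array already ordered by relevance.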
Pre-processing matters enormously. Raw log lines contain timestamps, service prefixes, and structured JSON that an LLM can parse, but verbatim log data is expensive in tokens and often includes irrelevant fields. ObservabilityOS pre-processes logs to extract the signal: error messages, exception types, relevant field values, and timing relationships. This compression often reduces the token count by 60–70% while preserving the diagnostic signal.
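One plausible shape for this kind of pre-processing is sketched below: parse each JSON log line, keep only errors, extract the exception type, and collapse repeated messages into a single entry with a count. The `RawLog` field names and the `compressLogs` function are hypothetical, and the sketch makes no claim about the compression ratio it would achieve.

```typescript
// Hypothetical structured log record; real field names depend on the logger.
interface RawLog {
  timestamp: string;
  level: string;
  message: string;
}

interface CompressedLog {
  firstSeen: string;
  lastSeen: string;
  count: number;
  exceptionType: string | null;
  message: string;
}

// Extract a leading exception type such as "MongoServerError" from
// "MongoServerError: connection pool exhausted".
function exceptionTypeOf(message: string): string | null {
  const match = /^([A-Z][A-Za-z0-9_]*(?:Error|Exception))\b/.exec(message);
  return match?.[1] ?? null;
}

// Narrow an arbitrary parsed JSON value to the fields this sketch needs.
function toRawLog(value: unknown): RawLog | null {
  if (typeof value !== "object" || value === null) return null;
  const v = value as Record<string, unknown>;
  if (typeof v.timestamp !== "string") return null;
  if (typeof v.level !== "string") return null;
  if (typeof v.message !== "string") return null;
  return { timestamp: v.timestamp, level: v.level, message: v.message };
}

// Collapse repeated error messages into one entry each, most frequent first.
// Assumes the input lines are already in chronological order.
function compressLogs(rawLines: string[]): CompressedLog[] {
  const groups = new Map<string, CompressedLog>();
  for (const line of rawLines) {
    let record: RawLog | null = null;
    try {
      record = toRawLog(JSON.parse(line));
    } catch {
      continue; // skip lines that are not valid JSON
    }
    if (record === null || record.level !== "error") continue;
    const existing = groups.get(record.message);
    if (existing) {
      existing.count += 1;
      existing.lastSeen = record.timestamp;
    } else {
      groups.set(record.message, {
        firstSeen: record.timestamp,
        lastSeen: record.timestamp,
        count: 1,
        exceptionType: exceptionTypeOf(record.message),
        message: record.message,
      });
    }
  }
  return Array.from(groups.values()).sort((a, b) => b.count - a.count);
}
```

Grouping by exact message text is the simplest possible choice. Messages that embed request IDs or timestamps would need normalizing first, which this sketch does not attempt.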
Prompt Engineering for Incident Reasoning
The system prompt for an AI RCA tool has three responsibilities: (1) define the output format so the response is machine-parseable, (2) establish the reasoning approach (what to look for, how to connect evidence), and (3) calibrate confidence communication.
Chain-of-thought (CoT) prompting dramatically improves diagnostic accuracy. Rather than asking for a conclusion directly, the prompt instructs the model to reason step by step: first summarize the observed anomaly, then list all evidence sorted by strength, then identify the most probable cause and explain why, then list alternative causes and why they're less likely, then suggest remediation steps.
// Simplified version of ObservabilityOS's RCA prompt structure
const systemPrompt = `
You are an expert SRE analyzing a production incident.
Reason step by step. Format your response as JSON.
STEP 1 — Summarize the anomaly (2 sentences max)
STEP 2 — List evidence ranked by diagnostic strength
STEP 3 — Identify the root cause with confidence (0.0-1.0)
STEP 4 — List alternative causes with confidence scores
STEP 5 — Suggest immediate remediation steps (ordered by priority)
Rules:
- Never state certainty above 0.95 — production systems are complex
- If evidence conflicts, lower your confidence and note the conflict
- Always correlate timing: causation requires temporal precedence
- Distinguish between direct causes and contributing factors
`;
const userPrompt = `
INCIDENT CONTEXT:
- Anomaly detected: 2026-07-05T03:47:12Z
- Service: checkout-api (Z-score: +4.8 on error_rate)
- Anomaly value: 23.4% error rate (baseline: 1.2%)
RELEVANT LOGS (last 50 errors):
MongoServerError: connection pool exhausted (timeout after 30000ms)
at checkout-api/src/db/mongo.ts:47
RECENT DEPLOYS:
- 2026-07-05T03:35:00Z: checkout-api v2.4.1 → v2.4.2
Commit: "Add eager loading of related products on checkout"
Changed files: checkout.service.ts, mongo.ts
CORRELATED ANOMALIES:
- inventory-service: query_time p99 spike (+340%) at 03:47
`;
A Real Incident: AI Diagnoses MongoDB Connection Exhaustion
Here is an anonymized real incident from a customer's production system. The checkout service started returning 500 errors at 3:47 AM, waking the on-call engineer. Without AI RCA, the investigation would have involved manually searching logs, checking deploy history, and correlating timing — a 30–45 minute process.
With ObservabilityOS's AI RCA, the incident card was generated in 8 seconds:
{
  "summary": "Checkout service connection pool exhaustion caused by a recent deploy that added N+1 queries to the product loading path, overwhelming MongoDB connection limits under normal load.",
  "root_cause": {
    "description": "Deploy v2.4.2 (03:35 UTC) introduced eager loading of related products, which executes one MongoDB query per cart item. A cart with 12 items now triggers 12 concurrent queries. The connection pool (max: 10) became exhausted within 12 minutes of the deploy.",
    "confidence": 0.91,
    "evidence": [
      "MongoServerError: connection pool exhausted (847 occurrences post-deploy)",
      "Deploy timing: errors begin exactly 12 min after v2.4.2 deployment",
      "inventory-service query time spike: corroborates MongoDB contention",
      "Changed file mongo.ts: connection pool config unchanged at max:10"
    ]
  },
  "alternatives": [
    {
      "cause": "MongoDB server resource exhaustion (CPU/Memory)",
      "confidence": 0.06,
      "reason": "No MongoDB server metrics indicate resource saturation"
    }
  ],
  "remediation": [
    "IMMEDIATE: Revert to v2.4.1 (git revert 8f3a021)",
    "SHORT-TERM: Increase connection pool max to 50 in mongo.ts",
    "PERMANENT: Replace N+1 queries with a single aggregation pipeline"
  ]
}
Confidence Scoring and When AI Gets It Wrong
AI RCA has a well-documented failure mode: when evidence is genuinely ambiguous, the model can produce a plausible-sounding but incorrect diagnosis with high confidence. This is the hallucination problem applied to incident response. The mitigation: calibrated confidence scoring and explicit uncertainty expression in the prompt.
A confidence score below 0.6 should route the incident to human investigation with AI as an assistant, not the lead. ObservabilityOS shows confidence scores prominently in the incident card and includes a 'Why the AI is uncertain' section when confidence is below 0.7. Engineers should treat AI RCA as one signal among several, especially for novel failure modes.
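The thresholds above could be applied along these lines. This is a sketch under stated assumptions: the `Route` labels and function names are invented for the example, the 0.95 cap mirrors the rule in the system prompt, and treating an unparseable response as human-led is a choice made here, not something the article specifies.

```typescript
type Route = "ai_led" | "ai_led_with_uncertainty_note" | "human_led";

// Thresholds taken from the text; the cap mirrors the system prompt rule.
const HUMAN_LED_BELOW = 0.6;
const UNCERTAINTY_NOTE_BELOW = 0.7;
const MAX_CONFIDENCE = 0.95;

interface RoutingDecision {
  route: Route;
  confidence: number;
}

// Read root_cause.confidence from the model's JSON response and clamp it to
// [0, MAX_CONFIDENCE]. Returns null if the response is malformed.
function extractConfidence(rawResponse: string): number | null {
  try {
    const parsed: unknown = JSON.parse(rawResponse);
    if (typeof parsed !== "object" || parsed === null) return null;
    const rootCause = (parsed as { root_cause?: unknown }).root_cause;
    if (typeof rootCause !== "object" || rootCause === null) return null;
    const confidence = (rootCause as { confidence?: unknown }).confidence;
    if (typeof confidence !== "number" || !isFinite(confidence)) return null;
    return Math.min(Math.max(confidence, 0), MAX_CONFIDENCE);
  } catch {
    return null; // not valid JSON
  }
}

// Decide who leads the investigation. A malformed response is treated as
// zero confidence and routed to a human (an assumption of this sketch).
function routeIncident(rawResponse: string): RoutingDecision {
  const confidence = extractConfidence(rawResponse);
  if (confidence === null) return { route: "human_led", confidence: 0 };
  if (confidence < HUMAN_LED_BELOW) return { route: "human_led", confidence };
  if (confidence < UNCERTAINTY_NOTE_BELOW) {
    return { route: "ai_led_with_uncertainty_note", confidence };
  }
  return { route: "ai_led", confidence };
}
```

Clamping in code as well as in the prompt is deliberate: the model may ignore a prompt rule, so the pipeline should not rely on it alone.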
The system gets dramatically more accurate with more historical data. After 500 incidents, the model has seen your specific infrastructure's failure patterns, your team's most common root causes, and which types of evidence are most diagnostic for your stack. Fine-tuning a smaller model on this proprietary incident data is how ObservabilityOS builds a compounding technical moat over time.
About the Author
ObservabilityOS Team
Core Engineering & DevRel
The core engineering, site reliability, and developer relations team behind ObservabilityOS. We build AI-native observability infrastructure to eliminate 3 AM firefighting.