LLMs for SRE Teams: Real Use Cases, Not Hype

SREs are rightly skeptical of AI hype. This guide cuts through it: here are three use cases where LLMs are genuinely good in SRE contexts, the things they are not good at, and what the economics of AI in incident response actually look like.

ObservabilityOS Team

Core Engineering & DevRel

July 13, 2026 · 9 min read

The SRE's Legitimate Skepticism

SREs have good reasons to distrust AI claims. Production systems operate under constraints that most AI demonstrations ignore: reliability requirements, latency budgets, regulatory obligations, and the cost of getting it wrong. An LLM that confabulates a plausible-sounding root cause during a P1 incident does not just waste time; it actively makes the incident worse by sending the investigation down the wrong path.

The useful question is not 'can LLMs help SRE teams?' but 'which specific SRE tasks do LLMs demonstrably handle better than humans, under production conditions, with acceptable error rates?' The answer to that narrower question is both specific and valuable: there are genuine wins available, but capturing them requires understanding what LLMs can and cannot do.

This guide is opinionated. It will tell you which use cases work and which don't based on empirical observation from running AI models in production SRE workflows. If you're looking for hype, this is the wrong article.

What LLMs Are Actually Good At

LLMs excel at pattern matching and synthesis in text: identifying recurring themes across hundreds of log lines, connecting terminology from different contexts (a git commit message and an error log), and producing coherent prose from structured data. These are capabilities that complement rather than replace SRE expertise.

LLMs bring four structural advantages to SRE work. They are tireless: they will read 10,000 log lines without attention fatigue. They have broad technical knowledge, having been trained on millions of engineering blog posts, Stack Overflow answers, and technical documentation. They produce structured output on demand (JSON, markdown, templates). And they are increasingly cheap ($0.001–$0.01 per task at current model prices).

Use Case 1: Log Pattern Summarization

Given 200 log lines from a misbehaving service, an LLM can produce a 3-sentence summary of the error patterns, their frequency, and their apparent cause in under 3 seconds. A human engineer reading the same 200 lines takes 5–10 minutes and is more likely to fixate on the first interesting error than to notice patterns across the full set.

This works because log summarization is exactly the kind of task LLMs are structurally good at: pattern recognition in semi-structured text, with a clear summarization goal. The output is good enough to be the first step in incident investigation, not the last, and it significantly shortens the data-gathering phase of incident response.

typescript
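// Log summarization prompt structure used by ObservabilityOS.
// Builds a single prompt string that embeds the raw log lines and asks for a fixed JSON shape.
const summarizeLogsPrompt = (logs: string[]): string => `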
Analyze these ${logs.length} log entries from a production incident.

LOGS:
${logs.join('\n')}

Produce a JSON response with:
{
  "dominant_error": "the most common error pattern",
  "error_count": number of distinct error types,
  "timeline": "when errors started, any escalation pattern",
  "affected_components": ["list", "of", "services"],
  "key_patterns": ["up to 3 notable patterns"],
  "suggested_investigation": "what to look at next"
}
`;
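
The prompt above asks for JSON, but a model's reply is not guaranteed to be valid JSON or to match the requested shape. A minimal sketch of checking the reply before it reaches an incident channel follows. The `LogSummary` type and `parseLogSummary` function are hypothetical names introduced here for illustration, not part of ObservabilityOS.

```typescript
// Hypothetical type mirroring the fields requested in the prompt above.
interface LogSummary {
  dominant_error: string;
  error_count: number;
  timeline: string;
  affected_components: string[];
  key_patterns: string[];
  suggested_investigation: string;
}

const isStringArray = (value: unknown): value is string[] =>
  Array.isArray(value) && value.every((item) => typeof item === "string");

// Returns the parsed summary, or null when the reply is not valid JSON
// or does not match the expected shape. Callers should treat null as
// "fall back to the raw logs", not as "no errors found".
function parseLogSummary(raw: string): LogSummary | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;

  const {
    dominant_error,
    error_count,
    timeline,
    affected_components,
    key_patterns,
    suggested_investigation,
  } = data as Record<string, unknown>;

  if (
    typeof dominant_error !== "string" ||
    typeof error_count !== "number" ||
    !Number.isFinite(error_count) ||
    typeof timeline !== "string" ||
    !isStringArray(affected_components) ||
    !isStringArray(key_patterns) ||
    key_patterns.length > 3 || // the prompt asks for "up to 3" patterns
    typeof suggested_investigation !== "string"
  ) {
    return null;
  }

  return {
    dominant_error,
    error_count,
    timeline,
    affected_components,
    key_patterns,
    suggested_investigation,
  };
}
```

Rejecting a malformed reply outright is a deliberate choice in this sketch: during an incident, a missing summary is easier to recover from than a half-parsed one.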

Use Case 2: Incident Narrative Generation

During an active incident, every 30 minutes the on-call engineer needs to communicate status to stakeholders: what's happening, what's been tried, what the current theory is, and what the ETA to resolution is. Writing these updates while actively debugging is context-switching overhead that degrades both the debugging and the communication.

LLMs can draft these status updates from structured incident data: the current alert state, the investigation steps taken so far, and the current hypothesis. The engineer reviews and sends the draft in about 60 seconds instead of writing from scratch in 5 minutes. Over a 2-hour incident with an update every 30 minutes, that is four updates at roughly 4 minutes saved each, or 15–20 minutes of context switching avoided at the worst possible time.
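
As a rough sketch of what "structured incident data" could look like in practice, the snippet below builds a status-update prompt from the three inputs named above. The `IncidentState` type, its field names, and `buildStatusUpdatePrompt` are assumptions made for this example, not an ObservabilityOS API.

```typescript
// Hypothetical incident record; field names are illustrative only.
interface IncidentState {
  title: string;
  severity: string;
  firingAlerts: string[];
  stepsTaken: string[];
  currentHypothesis: string;
  etaMinutes: number | null; // null when no ETA can be given yet
}

// Renders a list as bullets, with an explicit marker for empty lists so the
// model is not tempted to invent entries.
const asBullets = (items: string[]): string =>
  items.length > 0 ? items.map((item) => `- ${item}`).join("\n") : "- (none recorded)";

function buildStatusUpdatePrompt(incident: IncidentState): string {
  const eta =
    incident.etaMinutes === null ? "unknown" : `${incident.etaMinutes} minutes`;

  return [
    `Draft a stakeholder status update for this ${incident.severity} incident: ${incident.title}.`,
    "Use only the facts below. Do not add causes, numbers, or timelines that are not listed.",
    "",
    "FIRING ALERTS:",
    asBullets(incident.firingAlerts),
    "",
    "INVESTIGATION STEPS SO FAR:",
    asBullets(incident.stepsTaken),
    "",
    `CURRENT HYPOTHESIS (unconfirmed): ${incident.currentHypothesis}`,
    `ETA TO RESOLUTION: ${eta}`,
    "",
    "Write four short paragraphs: what is happening, what has been tried, the current theory, and the ETA.",
  ].join("\n");
}
```

Labeling the hypothesis as unconfirmed in the prompt is one way to keep the draft from presenting a working theory as a root cause; the engineer still reviews the text before sending it.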

Use Case 3: Runbook Drafting from Incident Data

The most persistent SRE problem is runbooks that are incomplete, outdated, or nonexistent. Writing a runbook requires an engineer to sit down and document a procedure, which almost never happens once the incident is resolved because the team is already behind on sprint work.

An LLM can draft a runbook from an incident timeline: the detection method, the investigation steps taken, the resolution action, and the prevention action. The draft won't be perfect and it needs review and refinement, but a 70% draft that takes 5 minutes to finish beats a blank page that stays blank indefinitely. ObservabilityOS generates runbook drafts automatically from completed post-mortems.
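
The four timeline elements listed above map naturally onto a small structured input. The following is an illustrative sketch only: `IncidentTimeline` and `buildRunbookDraftPrompt` are hypothetical names, and this is not a description of how ObservabilityOS generates its drafts.

```typescript
// Hypothetical timeline record built from a completed post-mortem.
interface IncidentTimeline {
  service: string;
  detection: string; // how the incident was detected
  investigationSteps: string[]; // in the order they were performed
  resolution: string; // the action that resolved it
  prevention: string; // the follow-up that should stop a repeat
}

function buildRunbookDraftPrompt(timeline: IncidentTimeline): string {
  // Number the steps so the model preserves the order they were taken in.
  const numberedSteps = timeline.investigationSteps
    .map((step, index) => `${index + 1}. ${step}`)
    .join("\n");

  return [
    `Draft a runbook for the service "${timeline.service}" from this resolved incident.`,
    "",
    `DETECTION: ${timeline.detection}`,
    "INVESTIGATION STEPS (in order):",
    numberedSteps,
    `RESOLUTION: ${timeline.resolution}`,
    `PREVENTION: ${timeline.prevention}`,
    "",
    "Produce sections titled Symptoms, Diagnosis, Resolution, and Prevention.",
    "Mark any step you inferred rather than read from the timeline as NEEDS REVIEW.",
  ].join("\n");
}
```

Asking the model to flag inferred steps gives the reviewing engineer an obvious place to start the refinement pass.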

What LLMs Cannot Do (And Why It Matters)

LLMs cannot reliably diagnose incidents whose patterns they have never seen before. Novel failure modes, such as new infrastructure behaviors, zero-day vulnerabilities, and complex multi-system race conditions, will produce confabulated diagnoses that sound plausible but are wrong. For novel incidents, LLMs are useful for data gathering and summary, not for diagnosis.

LLMs cannot do arithmetic reliably. Do not ask an LLM to calculate error rates, compare numbers, or perform any computation that matters. Use code for computation and LLMs for language. LLMs also cannot access real-time data — they do not know your current system state unless you provide it explicitly in the context.
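
One way to apply "code for computation, LLMs for language" is to calculate the numbers first and hand them to the model as finished facts. The sketch below assumes a hypothetical `WindowCounts` record holding request counts for a time window; the names are illustrative.

```typescript
// Hypothetical request counts for one time window.
interface WindowCounts {
  label: string;
  totalRequests: number;
  failedRequests: number;
}

// Error rate as a percentage, computed in code rather than by the model.
function errorRatePercent(window: WindowCounts): number {
  if (window.totalRequests === 0) return 0;
  return (window.failedRequests / window.totalRequests) * 100;
}

// Produces a block of precomputed facts to paste into the model's context.
function describeErrorRates(baseline: WindowCounts, incident: WindowCounts): string {
  const baselineRate = errorRatePercent(baseline);
  const incidentRate = errorRatePercent(incident);
  const line = (window: WindowCounts, rate: number): string =>
    `${window.label}: ${window.failedRequests} failed of ${window.totalRequests} requests (${rate.toFixed(2)}%)`;

  return [
    "PRECOMPUTED METRICS (calculated in code, do not recalculate):",
    line(baseline, baselineRate),
    line(incident, incidentRate),
    `Change in error rate: ${(incidentRate - baselineRate).toFixed(2)} percentage points`,
  ].join("\n");
}
```

The same block also addresses the real-time data limitation: the model only knows the current error rate because the caller put it in the context.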

Auto-remediation based on LLM recommendations is not ready for production. The rate of incorrect LLM diagnoses, while much lower than random guessing, is still non-zero, and an incorrect automated remediation can be catastrophically worse than the original incident. Keep humans in the loop for any action that modifies production state.
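
A human-in-the-loop rule like this can be enforced in code rather than left to convention. The sketch below is a minimal illustration under assumed types: `ProposedAction`, `Approval`, and `executeWithGate` are hypothetical, and a real system would also need authentication and audit logging.

```typescript
type ActionKind = "read_only" | "mutating";

// Hypothetical action proposed by the model or by tooling around it.
interface ProposedAction {
  description: string;
  kind: ActionKind;
  run: () => string; // performs the action and returns its output
}

// Hypothetical record of a human sign-off.
interface Approval {
  approvedBy: string;
}

type GateResult =
  | { executed: true; output: string }
  | { executed: false; reason: string };

// Read-only actions run immediately; anything that modifies production
// state is refused unless a named human has approved it.
function executeWithGate(action: ProposedAction, approval?: Approval): GateResult {
  const hasApproval = approval !== undefined && approval.approvedBy.trim() !== "";
  if (action.kind === "mutating" && !hasApproval) {
    return {
      executed: false,
      reason: `"${action.description}" modifies production state and needs human approval`,
    };
  }
  return { executed: true, output: action.run() };
}
```

Classifying each action as `read_only` or `mutating` up front keeps the safe path fast (log queries, dashboard lookups) while making the risky path impossible to take by accident.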

The Economics: Cost Per Incident

At current model pricing, an AI-assisted incident analysis costs approximately $0.002–$0.01 per incident using Claude Haiku or GPT-4o-mini. For a team with 200 incidents per month, that's $0.40–$2.00 per month in LLM costs. Even at 10x that volume ($4–$20 per month), AI-assisted root cause analysis (RCA) remains an order of magnitude cheaper than the engineer time it replaces.
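
The arithmetic behind these figures is simple enough to keep in a small helper, so it can be rerun when prices or incident volume change. The per-incident range below is the one quoted above; the function and type names are illustrative.

```typescript
interface CostRange {
  low: number; // USD
  high: number; // USD
}

// Monthly LLM spend for a given incident volume and per-incident cost range.
function monthlyLlmCost(incidentsPerMonth: number, perIncident: CostRange): CostRange {
  return {
    low: incidentsPerMonth * perIncident.low,
    high: incidentsPerMonth * perIncident.high,
  };
}

const formatUsd = (range: CostRange): string =>
  `$${range.low.toFixed(2)}-$${range.high.toFixed(2)}`;

// Per-incident cost range quoted in the text.
const perIncidentCost: CostRange = { low: 0.002, high: 0.01 };

const atCurrentVolume = formatUsd(monthlyLlmCost(200, perIncidentCost)); // "$0.40-$2.00"
const atTenTimesVolume = formatUsd(monthlyLlmCost(2000, perIncidentCost)); // "$4.00-$20.00"
```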

The compounding value: every incident that gets a proper AI-assisted post-mortem produces data that improves the next analysis. Teams running AI-assisted incident response for 6+ months report 40–60% reductions in MTTR as the system learns their specific infrastructure patterns and the engineers learn to use AI context more effectively.

About the Author

ObservabilityOS Team

Core Engineering & DevRel

The core engineering, site reliability, and developer relations team behind ObservabilityOS. We build AI-native observability infrastructure to eliminate 3 AM firefighting.