SLO vs SLA vs SLI: What Every Engineer Needs to Know

SLOs, SLAs, and SLIs are the vocabulary of production reliability. This guide explains the differences with real examples, shows you how to set meaningful targets, and explains error budgets and burn rate alerts that actually drive engineering decisions.

ObservabilityOS Team

Core Engineering & DevRel

July 7, 2026 · 8 min read

The Vocabulary of Reliability: Getting the Terms Right

SLI, SLO, and SLA are layered concepts that build on each other. Understanding the relationship between them is what makes them useful rather than bureaucratic checkbox exercises. Many teams implement SLAs without SLOs, or SLOs without SLIs, and end up with reliability targets that are disconnected from what the system actually measures. Start from the bottom up: measure first (SLI), target second (SLO), promise third (SLA).

These concepts originated in Google's Site Reliability Engineering practice, documented in the SRE Book. They represent a shift from monitoring infrastructure metrics (CPU, memory, disk) to measuring service quality from the user's perspective. The question is not 'is my CPU below 80%?' but 'are my users getting the responses they expect, at the speed they expect, with the reliability they expect?'

SLIs: What You Measure

A Service Level Indicator is the actual measurement — the raw number that represents how your service is performing right now. A good SLI directly reflects user experience. Bad SLIs measure infrastructure health; good SLIs measure user-perceived quality.

Four common SLI types are: Availability (the fraction of successful requests), Latency (the fraction of requests served within a threshold), Throughput (the rate of requests the system handles), and Quality (a proxy for whether the response was correct, e.g., cache hit rate, search result relevance). For most web APIs, availability and latency SLIs are sufficient to start.

typescript
// Availability SLI: fraction of requests that succeed (0.0 to 1.0)
// Definition: (HTTP 2xx + 3xx responses) / total requests
function calculateAvailabilitySLI(
  successfulRequests: number,
  totalRequests: number
): number {
  if (totalRequests === 0) return 1.0;
  return successfulRequests / totalRequests;
}

// Latency SLI: fraction of requests served within the threshold (0.0 to 1.0)
// Definition: requests at or under 200ms / total requests
function calculateLatencySLI(
  requestDurationsMs: number[],
  thresholdMs: number = 200
): number {
  if (requestDurationsMs.length === 0) return 1.0;
  const fast = requestDurationsMs.filter(d => d <= thresholdMs).length;
  return fast / requestDurationsMs.length;
}
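
The inputs to both functions usually come from request logs or metrics. As a minimal sketch, assuming a hypothetical `RequestRecord` shape (the field names are illustrative, not an ObservabilityOS API), both SLIs can be derived from raw records in one pass:

```typescript
// Hypothetical record shape for illustration; real log schemas will differ.
interface RequestRecord {
  status: number;     // HTTP status code
  durationMs: number; // server-side response time
}

// Computes availability and latency SLIs as fractions in [0, 1].
// Assumption: 2xx and 3xx count as success, matching the definition above.
function computeSLIs(
  records: RequestRecord[],
  latencyThresholdMs: number = 200
): { availability: number; latency: number } {
  if (records.length === 0) return { availability: 1.0, latency: 1.0 };
  let ok = 0;
  let fast = 0;
  for (const r of records) {
    if (r.status >= 200 && r.status < 400) ok++;
    if (r.durationMs <= latencyThresholdMs) fast++;
  }
  return { availability: ok / records.length, latency: fast / records.length };
}

const sample: RequestRecord[] = [
  { status: 200, durationMs: 120 },
  { status: 200, durationMs: 350 }, // succeeded, but too slow
  { status: 503, durationMs: 90 },  // fast, but failed
  { status: 301, durationMs: 40 },
];
const slis = computeSLIs(sample); // availability 0.75, latency 0.75
```

Note that the two SLIs fail on different requests here: a slow success hurts latency only, and a fast error hurts availability only, which is why they are tracked separately.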

SLOs: What You Target

A Service Level Objective is your internal reliability target — a commitment your engineering team makes to itself about how well the service should perform. An SLO is expressed as: SLI >= target over a time window. Example: 'The availability SLI will be >= 99.9% measured over a rolling 30-day window.'

Setting the right SLO target requires understanding what reliability users actually need, what your system is currently achieving, and what it would cost to improve. 99.9% availability (the 'three nines' target) allows about 43.2 minutes of downtime in a 30-day window (43.8 minutes in an average calendar month). 99.99% (four nines) allows about 4.3 minutes per 30 days. The jump from 99.9% to 99.99% is not just a number change: it typically requires architectural changes (redundant infrastructure, zero-downtime deploys, circuit breakers) that can represent months of engineering work.
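
The arithmetic behind these downtime figures is short. A minimal sketch, assuming a fixed 30-day window (an average calendar month of about 30.44 days gives slightly larger figures, roughly 43.8 and 4.4 minutes); the function name is illustrative:

```typescript
// Allowed downtime in minutes for an availability target over a window.
function allowedDowntimeMinutes(target: number, windowDays: number = 30): number {
  if (target <= 0 || target > 1) {
    throw new RangeError("target must be in (0, 1]");
  }
  return (1 - target) * windowDays * 24 * 60;
}

const threeNines = allowedDowntimeMinutes(0.999);  // about 43.2 minutes
const fourNines = allowedDowntimeMinutes(0.9999);  // about 4.32 minutes
```

Each additional nine cuts the allowance by a factor of ten, which is why the engineering cost rises so sharply between targets.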

A common mistake is setting SLO targets at your current performance. SLOs should be slightly below your current measured performance — they represent the floor, not the ceiling. If you're currently achieving 99.97% availability, an SLO of 99.9% gives you room to experiment and take calculated risks with your error budget.

SLAs: What You Promise

A Service Level Agreement is a contractual commitment to your customers, backed by financial consequences (service credits, refunds, contract clauses). An SLA should be more lenient than the corresponding SLO: the SLO is your internal bar, and the SLA is what you're willing to defend in a contract negotiation.

Typical structure: SLO = 99.9% availability, SLA = 99.5% availability. The gap between them is your safety margin — the buffer between 'we will detect and respond to this' and 'we owe you money.' Teams without this buffer expose themselves to SLA violations triggered by measurement methodology disagreements, unrelated infrastructure failures, or brief anomalies that recover before investigation.
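
The safety margin can be made concrete by classifying a measured SLI against both thresholds. This is a sketch under the assumption that the SLA target is strictly looser than the SLO target; the type and function names are hypothetical:

```typescript
type ReliabilityStatus = "within-slo" | "slo-breached" | "sla-breached";

// Classifies a measured SLI against an internal SLO and a contractual SLA.
// Assumes slaTarget < sloTarget, so the internal bar always trips first.
function classify(sli: number, sloTarget: number, slaTarget: number): ReliabilityStatus {
  if (slaTarget >= sloTarget) {
    throw new RangeError("SLA target should be looser than the SLO target");
  }
  if (sli < slaTarget) return "sla-breached"; // contractual consequences apply
  if (sli < sloTarget) return "slo-breached"; // internal response, no penalty yet
  return "within-slo";
}

const healthy = classify(0.9995, 0.999, 0.995); // "within-slo"
const warning = classify(0.997, 0.999, 0.995);  // "slo-breached"
const penalty = classify(0.99, 0.999, 0.995);   // "sla-breached"
```

The middle state is the buffer: the team is already responding, but no credits are owed.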

Not every service needs an SLA. Internal tools, development environments, beta features, and non-critical batch jobs should not have SLAs. Concentrate SLA commitments on the services that directly generate revenue or are contractually critical to enterprise customers.

Error Budgets: The Missing Piece

An error budget is the flip side of an SLO: it's the amount of unreliability your service is allowed to have within the SLO window. If your SLO is 99.9% availability over 30 days, your error budget is 0.1% of 30 days = 43.2 minutes of downtime. You start each window with a full budget and spend it through incidents.

Error budgets transform reliability from a compliance exercise into an engineering tool. When the budget is full, teams can move fast — deploy frequently, experiment aggressively, accept more risk. When the budget is depleted, teams must slow down — freeze new feature deployments, prioritize reliability work, and focus on reducing incident frequency before returning to normal velocity.

This creates a direct feedback loop between engineering decisions and reliability outcomes. Feature development that causes incidents depletes the budget; reliability improvements replenish it for next month. The error budget makes the trade-off between speed and reliability explicit and negotiable rather than political.

typescript
// Error budget tracking — what ObservabilityOS SLO dashboard shows
interface ErrorBudget {
  sloTarget: number;          // e.g., 0.999 (99.9%)
  windowDays: number;         // e.g., 30
  currentSLI: number;         // measured availability
  budgetMinutes: number;      // total allowed downtime
  spentMinutes: number;       // downtime consumed so far
  remainingPercent: number;   // budget remaining (0-100%)
  burnRate: number;           // current consumption rate vs budget
}

function calculateErrorBudget(
  sloTarget: number,
  windowDays: number,
  currentSLI: number,
  elapsedDays: number
): ErrorBudget {
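  // Guard against division by zero: a 100% SLO has no budget, and
  // burn rate is undefined before any time has elapsed.
  if (sloTarget >= 1 || windowDays <= 0 || elapsedDays <= 0) {
    throw new RangeError("sloTarget must be < 1; windowDays and elapsedDays must be > 0");
  }
  const budgetMinutes = (1 - sloTarget) * windowDays * 24 * 60;
  // Assumes currentSLI was measured over the elapsed portion of the window
  const spentMinutes = (1 - currentSLI) * elapsedDays * 24 * 60;
  const remainingPercent = Math.max(0, (1 - spentMinutes / budgetMinutes) * 100);
  // Burn rate > 1.0 means you'll exhaust the budget before the window ends
  const burnRate = spentMinutes / budgetMinutes / (elapsedDays / windowDays);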

  return { sloTarget, windowDays, currentSLI, budgetMinutes, spentMinutes, remainingPercent, burnRate };
}
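
The "move fast when the budget is full, slow down when it is depleted" rule can be written down as an explicit policy. The sketch below is illustrative only: the policy names and the 25% caution threshold are assumptions, and each team would negotiate its own values.

```typescript
type DeployPolicy = "ship-freely" | "ship-with-caution" | "feature-freeze";

// Maps remaining error budget (0-100%) to a deployment policy.
// cautionBelowPercent is a hypothetical default, not an industry standard.
function deployPolicy(
  remainingPercent: number,
  cautionBelowPercent: number = 25
): DeployPolicy {
  if (remainingPercent <= 0) return "feature-freeze";       // budget exhausted
  if (remainingPercent < cautionBelowPercent) return "ship-with-caution";
  return "ship-freely";
}

const earlyInWindow = deployPolicy(80); // "ship-freely"
const afterIncident = deployPolicy(10); // "ship-with-caution"
const exhausted = deployPolicy(0);      // "feature-freeze"
```

Writing the thresholds down ahead of time is what keeps the speed-versus-reliability decision from being argued from scratch after every incident.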

Burn Rate Alerts: Acting Before You Breach Your SLO

A burn rate alert fires when you're consuming your error budget faster than your SLO window allows. If you're 5 days into a 30-day window and you've already consumed 30% of your budget, you're burning at 1.8x the sustainable rate (30% consumed against the 16.7% that 5 of 30 days would allow). If that rate continues, you'll exhaust your budget in about 12 more days, well before the window closes.
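
The numbers in this example can be reproduced directly. A minimal sketch with hypothetical function names, taking the consumed budget as a fraction between 0 and 1:

```typescript
// Burn rate: fraction of budget consumed divided by fraction of window elapsed.
function burnRate(
  budgetConsumedFraction: number,
  elapsedDays: number,
  windowDays: number
): number {
  if (elapsedDays <= 0 || windowDays <= 0) {
    throw new RangeError("elapsedDays and windowDays must be positive");
  }
  return budgetConsumedFraction / (elapsedDays / windowDays);
}

// Days until the budget reaches zero if the current pace continues.
function daysUntilExhaustion(budgetConsumedFraction: number, elapsedDays: number): number {
  if (elapsedDays <= 0) throw new RangeError("elapsedDays must be positive");
  if (budgetConsumedFraction <= 0) return Infinity; // nothing is being spent
  const consumedPerDay = budgetConsumedFraction / elapsedDays;
  return (1 - budgetConsumedFraction) / consumedPerDay;
}

const rate = burnRate(0.30, 5, 30);             // about 1.8
const daysLeft = daysUntilExhaustion(0.30, 5);  // about 11.7
```

At 6% of the budget per day, the remaining 70% lasts a little under 12 days, so the budget would run out around day 17 of the 30-day window.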

Google's Site Reliability Workbook recommends a multi-window, multi-burn-rate alert strategy: a fast burn alert for high burn rates (14.4x, 1-hour window) catches catastrophic incidents, and a slow burn alert (6x, 6-hour window) catches gradual degradation. At those rates, the fast alert fires after roughly 2% of a 30-day budget is gone and the slow alert after roughly 5%. The combination covers both sudden outages and creeping reliability erosion that would otherwise slip past end-of-month reviews.
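
A sketch of how such rules might be evaluated follows. The rule shape, function names, and the per-window error-rate input are assumptions for illustration; thresholds are plain configuration, and the example values use the pairing from the SRE Workbook's table (14.4x over 1 hour, 6x over 6 hours). The Workbook's full recipe also checks a shorter confirmation window per rule, which is omitted here for brevity.

```typescript
interface BurnRateRule {
  name: string;
  windowHours: number;
  threshold: number; // burn-rate multiple that triggers the alert
}

// Burn rate over a window = observed error rate / error rate the SLO allows.
function windowBurnRate(errorRate: number, sloTarget: number): number {
  if (sloTarget >= 1) throw new RangeError("sloTarget must be below 1");
  return errorRate / (1 - sloTarget);
}

// errorRateByWindow maps a window length in hours to the error rate measured
// over that window (hypothetical input; a metrics backend would supply it).
function firingAlerts(
  rules: BurnRateRule[],
  errorRateByWindow: Map<number, number>,
  sloTarget: number
): string[] {
  return rules
    .filter(rule => {
      const errorRate = errorRateByWindow.get(rule.windowHours);
      return errorRate !== undefined &&
        windowBurnRate(errorRate, sloTarget) >= rule.threshold;
    })
    .map(rule => rule.name);
}

const rules: BurnRateRule[] = [
  { name: "fast-burn", windowHours: 1, threshold: 14.4 },
  { name: "slow-burn", windowHours: 6, threshold: 6 },
];

// Sudden outage: 2% errors in the last hour, 0.7% over six hours, 99.9% SLO.
const outage = firingAlerts(rules, new Map([[1, 0.02], [6, 0.007]]), 0.999);
// Gradual degradation: 0.8% errors in the last hour, 0.7% over six hours.
const creeping = firingAlerts(rules, new Map([[1, 0.008], [6, 0.007]]), 0.999);
```

In the outage case both rules fire; in the degradation case only the slow-burn rule does, which is the kind of problem a single high threshold would miss.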

Burn rate alerts take more setup than static thresholds, but they are among the most valuable alerting primitives available. Teams running static threshold alerts today will typically see better alert precision after switching: fewer false positives, alerts tied directly to actual SLO risk, and a quantified error budget cost for each incident.

About the Author

The core engineering, site reliability, and developer relations team behind ObservabilityOS. We build AI-native observability infrastructure to eliminate 3 AM firefighting.