Log Anomaly Detection: Z-Score vs Machine Learning Approaches

A technical comparison of statistical Z-score baselines versus ML-based anomaly detection for production log monitoring. When to use each approach and how they complement each other in a hybrid system.

ObservabilityOS Team

Core Engineering & DevRel

June 14, 2026 · 8 min read

Why Anomaly Detection Matters

Static threshold alerts are a common cause of alert fatigue. A CPU spike at 3 AM during a scheduled backup job triggers a page even though the system is healthy, because a fixed threshold has no notion of the backup. Anomaly detection instead compares each reading against the system's own recent behavior, so expected patterns stop paging people. ObservabilityOS uses a hybrid approach: statistical Z-score analysis for real-time detection and ML models for pattern recognition over longer time windows.

Z-Score: Fast and Interpretable

The Z-score measures how many standard deviations a data point lies from the rolling mean: z = (x - mean) / stdDev, computed over a window of recent values. For normally distributed data, about 99.7% of values fall within 3 standard deviations of the mean, so an absolute Z-score above 3 is a reasonable signal that something unusual happened. Z-scores are lightweight, explainable, and work well for metrics with approximately normal distributions: error rates, latency, request throughput. The key advantage is speed and interpretability: you can show engineers exactly why an alert fired.

typescript
function calculateZScore(value: number, window: number[]): number {
  // Too few samples for a stable baseline; 0 here means "no signal".
  if (window.length < 30) return 0;
  const mean = window.reduce((a, b) => a + b, 0) / window.length;
  // Population variance over the rolling window.
  const variance = window.reduce((s, v) => s + (v - mean) ** 2, 0) / window.length;
  const stdDev = Math.sqrt(variance);
  // A perfectly flat window has no spread, so the score is undefined; report 0.
  return stdDev === 0 ? 0 : (value - mean) / stdDev;
}
// |Z| > 3.0 → anomaly | |Z| > 4.0 → critical
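To show how the function and its thresholds might be applied to a live stream, here is a minimal sketch of a rolling detector. The names `createRollingDetector`, `scoreAgainst`, and `Detection` are hypothetical and not part of ObservabilityOS. The sketch also makes two assumptions of its own: it returns `null` rather than 0 when no score can be computed, and it compares the absolute value of the score against the 3.0 and 4.0 thresholds so that sudden drops are caught as well as spikes.

```typescript
type Severity = "ok" | "anomaly" | "critical";

interface Detection {
  z: number | null; // null when no baseline is available yet
  severity: Severity;
}

// Thresholds taken from the comment above; the minimum sample count
// matches the 30-sample guard in calculateZScore.
const ANOMALY_Z = 3.0;
const CRITICAL_Z = 4.0;
const MIN_SAMPLES = 30;

// Same math as calculateZScore, but returns null instead of 0 when the
// score is undefined, so callers can tell "no baseline" from "normal".
function scoreAgainst(value: number, history: readonly number[]): number | null {
  if (history.length < MIN_SAMPLES) return null;
  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const variance = history.reduce((s, v) => s + (v - mean) ** 2, 0) / history.length;
  const stdDev = Math.sqrt(variance);
  return stdDev === 0 ? null : (value - mean) / stdDev;
}

// Hypothetical helper: keeps the last `size` values and classifies each new one.
function createRollingDetector(size: number): (value: number) => Detection {
  const history: number[] = [];
  return (value: number): Detection => {
    // Score against past values only, so a spike cannot mask itself.
    const z = scoreAgainst(value, history);
    history.push(value);
    if (history.length > size) history.shift();
    const magnitude = z === null ? 0 : Math.abs(z);
    const severity: Severity =
      magnitude > CRITICAL_Z ? "critical" : magnitude > ANOMALY_Z ? "anomaly" : "ok";
    return { z, severity };
  };
}
```

One design choice to be aware of: this sketch adds every value to the history, including anomalous ones, so a sustained incident gradually becomes the new baseline. Whether to exclude flagged values from the window is a judgment call that depends on the metric.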

ML Approaches: Pattern Recognition at Scale

Machine learning models excel at detecting subtle patterns Z-scores miss: gradual drift, seasonal correlations across multiple services, and multi-dimensional anomalies where no single metric crosses a threshold but the combination of several is unprecedented. However, ML models require training data, more compute, and are harder to debug — you can't explain a neural network's output in one sentence.
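The multi-dimensional case is the easiest of these to make concrete. The sketch below is not an ML model and not the ObservabilityOS implementation; it uses the Mahalanobis distance, a classical multivariate statistic, as a simple stand-in to show how two metrics can each sit within 3 standard deviations of their own mean while the pair is far outside the joint pattern. The names `Point`, `Baseline2D`, `fitBaseline`, and `mahalanobis` are hypothetical.

```typescript
type Point = readonly [number, number];

interface Baseline2D {
  meanX: number;
  meanY: number;
  varX: number;
  varY: number;
  covXY: number;
}

// Estimate means, variances, and covariance from historical (x, y) pairs.
function fitBaseline(points: readonly Point[]): Baseline2D {
  const n = points.length;
  if (n < 2) throw new Error("need at least two points");
  let sumX = 0;
  let sumY = 0;
  for (const [x, y] of points) {
    sumX += x;
    sumY += y;
  }
  const meanX = sumX / n;
  const meanY = sumY / n;
  let varX = 0;
  let varY = 0;
  let covXY = 0;
  for (const [x, y] of points) {
    varX += (x - meanX) ** 2;
    varY += (y - meanY) ** 2;
    covXY += (x - meanX) * (y - meanY);
  }
  return { meanX, meanY, varX: varX / n, varY: varY / n, covXY: covXY / n };
}

// Mahalanobis distance, using the closed-form inverse of the 2x2 covariance matrix.
function mahalanobis(b: Baseline2D, [x, y]: Point): number {
  const det = b.varX * b.varY - b.covXY ** 2;
  // Degenerate baseline (flat or perfectly correlated): treated as "no signal",
  // which is an assumption of this sketch.
  if (det <= 0) return 0;
  const dx = x - b.meanX;
  const dy = y - b.meanY;
  const d2 = (b.varY * dx * dx - 2 * b.covXY * dx * dy + b.varX * dy * dy) / det;
  return Math.sqrt(d2);
}
```

If request rate and CPU usually rise together, a reading with a slightly high request rate and slightly low CPU produces a large distance here even though neither per-metric Z-score is remarkable. A trained model generalizes the same idea to many metrics and non-linear relationships, at the cost of the training data and compute described above.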

ObservabilityOS uses ML for weekly trend analysis and capacity planning — longer time horizons where model training latency is acceptable. Z-scores handle real-time incident detection where you need a result in milliseconds, not seconds. This hybrid gives you the speed of statistical methods for immediate alerting and the depth of ML for proactive insights.
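To illustrate the kind of question the slower, longer-horizon path answers, here is a minimal sketch of trend-based capacity projection. It uses an ordinary least-squares line as a stand-in for whatever model a production system would train, and it assumes evenly spaced samples such as one aggregate per day. The function names `fitLinearTrend` and `stepsUntilCapacity` are hypothetical and do not describe the ObservabilityOS implementation.

```typescript
// Least-squares slope and intercept for evenly spaced samples (x = 0, 1, 2, ...).
function fitLinearTrend(values: readonly number[]): { slope: number; intercept: number } {
  const n = values.length;
  if (n < 2) throw new Error("need at least two samples");
  const meanX = (n - 1) / 2;
  const meanY = values.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  values.forEach((y, x) => {
    num += (x - meanX) * (y - meanY);
    den += (x - meanX) ** 2;
  });
  const slope = num / den;
  return { slope, intercept: meanY - slope * meanX };
}

// Number of steps after the last sample until the fitted line reaches `capacity`.
// Returns null when the trend is flat or falling, and 0 when the line is already there.
function stepsUntilCapacity(values: readonly number[], capacity: number): number | null {
  const { slope, intercept } = fitLinearTrend(values);
  if (slope <= 0) return null;
  const crossing = (capacity - intercept) / slope; // x at which the line hits capacity
  return Math.max(0, crossing - (values.length - 1));
}
```

A steady climb like this is exactly the gradual drift a rolling Z-score tends to miss, because the rolling mean moves along with the data and no single point looks unusual. Running the trend fit on a schedule and the Z-score check on every sample keeps each method on the time horizon it suits.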

About the Author

The core engineering, site reliability, and developer relations team behind ObservabilityOS. We build AI-native observability infrastructure to eliminate 3 AM firefighting.