Why Anomaly Detection Matters
Static threshold alerts are a major source of alert fatigue. A CPU spike at 3 AM during a scheduled backup job triggers a page even though the system is healthy, because the threshold doesn't know about the backup. Anomaly detection adapts to your system's actual behavior patterns instead of firing on fixed numbers. ObservabilityOS uses a hybrid approach: statistical Z-score analysis for real-time detection and ML models for pattern recognition over longer time windows.
Z-Score: Fast and Interpretable
The Z-score measures how many standard deviations a data point lies from the rolling mean. A Z-score above 3 is a strong anomaly signal: under a normal distribution, only about 0.3% of points fall more than 3 standard deviations from the mean. Z-scores are lightweight, explainable, and work well for metrics with approximately normal distributions, such as error rates, latency, and request throughput. The key advantage is speed and interpretability: you can show engineers exactly why an alert fired.
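To make the arithmetic concrete, here is a worked example with made-up numbers (they are illustrative, not measurements): suppose the rolling window of latency samples has mean 120 ms and standard deviation 15 ms, and a new sample arrives at 180 ms.

```latex
z = \frac{x - \mu}{\sigma} = \frac{180 - 120}{15} = 4.0
```

A sample at 150 ms would score z = 2.0 against the same window and would not fire, which is exactly the kind of explanation you can hand to an engineer.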
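// Z-score of `value` relative to a rolling window of recent samples.
function calculateZScore(value: number, window: number[]): number {
  if (window.length < 30) return 0; // insufficient data
  const mean = window.reduce((a, b) => a + b, 0) / window.length;
  // Population variance over the window
  const variance = window.reduce((s, v) => s + (v - mean) ** 2, 0) / window.length;
  const stdDev = Math.sqrt(variance);
  // A flat window has no spread, so report 0 rather than dividing by zero
  return stdDev === 0 ? 0 : (value - mean) / stdDev;
}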
// Z > 3.0 → anomaly | Z > 4.0 → critical
ML Approaches: Pattern Recognition at Scale
Machine learning models excel at detecting subtle patterns that Z-scores miss: gradual drift, seasonal correlations across multiple services, and multi-dimensional anomalies where no single metric crosses a threshold but the combination of several is unprecedented. However, ML models require training data and more compute, and they are harder to debug: you can't explain a neural network's output in one sentence.
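The multi-dimensional case is easier to see with a small example. The sketch below is not an ML model and is not ObservabilityOS's implementation; it uses a classical statistic, the Mahalanobis distance for two metrics, purely to illustrate how a pair of readings can be unremarkable individually and still be far from anything seen together. The function names, the metric names, and the sample data are all hypothetical.

```typescript
interface JointScore {
  zx: number;       // per-metric z-score of the first metric
  zy: number;       // per-metric z-score of the second metric
  distance: number; // Mahalanobis distance of the pair from the joint history
}

// Population mean and standard deviation of a series.
function meanAndStd(xs: number[]): { mean: number; std: number } {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = xs.reduce((s, v) => s + (v - mean) ** 2, 0) / xs.length;
  return { mean, std: Math.sqrt(variance) };
}

// Scores a new reading (x, y) against paired histories xs and ys.
// For two metrics with correlation rho, the squared Mahalanobis distance is
// (zx^2 - 2*rho*zx*zy + zy^2) / (1 - rho^2).
function jointScore(xs: number[], ys: number[], x: number, y: number): JointScore {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("histories must have equal length and at least 2 samples");
  }
  const sx = meanAndStd(xs);
  const sy = meanAndStd(ys);
  if (sx.std === 0 || sy.std === 0) {
    throw new Error("a metric with no variance has no z-score");
  }
  const cov =
    xs.reduce((s, v, i) => s + (v - sx.mean) * (ys[i] - sy.mean), 0) / xs.length;
  const rho = cov / (sx.std * sy.std);
  const denom = 1 - rho * rho;
  if (denom <= 1e-12) {
    throw new Error("metrics are perfectly correlated; distance is undefined");
  }
  const zx = (x - sx.mean) / sx.std;
  const zy = (y - sy.mean) / sy.std;
  const d2 = (zx * zx - 2 * rho * zx * zy + zy * zy) / denom;
  return { zx, zy, distance: Math.sqrt(d2) };
}

// Hypothetical history: requests/sec and CPU % that rise and fall together.
const traffic = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190];
const cpu = [21, 22, 25, 25, 29, 30, 31, 35, 35, 38];

// Busy but consistent: high traffic with correspondingly high CPU.
const typical = jointScore(traffic, cpu, 180, 35);
// Inconsistent: high CPU while traffic is low.
const odd = jointScore(traffic, cpu, 120, 36);
```

In this made-up data, both readings have per-metric Z-scores well under 3, so a per-metric check would stay quiet for either one. Only the second reading has a large joint distance, because high CPU at low traffic breaks the relationship the history established.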
ObservabilityOS uses ML for weekly trend analysis and capacity planning, where the longer time horizons make model training latency acceptable. Z-scores handle real-time incident detection, where you need a result in milliseconds, not seconds. This hybrid gives you the speed of statistical methods for immediate alerting and the depth of ML for proactive insights.
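As a rough illustration of why the statistical path is cheap enough for real-time use, here is a minimal sketch of a streaming detector that keeps running sums so the window is not re-summed on every sample. The class and function names are hypothetical, and this is not ObservabilityOS's actual implementation. The 30-sample minimum and the 3.0 and 4.0 thresholds come from the code above; taking the absolute value, so that sudden drops are flagged as well as spikes, is an added assumption.

```typescript
type Severity = "ok" | "anomaly" | "critical";

// Hypothetical streaming detector: a fixed-size window with running sums.
class RollingZScore {
  private values: number[] = [];
  private sum = 0;
  private sumSq = 0;
  private size: number;
  private minSamples: number;

  constructor(size: number, minSamples: number = 30) {
    this.size = size;
    this.minSamples = minSamples;
  }

  // Scores `value` against the samples already in the window, then adds it.
  push(value: number): number {
    const z = this.score(value);
    this.values.push(value);
    this.sum += value;
    this.sumSq += value * value;
    if (this.values.length > this.size) {
      const oldest = this.values.shift()!; // evict the oldest sample
      this.sum -= oldest;
      this.sumSq -= oldest * oldest;
    }
    return z;
  }

  private score(value: number): number {
    const n = this.values.length;
    if (n < this.minSamples) return 0; // insufficient data
    const mean = this.sum / n;
    // Sum-of-squares form of the population variance. It is fast but can lose
    // precision for very large values, so clamp tiny negatives to zero.
    const variance = Math.max(0, this.sumSq / n - mean * mean);
    const stdDev = Math.sqrt(variance);
    return stdDev === 0 ? 0 : (value - mean) / stdDev;
  }
}

// Maps a Z-score to a severity. Using |z| is an assumption of this sketch.
function classify(z: number): Severity {
  const magnitude = Math.abs(z);
  if (magnitude > 4.0) return "critical";
  if (magnitude > 3.0) return "anomaly";
  return "ok";
}

// Usage: warm the window with steady readings, then score a spike.
const detector = new RollingZScore(60);
for (let i = 0; i < 40; i++) {
  detector.push(i % 2 === 0 ? 99 : 101);
}
const spikeSeverity = classify(detector.push(110));
```

Each sample is scored before it joins the window, so a spike cannot dilute its own baseline. A production version would likely need a numerically sturdier variance update and a queue that evicts in constant time.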
About the Author
ObservabilityOS Team
Core Engineering & DevRel
The core engineering, site reliability, and developer relations team behind ObservabilityOS. We build AI-native observability infrastructure to eliminate 3 AM firefighting.