The Fundamental Problem with AWS CloudWatch
CloudWatch is deeply embedded in the AWS ecosystem — it receives logs from Lambda, ECS, EKS, RDS, and dozens of other services without any configuration. This zero-friction integration is its greatest strength and the reason most AWS teams use it by default. But CloudWatch was built as a monitoring service for AWS itself, not as a developer-focused observability platform.
CloudWatch's pricing model is where teams first encounter friction. Custom metrics cost $0.30 per metric per month, and each unique combination of metric name and dimensions counts as a separate custom metric. Log storage costs $0.03 per GB per month. Logs Insights queries cost $0.005 per GB scanned, so a complex query that runs repeatedly across a large log volume adds up quickly. At scale, CloudWatch bills can reach thousands of dollars per month in a way that's difficult to predict or control.
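To make the arithmetic concrete, here is a rough estimator built only from the three rates quoted above. The workload numbers (20 metric names, 50 dimension combinations, 2,000 GB stored, 100 queries a day scanning 50 GB each) are hypothetical, and the sketch ignores ingestion charges, pricing tiers, and regional differences, so treat the result as an illustration rather than a bill forecast.

```python
# Rates as quoted in this article; real bills also depend on region,
# pricing tiers, and charges not modeled here (for example ingestion).
CUSTOM_METRIC_USD_PER_MONTH = 0.30
LOG_STORAGE_USD_PER_GB_MONTH = 0.03
INSIGHTS_USD_PER_GB_SCANNED = 0.005


def count_custom_metrics(metric_names: int, dimension_combinations: int) -> int:
    """Each metric name + dimension combination is billed as its own metric."""
    return metric_names * dimension_combinations


def estimate_monthly_cost(custom_metrics: int, stored_gb: float, scanned_gb: float) -> float:
    """Sum the three line items discussed above, rounded to cents."""
    total = (
        custom_metrics * CUSTOM_METRIC_USD_PER_MONTH
        + stored_gb * LOG_STORAGE_USD_PER_GB_MONTH
        + scanned_gb * INSIGHTS_USD_PER_GB_SCANNED
    )
    return round(total, 2)


if __name__ == "__main__":
    # Hypothetical workload: 20 metric names x 50 dimension combinations,
    # 2,000 GB of stored logs, 100 queries/day x 30 days x 50 GB per query.
    metrics = count_custom_metrics(20, 50)
    scanned = 100 * 30 * 50
    print(estimate_monthly_cost(metrics, stored_gb=2000, scanned_gb=scanned))
```

Under these assumptions the total is about $1,110 per month, and the Logs Insights scans ($750) outweigh both the metrics ($300) and the storage ($60). That is the line item that grows with how often people query, which is part of why the bill is hard to predict.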
The developer experience problems are harder to quantify but more impactful day-to-day. CloudWatch Logs Insights has a proprietary query language that is not SQL and is not particularly intuitive. Dashboard creation is tedious. Alerts require navigating multiple menus and cannot easily express complex conditions. The built-in anomaly detection does not work well out of the box and needs significant tuning, and there is no AI-powered root cause analysis.
What Good Observability Actually Looks Like
Good observability means: (1) you get an alert within 60 seconds of something going wrong, (2) you can understand what caused it within 5 minutes, (3) you have enough context to fix it without needing to reproduce the issue, and (4) you automatically have documentation of what happened for the post-mortem. CloudWatch delivers number 1, partially delivers number 2, and does not deliver 3 or 4.
The key capabilities that separate modern observability from basic monitoring: structured log search with full-text and field-level querying, distributed tracing across service boundaries, AI-powered anomaly detection that adapts to your traffic patterns, automatic correlation of deploys with metric changes, and plain-English incident summaries that do not require SRE expertise to interpret.
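Field-level querying depends on logs being emitted as structured records in the first place. The following is a minimal sketch using Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the `fields` key passed through `extra` are illustrative names, not part of any platform's SDK.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge caller-supplied fields (passed via extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("structured-example")
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.propagate = False
    return logger


if __name__ == "__main__":
    import sys

    log = build_logger(sys.stdout)
    log.info("payment failed", extra={"fields": {"service": "checkout", "status": 502}})
```

Because `service` and `status` arrive as separate JSON keys, a backend can filter on them directly instead of pattern-matching the message text.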
The Alternatives: An Honest Assessment
Datadog is the market leader for a reason: it has the deepest AWS integrations, the most mature alerting capabilities, and an excellent product for teams with dedicated SRE resources. The problem is that for teams of fewer than 50 engineers, the cost-to-value ratio is poor. You're paying for capabilities your team will never configure or use.
Grafana + Prometheus is the open-source answer, and for teams with DevOps expertise who want full control, it's excellent. The real cost is engineer time: someone has to install, configure, maintain, upgrade, and debug the stack. A production-grade Grafana setup requires a Prometheus instance, a Loki instance for logs, Alertmanager for alert routing, and dashboards built from scratch. Expect 40–60 hours of initial setup, plus ongoing maintenance after that. These are real costs that are rarely counted against the 'free' price tag.
Better Stack (formerly Logtail) offers fast log search with a cleaner UI than CloudWatch and reasonable pricing. It is excellent for teams that primarily need better log search. It is not a full observability platform — there's limited metrics support, no distributed tracing, and no AI root cause analysis.
- Datadog: Best for large teams ($5,000+/mo). Deep features, complex pricing. Requires SRE to configure.
- Grafana + Prometheus: Best for teams with DevOps expertise. Free to license, expensive in engineer time.
- Better Stack: Best for log search upgrade from CloudWatch. Not a full observability platform.
- New Relic: Good APM heritage, confusing pricing model. Declining market share.
- ObservabilityOS: Best for Series A/B teams. AI-native, zero config, flat $99/mo pricing.
Who Should Stay on CloudWatch
CloudWatch is the right choice if: your infrastructure is entirely within AWS managed services (Lambda, Fargate, RDS), you have minimal custom application metrics, your team has no on-call incidents to investigate, and your log volume is low enough that CloudWatch pricing is not painful. In these scenarios, the zero-configuration advantage of native AWS integration outweighs the platform's limitations.
CloudWatch also makes sense as a data source rather than the primary interface. Many teams run CloudWatch for log aggregation (because it's automatic for AWS services) while using a different platform for search, alerting, and analysis. This hybrid approach gets the best of native AWS integration without accepting CloudWatch's developer experience limitations.
Migration Path: Moving Off CloudWatch in Stages
A full rip-and-replace migration is risky and unnecessary. A staged approach works better. First, add a log shipper to forward CloudWatch logs to your new platform while keeping CloudWatch as a backup. Run both systems in parallel for 2–4 weeks to verify coverage. Then migrate alerting to the new platform. Finally, once the new platform has run reliably for 60 days, reduce CloudWatch retention to 7 days (minimum for some services) and stop paying for long-term log storage there.
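The final stage, shortening retention, can be scripted. The sketch below assumes a boto3 CloudWatch Logs client is passed in (for example `boto3.client("logs")`) and calls its `put_retention_policy` operation. The `shorten_retention` helper and its `dry_run` flag are illustrative, and the log group names are placeholders.

```python
from typing import Iterable, List


def shorten_retention(logs_client, log_group_names: Iterable[str],
                      days: int = 7, dry_run: bool = False) -> List[str]:
    """Set the retention period on each log group and return the names handled.

    logs_client is expected to behave like boto3.client("logs").
    With dry_run=True nothing is changed; the names are only collected.
    """
    handled = []
    for name in log_group_names:
        if not dry_run:
            logs_client.put_retention_policy(
                logGroupName=name,
                retentionInDays=days,
            )
        handled.append(name)
    return handled
```

A typical call would be `shorten_retention(boto3.client("logs"), ["/aws/lambda/my-function"], days=7)`. Running it first with `dry_run=True` gives a list to review before any retention setting is actually changed.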
For AWS-native services like Lambda and ECS, logs will continue flowing through CloudWatch regardless — these services write to CloudWatch by default. Your log shipper will forward from CloudWatch to your new platform. For applications where you control the logging configuration (EC2, EKS), you can add a log exporter that bypasses CloudWatch entirely and sends directly to your observability backend.
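For the path where logs still pass through CloudWatch, one common shipper is a Lambda function attached to the log group through a subscription filter. CloudWatch delivers each batch as base64-encoded, gzip-compressed JSON under `event["awslogs"]["data"]`. The sketch below decodes that envelope and posts the records to the ingest endpoint used in this article. The flattened record shape, the `OBSERVABILITYOS_API_KEY` environment variable, and the assumption that the endpoint accepts a JSON array are all illustrative, not a documented contract.

```python
import base64
import gzip
import json
import os
import urllib.request

INGEST_URL = "https://ingest.observabilityos.com/v1/logs"


def decode_subscription_event(event: dict) -> list:
    """Unpack a CloudWatch Logs subscription payload into flat records."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    # CONTROL_MESSAGE batches are health checks and carry no log data.
    if payload.get("messageType") != "DATA_MESSAGE":
        return []
    return [
        {
            "log_group": payload["logGroup"],
            "log_stream": payload["logStream"],
            "timestamp": e["timestamp"],
            "message": e["message"],
        }
        for e in payload["logEvents"]
    ]


def build_request(records: list, api_key: str, url: str = INGEST_URL) -> urllib.request.Request:
    """Build the HTTPS POST; the body format is an assumption."""
    return urllib.request.Request(
        url,
        data=json.dumps(records).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def handler(event, context):
    """Lambda entry point: decode the batch and forward it."""
    records = decode_subscription_event(event)
    if records:
        request = build_request(records, os.environ["OBSERVABILITYOS_API_KEY"])
        with urllib.request.urlopen(request, timeout=10) as response:
            response.read()
    return {"forwarded": len(records)}
```

Keeping the decoding separate from the HTTP call makes the parsing testable without network access, and lets the forwarding step be swapped for a different backend.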
# CloudWatch to ObservabilityOS log forwarding
# Fluent Bit http output: sends matched records to the ingest endpoint over TLS
# Only the [OUTPUT] section is shown; add an [INPUT] section for your log source
[OUTPUT]
    Name        http
    Match       *
    Host        ingest.observabilityos.com
    Port        443
    URI         /v1/logs
    Format      json
    Header      Authorization Bearer YOUR_API_KEY
    tls         On
    tls.verify  On

Stop debugging production in the dark
ObservabilityOS gives every engineer AI-powered incident intelligence. Zero config. Connects in 5 minutes.
About the Author
ObservabilityOS Team
Core Engineering & DevRel
The core engineering, site reliability, and developer relations team behind ObservabilityOS. We build AI-native observability infrastructure to eliminate 3 AM firefighting.