The Problem Distributed Tracing Solves
A user clicks 'Buy Now.' Your API gateway receives the request. It calls the inventory service to check stock. That calls the product service for pricing. The order service creates a record. The payment service charges the card. The notification service sends a confirmation email. Six services, six logs, six sets of metrics, all for one user action.
When this workflow takes 4 seconds instead of 400ms, which service is slow? Logs will tell you each service handled a request, but not which request is the slow one or how they connect. Metrics will tell you aggregate latency across all requests, but not which individual request landed in the P99 tail or where it spent its time. Distributed tracing is the solution: a single view of the entire request's journey across every service it touched, with timing information for each step.
Distributed tracing is particularly powerful for debugging problems that span service boundaries: cascading failures (service A is slow because service B is slow because service C has a missing index), request fan-out inefficiencies (10 parallel calls where 3 would suffice), and tail latency issues (99th percentile slowness caused by occasional lock contention in a downstream service).
How Traces Work: Spans and Context
A trace is a collection of spans that represent a single transaction through a distributed system. A span represents one unit of work within the trace: an HTTP request handled by a service, a database query, a cache lookup, an external API call. Every span has a unique span ID, the trace ID it belongs to, a parent span ID (the span that triggered this one; absent on the root span), a start timestamp, a duration, a status (unset, OK, or error), and attributes (arbitrary key-value metadata).
The relationship between spans is a directed acyclic graph (DAG): the root span (the first span in the trace) has no parent, and every subsequent span has exactly one parent. This creates a tree structure that mirrors the call graph of your request. When rendered as a flame graph, you can immediately see which spans are sequential and which are parallel, and where the time is going.
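The parent/child relationship described above is enough to rebuild the call tree from a flat list of spans. The following is a minimal sketch, not how any particular backend does it; `FlatSpan`, `SpanNode`, and `buildSpanTree` are hypothetical names, and the span names are made up for illustration.

```typescript
// Hypothetical minimal span shape; real spans carry timing, status, and attributes too.
interface FlatSpan {
  spanId: string;
  parentSpanId?: string; // undefined for the root span
  name: string;
}

interface SpanNode extends FlatSpan {
  children: SpanNode[];
}

// Rebuild the call tree from the flat list of spans that share one trace ID.
function buildSpanTree(spans: FlatSpan[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  for (const s of spans) {
    nodes.set(s.spanId, { ...s, children: [] });
  }

  const roots: SpanNode[] = [];
  for (const s of spans) {
    const node = nodes.get(s.spanId)!;
    const parent = s.parentSpanId ? nodes.get(s.parentSpanId) : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      // Either the root span, or an orphan whose parent was never received.
      roots.push(node);
    }
  }
  return roots;
}

const exampleTree = buildSpanTree([
  { spanId: 'a', name: 'POST /api/checkout' },
  { spanId: 'b', parentSpanId: 'a', name: 'inventory.check' },
  { spanId: 'c', parentSpanId: 'a', name: 'payment.charge' },
  { spanId: 'd', parentSpanId: 'b', name: 'mongodb.find' },
]);
```

A healthy trace yields exactly one root. More than one entry in the returned array usually means a parent span was dropped or never exported.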
// Anatomy of a span: a simplified view of what OpenTelemetry records
interface Span {
  traceId: string;        // "4bf92f3577b34da6a3ce929d0e0e4736", same across all spans in the trace
  spanId: string;         // "00f067aa0ba902b7", unique per span
  parentSpanId?: string;  // "a2fb4a1d1a96d312", who triggered this span (absent on the root span)
  name: string;           // "POST /api/checkout" or "mongodb.find" or "redis.get"
  startTime: number;      // epoch nanoseconds
  endTime: number;        // epoch nanoseconds
  status: 'UNSET' | 'OK' | 'ERROR'; // UNSET is the default until the code sets a status
  attributes: {
    'http.method': 'POST',
    'http.url': '/api/checkout',
    'http.status_code': 200,
    'db.system': 'mongodb',
    'db.operation': 'find',
    // ... any custom attributes
  };
  events: SpanEvent[];    // timestamped annotations within the span
}
Context Propagation: How Spans Connect Across Services
Context propagation is the mechanism that makes distributed tracing work across service boundaries. When service A makes an HTTP request to service B, it includes special headers that tell service B which trace and parent span this request belongs to. Service B reads these headers, creates a child span with the correct parent, and passes the context along to any services it calls.
The W3C TraceContext specification standardizes these headers: traceparent contains the trace ID and parent span ID in a standardized format, and tracestate carries vendor-specific trace state. OpenTelemetry uses W3C TraceContext as its default propagation format, and most modern observability platforms support it.
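To make the mechanism concrete, here is a minimal sketch of what a service does with an incoming traceparent value: validate it, read the trace ID and parent span ID, and emit a new header that names its own span as the parent. It handles only the four-field layout of version 00, drops any flag bits other than the sampled bit, and ignores tracestate. `parseTraceParent` and `childTraceParent` are hypothetical helper names; in practice the OpenTelemetry propagator does this for you.

```typescript
interface TraceParent {
  version: string;
  traceId: string;      // 32 lowercase hex chars
  parentSpanId: string; // 16 lowercase hex chars
  sampled: boolean;     // lowest bit of the trace-flags byte
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything that is not a well-formed four-field traceparent.
function parseTraceParent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  const [, version, traceId, parentSpanId, flags] = m;
  // The specification treats version "ff" and all-zero IDs as invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) {
    return null;
  }
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// What a service sends downstream: the same trace ID, with its own span ID as the new parent.
function childTraceParent(incoming: TraceParent, ownSpanId: string): string {
  const flags = incoming.sampled ? '01' : '00';
  return `00-${incoming.traceId}-${ownSpanId}-${flags}`;
}

const incomingHeader = parseTraceParent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
const outgoingHeader = incomingHeader ? childTraceParent(incomingHeader, 'b7ad6b7169203331') : null;
```

The trace ID never changes along the request path; only the parent span ID is rewritten at each hop, which is what lets the backend stitch the spans into one tree.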
// W3C TraceContext header format
// traceparent: {version}-{trace-id}-{parent-span-id}-{trace-flags}
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
//           ^  ^                                ^                ^
//           |  trace ID (32 hex chars)          parent span ID   sampled flag
//           version
// With OpenTelemetry auto-instrumentation enabled, this propagation is automatic.
// When you call fetch() or axios.get(), OTel injects traceparent into the outgoing request.
// When your Express server receives a request, OTel extracts it from the incoming headers.
// You don't write a single line of propagation code.
Implementing Your First Trace in Node.js
With OpenTelemetry auto-instrumentation enabled, you get trace spans automatically for every HTTP request, Express route, MongoDB query, and Redis command. But auto-instrumentation only captures framework-level operations. Your business logic — the part that makes your application unique — needs manual instrumentation.
// Adding tracing to a business logic function
// CartItem, Order, checkInventory, processPayment, and saveOrder are defined elsewhere in the service.
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service', '1.0.0');
export async function createOrder(
  userId: string,
  items: CartItem[]
): Promise<Order> {
  return tracer.startActiveSpan('order.create', async (span) => {
    span.setAttributes({
      'order.user_id': userId,
      'order.item_count': items.length,
      'order.total_cents': items.reduce((sum, i) => sum + i.priceCents, 0),
    });
    try {
      // order.create is the active span here, so any span started inside these calls
      // (by auto-instrumentation or by their own startActiveSpan) becomes its child
      await checkInventory(items);
      await processPayment(userId);
      const order = await saveOrder(userId, items); // MongoDB spans come from auto-instrumentation
      span.addEvent('order.created', { orderId: order.id });
      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (err) {
      // Mark the span as failed and attach the exception before rethrowing
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (err as Error).message,
      });
      span.recordException(err as Error);
      throw err;
    } finally {
      // Always end the span, otherwise it is never exported
      span.end();
    }
  });
}
Reading Flame Graphs: What to Look For
A flame graph visualizes trace spans as horizontal bars on a time axis. The root span stretches across the top. Child spans appear below their parent, beginning at the time they were called. The width of each bar represents its duration. Gaps between child spans under the same parent represent time the parent spent on work that no child span covers: its own computation, uninstrumented calls, or waiting for CPU scheduling.
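The gaps under a parent bar can be quantified as the parent's self time: its duration minus the time covered by at least one child. The sketch below assumes millisecond timestamps and a hypothetical `TimedSpan` shape; overlapping (parallel) children are counted once.

```typescript
interface TimedSpan {
  name: string;
  startMs: number;
  endMs: number;
}

// Self time = parent duration minus the union of its children's intervals.
function selfTimeMs(parent: TimedSpan, children: TimedSpan[]): number {
  // Clip each child to the parent's window and sort by start time.
  const intervals = children
    .map((c) => ({
      start: Math.max(c.startMs, parent.startMs),
      end: Math.min(c.endMs, parent.endMs),
    }))
    .filter((i) => i.end > i.start)
    .sort((a, b) => a.start - b.start);

  let covered = 0;
  let cursor = parent.startMs; // everything before the cursor is already accounted for
  for (const i of intervals) {
    const start = Math.max(i.start, cursor); // skip the part a parallel sibling already covered
    if (i.end > start) {
      covered += i.end - start;
      cursor = i.end;
    }
  }
  return parent.endMs - parent.startMs - covered;
}

const checkoutSelfTime = selfTimeMs(
  { name: 'POST /api/checkout', startMs: 0, endMs: 400 },
  [
    { name: 'inventory.check', startMs: 20, endMs: 120 },
    { name: 'pricing.lookup', startMs: 100, endMs: 180 }, // overlaps the previous call
    { name: 'order.save', startMs: 300, endMs: 350 },
  ],
); // 190: the children cover 210 ms of the 400 ms parent
```

A large self time on a span with many children is a hint that something inside that service is uninstrumented or CPU-bound.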
Three patterns to look for in a flame graph: a single wide span that dominates the trace (one operation is causing all the latency — investigate that service first), a long sequential chain of narrow spans (N+1 query pattern — many calls that could be batched into one), and a span that starts immediately but takes a long time (latency inside a single service rather than a call overhead problem).
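The second pattern, a long chain of narrow sequential spans, can also be detected mechanically. This is a rough heuristic sketch rather than an established algorithm: it looks for the same span name repeating back-to-back among the children of one parent without overlapping. The `SiblingSpan` shape, the function name, and the default threshold of 5 are all assumptions.

```typescript
interface SiblingSpan {
  name: string;
  startMs: number;
  endMs: number;
}

// Reports span names that repeat at least `threshold` times in a row, one after another.
function findSequentialRepeats(
  siblings: SiblingSpan[],
  threshold = 5,
): { name: string; count: number }[] {
  const sorted = [...siblings].sort((a, b) => a.startMs - b.startMs);
  const longestRun = new Map<string, number>();

  let run = 0;
  for (let i = 0; i < sorted.length; i++) {
    const cur = sorted[i];
    const prev = i > 0 ? sorted[i - 1] : undefined;
    // The run continues only if the previous span has the same name and finished first.
    const continues = prev !== undefined && prev.name === cur.name && cur.startMs >= prev.endMs;
    run = continues ? run + 1 : 1;
    longestRun.set(cur.name, Math.max(longestRun.get(cur.name) ?? 0, run));
  }

  const suspects: { name: string; count: number }[] = [];
  longestRun.forEach((count, name) => {
    if (count >= threshold) suspects.push({ name, count });
  });
  return suspects;
}

// Six back-to-back queries followed by one cache read.
const childSpans: SiblingSpan[] = [];
for (let i = 0; i < 6; i++) {
  childSpans.push({ name: 'mongodb.find', startMs: i * 10, endMs: i * 10 + 8 });
}
childSpans.push({ name: 'redis.get', startMs: 60, endMs: 62 });

const nPlusOneSuspects = findSequentialRepeats(childSpans);
```

Parallel calls with the same name overlap in time, so they reset the run and are not reported; that keeps deliberate fan-out from being flagged as an N+1 pattern.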
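The most useful diagnostic view is the 'critical path': the chain of spans, from the root down to a leaf, whose durations actually determine the end-to-end latency. Speeding up a span that is off the critical path (for example, the shorter of two parallel calls) does not make the request any faster. This is where the time is actually going, even if the visual flame graph looks complex. ObservabilityOS highlights the critical path automatically in the trace viewer.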
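One common way to compute a critical path is to walk backwards from the end of each span: the child that finished last before the cursor is the one the parent was waiting on, and the cursor then jumps to that child's start. The sketch below follows that idea under simplifying assumptions (one root, millisecond timestamps, hypothetical span names); it is not a description of how ObservabilityOS or any other viewer implements it.

```typescript
interface PathSpan {
  spanId: string;
  parentSpanId?: string;
  name: string;
  startMs: number;
  endMs: number;
}

// Returns the names of the spans on the critical path, root first.
// Among siblings, the latest-finishing span is listed first.
function criticalPath(spans: PathSpan[]): string[] {
  const root = spans.find((s) => s.parentSpanId === undefined);
  if (!root) return [];

  const childrenOf = new Map<string, PathSpan[]>();
  for (const s of spans) {
    if (s.parentSpanId === undefined) continue;
    const list = childrenOf.get(s.parentSpanId) ?? [];
    list.push(s);
    childrenOf.set(s.parentSpanId, list);
  }

  const path: string[] = [];
  const visit = (span: PathSpan): void => {
    path.push(span.name);
    let cursor = span.endMs;
    const children = [...(childrenOf.get(span.spanId) ?? [])].sort((a, b) => b.endMs - a.endMs);
    for (const child of children) {
      // A child still running at the cursor overlapped with a later-finishing sibling,
      // so it was not what the parent was waiting on.
      if (child.endMs <= cursor) {
        visit(child);
        cursor = child.startMs;
      }
    }
  };
  visit(root);
  return path;
}

const checkoutPath = criticalPath([
  { spanId: 'r', name: 'POST /api/checkout', startMs: 0, endMs: 400 },
  { spanId: 'i', parentSpanId: 'r', name: 'inventory.check', startMs: 10, endMs: 60 },
  { spanId: 'p', parentSpanId: 'r', name: 'payment.charge', startMs: 70, endMs: 380 },
  { spanId: 'n', parentSpanId: 'r', name: 'notification.send', startMs: 75, endMs: 120 },
  { spanId: 'g', parentSpanId: 'p', name: 'card.authorize', startMs: 90, endMs: 370 },
]);
```

In this example `notification.send` runs in parallel with the much longer `payment.charge`, so it is left off the path: making it faster would not shorten the request.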
Common Tracing Pitfalls
Missing context propagation breaks traces at service boundaries — you see two separate traces instead of one connected trace. This happens when HTTP calls are made without OTel instrumentation, when custom HTTP clients bypass the propagator, or when gRPC is used without the OTel gRPC instrumentation. Check that every service-to-service call path has OTel auto-instrumentation enabled.
High-cardinality and oversized span attributes can cause backend storage problems. Attributes like user.id or order.id are high cardinality but small, and they are valuable for finding specific traces, so they are usually worth keeping. Attributes like request.body or response.payload can contain unbounded data and should be truncated or excluded. Set attribute length limits and avoid recording request bodies verbatim in spans.
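A small guard applied before calling `span.setAttributes` is one way to enforce this. The sketch below is illustrative: the blocked keys come from the examples above, and the 256-character limit and the `sanitizeAttributes` name are assumptions. OpenTelemetry SDKs also expose their own attribute limits (the specification defines `OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT`), which are worth setting as a backstop.

```typescript
type AttributeValue = string | number | boolean;

// Keys that should never be recorded verbatim (taken from the examples above).
const BLOCKED_KEYS = new Set(['request.body', 'response.payload']);

// Assumed limit for illustration; pick one that suits your backend.
const MAX_VALUE_LENGTH = 256;

// Drops blocked keys and truncates long string values; numbers and booleans pass through.
function sanitizeAttributes(
  attrs: Record<string, AttributeValue>,
): Record<string, AttributeValue> {
  const out: Record<string, AttributeValue> = {};
  for (const key of Object.keys(attrs)) {
    if (BLOCKED_KEYS.has(key)) continue;
    const value = attrs[key];
    out[key] =
      typeof value === 'string' && value.length > MAX_VALUE_LENGTH
        ? value.slice(0, MAX_VALUE_LENGTH)
        : value;
  }
  return out;
}

const safeAttributes = sanitizeAttributes({
  'order.id': 'ord_123',
  'order.item_count': 3,
  'request.body': '{"card":"4111..."}',
  'error.detail': 'x'.repeat(1000),
});
```

Identifiers such as `order.id` pass through untouched, so you can still search for a specific trace.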
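Aggressive sampling can make rare error traces invisible. If you're sampling 1% of traces and a bug affects 0.5% of requests, only about 1 in 20,000 requests produces a sampled trace of the bug, so at low traffic you might miss it entirely. Use error-based sampling to ensure 100% of error traces are preserved regardless of your base sampling rate. Because a trace's error status is only known after its spans have finished, this decision has to be made at the tail (for example, in a collector) rather than when the request starts.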
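The error-based rule can be sketched as a decision made once all spans of a trace have arrived: keep the trace if any span failed, otherwise fall back to the base rate. This is a simplified illustration, not the algorithm of any specific collector. Deriving the base-rate decision from the trace ID (here, its last 8 hex characters) is an assumption of this sketch; it keeps the decision deterministic, so the same trace always gets the same answer.

```typescript
interface SampledSpan {
  traceId: string;
  status: 'UNSET' | 'OK' | 'ERROR';
}

// Map a trace ID to a number in [0, 1) using its last 32 bits.
function traceIdToUnitInterval(traceId: string): number {
  return parseInt(traceId.slice(-8), 16) / 0x100000000;
}

// Decide whether to keep a completed trace.
function shouldKeepTrace(spans: SampledSpan[], baseRate: number): boolean {
  if (spans.length === 0) return false;
  // Error traces are always kept, regardless of the base rate.
  if (spans.some((s) => s.status === 'ERROR')) return true;
  return traceIdToUnitInterval(spans[0].traceId) < baseRate;
}

const exampleTraceId = '4bf92f3577b34da6a3ce929d0e0e4736';
const keepHealthy = shouldKeepTrace([{ traceId: exampleTraceId, status: 'OK' }], 0.01);
const keepFailed = shouldKeepTrace(
  [
    { traceId: exampleTraceId, status: 'OK' },
    { traceId: exampleTraceId, status: 'ERROR' },
  ],
  0.01,
);
```

With a 1% base rate this particular trace ID falls outside the sampled range, so the healthy trace is dropped while the failing one is kept.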
About the Author
ObservabilityOS Team
Core Engineering & DevRel
The core engineering, site reliability, and developer relations team behind ObservabilityOS. We build AI-native observability infrastructure to eliminate 3 AM firefighting.