Why Distributed Tracing Matters
The Problem: A user clicks “Checkout” and waits 5 seconds. The request touched the API gateway, auth, cart, pricing, inventory, payments, fraud, shipping, tax, notifications, and three databases. Logs from each service tell you that each one returned 200 OK. None of them tell you where the 5 seconds went.
The Solution: A distributed trace stitches every operation triggered by that one click into a single causally linked tree, with timing on every span. A flame graph of that tree makes the slow service obvious at a glance.
Real Impact: Google’s Dapper paper described how a single search query at Google fans out to over a thousand servers. Without tracing, debugging a tail-latency regression in that system is hopeless. With tracing, an engineer finds the offender in minutes.
Real-World Analogy
Think about shipping a package across the country:
- A logging system is like a delivery confirmation — you know it arrived (or didn’t), nothing more
- A metrics dashboard is like average delivery times across all packages: useful in aggregate, but no help when it is your package that is late
- A distributed trace is the full tracking page — warehouse scan, sort facility, two trucks, a plane, the local depot, the driver’s van — with a timestamp on every step
When something is late, the tracking page shows you exactly where it sat for six hours. That’s what a trace does for a request flowing through your services.
Logs are per-process. Metrics are aggregated. Neither tells you the causal story of one specific request as it crosses service boundaries. In a monolith you can scroll a single log file and follow a request from top to bottom. In microservices that file lives on twelve hosts, and the only thing that ties the entries together is whatever request-id you remembered to thread through every call. A trace is that thread, made first-class.
What logs and metrics can’t answer
| Question | Logs | Metrics | Traces |
|---|---|---|---|
| Did this request succeed? | Yes | Aggregate only | Yes |
| How many errors per minute? | Possible, slow | Yes | Only if unsampled (otherwise an estimate) |
| Which service caused this slow request? | No | No | Yes |
| What was the call graph for this request? | No | No | Yes |
| Why is P99 worse than P50? | No | Hints only | Yes |
| Did this request hit a cold cache? | Maybe | No | Yes (with attributes) |
The pattern is clear: anything that asks “why was this specific request slow” or “what path did this specific request take” can only be answered by a trace.
The Trace Model
The Vocabulary You Have to Know
The Problem: Tracing tools throw around “trace,” “span,” “context,” and “baggage” as if they’re obvious. They aren’t.
The Solution: Three nouns — trace, span, context — do most of the work. Learn them once and every backend uses the same model.
Core terms
- Trace: the whole story of one request through the system. Identified by a 128-bit trace_id.
- Span: one unit of work in that story (an HTTP call, a DB query, a function). Identified by a 64-bit span_id, with timing, status, and attributes.
- Parent / child: a span that was triggered by another span records that span's ID as its parent. Together the spans of a trace form a tree; links (below) add extra cross-references for async fan-in.
- Span context: the minimal envelope (trace_id, span_id, trace_flags) carried between processes so the next service can attach its spans to the right trace.
- Attributes: structured key/value tags on a span (http.method, db.system, user.id).
- Events: timestamped log-style markers inside a span (“cache miss”, “retry started”).
- Links: a reference from one span to another span, often in a different trace, used when a job processes batched messages from multiple sources.
A trace is a tree of spans. Each span has a start time, an end time, and a parent. The root span is the one with no parent — usually the one created at the edge (API gateway, ingress). Every later span carries the same trace_id, so the backend can reassemble the tree from spans that were emitted by twelve different processes.
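To make the reassembly step concrete, here is a minimal sketch of how a backend might rebuild the tree from spans that arrive out of order. The `Span` fields, the sample IDs, and the timings are illustrative assumptions, not any particular backend's data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int


def build_tree(spans):
    """Group spans by parent ID and return (root, children_by_parent)."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children[span.parent_id].append(span)
    # Order siblings by start time so the rendering reads left to right.
    for siblings in children.values():
        siblings.sort(key=lambda s: s.start_ms)
    return root, children


def render(root, children, depth=0):
    """Return indented lines, one per span, in depth-first order."""
    lines = [f"{'  ' * depth}{root.name} ({root.end_ms - root.start_ms} ms)"]
    for child in children[root.span_id]:
        lines.extend(render(child, children, depth + 1))
    return lines


# Spans arrive in arbitrary order from different processes (made-up data).
spans = [
    Span("t1", "c", "b", "db.update stock", 30, 210),
    Span("t1", "a", None, "POST /checkout", 0, 300),
    Span("t1", "b", "a", "HTTP POST inventory.reserve", 10, 230),
]
root, children = build_tree(spans)
tree_lines = render(root, children)
print("\n".join(tree_lines))
```

The only join key needed is the parent ID; the shared trace_id is what lets the backend gather the right set of spans in the first place.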
For example, suppose the client-side HTTP POST inventory.reserve span lasts 220 ms while the db.update stock span nested inside it lasts 180 ms. The 40 ms gap is network, serialization, and handler overhead. That’s a real signal you can act on, and it is only visible because you have spans on both sides of the call.
Context Propagation
Why Context Is the Whole Game
The Problem: Every service that handles a request needs to know which trace and which parent span it belongs to. Lose that envelope at any hop and the trace falls apart into orphan spans.
The Solution: Standardize on W3C Trace Context headers (traceparent / tracestate). Inject on outbound calls, extract on inbound. Same model for HTTP, gRPC, Kafka, SQS — only the wire format changes.
A trace without context propagation is just a sad list of logs. The tracing SDK on the calling service serializes the active span context into the outbound request; the SDK on the receiving service extracts it and uses it as the parent of its new server-side span. If any link in the chain forgets to propagate, you get two disconnected traces and the flame graph stops at that boundary.
HTTP — W3C Trace Context
# A traceparent header looks like this:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#            ^^ version
#               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ trace_id (16 bytes hex)
#                                                ^^^^^^^^^^^^^^^^ parent span_id (8 bytes hex)
#                                                                 ^^ trace flags (01 = sampled)
# tracestate carries vendor-specific extras (Datadog, Honeycomb, etc.):
tracestate: dd=p:00f067aa0ba902b7;s:1,honeycomb=...
W3C Trace Context replaced a decade of incompatible header formats (B3 single-header, B3 multi-header, X-Cloud-Trace-Context, etc.). B3 is still common in the wild — if you have an Istio mesh or older Zipkin instrumentation, you’ll see x-b3-traceid, x-b3-spanid, x-b3-sampled. Modern OpenTelemetry SDKs default to W3C and can be configured to also accept B3.
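The header layout described above is simple enough to parse by hand, which is useful when debugging a proxy or a legacy service. The following is a minimal sketch that accepts only the four-field layout shown earlier; the `SpanContext` tuple and the function name are illustrative, and in real services the SDK's propagator does this for you.

```python
import re
from typing import NamedTuple, Optional

# version - trace_id - parent span_id - flags, all lowercase hex
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class SpanContext(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return SpanContext(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
print(ctx)
```

Returning `None` for a malformed header mirrors the usual behavior of starting a fresh trace rather than failing the request.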
gRPC and async transports
| Transport | Where context goes | Common gotcha |
|---|---|---|
| HTTP / REST | traceparent, tracestate headers | Reverse proxies stripping unknown headers |
| gRPC | gRPC metadata (binary or text) | Custom interceptors that drop metadata |
| Kafka | Record headers (since 0.11) | Old consumer libs that ignore headers |
| AWS SQS / SNS | Message attributes | SQS allows at most 10 message attributes per message, and the trace header uses one of them |
| RabbitMQ / AMQP | Message headers | Re-enqueue / DLQ flows that strip headers |
| Cron / scheduled jobs | New root trace per run | Treating a cron run as a continuation breaks the model |
Manual injection into a Kafka producer
import json

from opentelemetry import trace, propagate
from confluent_kafka import Producer

tracer = trace.get_tracer("orders.publisher")
producer = Producer({"bootstrap.servers": "kafka:9092"})


def publish_order(order: dict) -> None:
    with tracer.start_as_current_span("kafka.publish orders") as span:
        span.set_attribute("messaging.system", "kafka")
        span.set_attribute("messaging.destination", "orders")
        span.set_attribute("order.id", order["id"])
        # Inject the active span context into a plain dict, then convert
        # to Kafka header format (list of (key, value-bytes) tuples).
        carrier: dict = {}
        propagate.inject(carrier)
        headers = [(k, v.encode("utf-8")) for k, v in carrier.items()]
        producer.produce(
            topic="orders",
            key=order["id"].encode(),
            value=json.dumps(order).encode(),
            headers=headers,  # traceparent rides here
        )
        # Flushing per message keeps the example simple; batch in production.
        producer.flush()
The consumer side does the mirror: read headers into a carrier dict, call propagate.extract(carrier), and use the result as the parent context when starting the consumer span. Done correctly, the producer span and the consumer span (running in a different process, possibly hours later) appear in the same trace as parent and child.
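The consumer-side mirror described above might look like the sketch below. It assumes a `confluent_kafka` message whose `headers()` returns a list of `(key, bytes)` tuples or `None`; the tracer name, span name, and `process_order` handler are hypothetical placeholders.

```python
from typing import Dict, Iterable, Optional, Tuple


def headers_to_carrier(
    headers: Optional[Iterable[Tuple[str, Optional[bytes]]]],
) -> Dict[str, str]:
    """Turn Kafka record headers into the str -> str dict propagators expect."""
    carrier: Dict[str, str] = {}
    for key, value in headers or []:
        if value is not None:
            carrier[key] = value.decode("utf-8")
    return carrier


def process_order(payload: bytes) -> None:
    """Hypothetical business logic; replace with the real handler."""


def handle_message(msg) -> None:
    """Process one Kafka message as a child of the producer's span."""
    # Imported lazily so the helper above works without the SDK installed.
    from opentelemetry import propagate, trace

    tracer = trace.get_tracer("orders.consumer")  # hypothetical tracer name
    parent_ctx = propagate.extract(headers_to_carrier(msg.headers()))
    with tracer.start_as_current_span(
        "kafka.consume orders",
        context=parent_ctx,  # attaches this span to the producer's trace
        kind=trace.SpanKind.CONSUMER,
    ) as span:
        span.set_attribute("messaging.system", "kafka")
        process_order(msg.value())
```

If the headers are missing, `propagate.extract` yields an empty context and the consumer span simply becomes a new root, which is exactly the silent failure mode worth alerting on.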
Treating async as sync breaks your traces
If an inbound HTTP request enqueues a message and returns immediately, the consumer span will run after the HTTP root span has already ended. Most backends still render this correctly — but only if you propagated the context. If you did not, the consumer creates a brand-new root trace and you lose the causal link forever. Audit every queue boundary in your system.
OpenTelemetry
Why OpenTelemetry Won
The Problem: The old world had OpenTracing, OpenCensus, vendor agents from Datadog and New Relic, plus Zipkin and Jaeger client libraries — all incompatible. Every team rewrote instrumentation when they switched backends.
The Solution: OpenTelemetry (OTel) is the CNCF’s merger of OpenTracing and OpenCensus. One vendor-neutral SDK, one wire protocol (OTLP), pluggable exporters. Instrument once, send anywhere.
OpenTelemetry has three pieces you need to understand:
- SDK — runs in your app. Creates spans, adds attributes, manages sampling, batches export.
- Auto-instrumentation — libraries that monkey-patch popular frameworks (FastAPI, Express, Spring, gRPC, requests, redis, psycopg, kafka-python) so you get spans without writing tracer code for every HTTP client.
- Collector — a separate process that receives spans over OTLP, runs processors (batching, sampling, attribute scrubbing), and exports to one or more backends. Lets you change backends without touching apps.
Python — FastAPI auto-instrumentation
# pip install opentelemetry-distro opentelemetry-exporter-otlp \
#             opentelemetry-instrumentation-fastapi \
#             opentelemetry-instrumentation-requests
from fastapi import FastAPI
from pydantic import BaseModel
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# 1. Identify the service in every span (resource attributes).
resource = Resource.create({
    "service.name": "orders",
    "service.version": "2.4.1",
    "deployment.environment": "production",
})

# 2. 10% head sampler; defer to the parent's decision if there is one.
sampler = ParentBased(root=TraceIdRatioBased(0.10))

# 3. Provider + OTLP exporter pointing at the local collector.
provider = TracerProvider(resource=resource, sampler=sampler)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
))
trace.set_tracer_provider(provider)

# 4. Auto-instrument the framework and the HTTP client.
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()
tracer = trace.get_tracer("orders")


# Hypothetical request/response models and business logic, stubbed so the
# example is complete; replace them with your own.
class CheckoutReq(BaseModel):
    order_id: str
    items: list[str]


class CheckoutResult(BaseModel):
    total: float


async def process(req: CheckoutReq) -> CheckoutResult:
    return CheckoutResult(total=0.0)  # placeholder


@app.post("/checkout")
async def checkout(req: CheckoutReq):
    # Manual span nested inside the auto-created server span.
    with tracer.start_as_current_span("checkout.handle") as span:
        span.set_attribute("order.id", req.order_id)
        span.set_attribute("cart.size", len(req.items))
        result = await process(req)
        span.set_attribute("order.total", result.total)
        return result
Node.js — Express auto-instrumentation
// npm i @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
//       @opentelemetry/exporter-trace-otlp-grpc @opentelemetry/resources \
//       @opentelemetry/semantic-conventions @opentelemetry/sdk-trace-base
// File: tracing.js; require this BEFORE express, http, pg, etc.
// Note: newer major versions of the JS SDK changed how resources are
// constructed, so check this against the version you install.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } =
  require('@opentelemetry/semantic-conventions');
const { OTLPTraceExporter } =
  require('@opentelemetry/exporter-trace-otlp-grpc');
const { getNodeAutoInstrumentations } =
  require('@opentelemetry/auto-instrumentations-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } =
  require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'payments',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.8.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  // 10% head sampling at the root; child spans follow the parent's decision.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.10),
  }),
  instrumentations: [getNodeAutoInstrumentations({
    // Skip noisy file-system spans in production
    '@opentelemetry/instrumentation-fs': { enabled: false },
  })],
});

sdk.start();
// Flush buffered spans before the process exits.
process.on('SIGTERM', () => sdk.shutdown());
Once tracing.js is loaded first (e.g. node -r ./tracing.js server.js), the auto-instrumentations wrap Express, the http client, pg, redis, kafkajs, ioredis, and dozens more. You get a populated trace tree with attributes on day one, before you write a single manual span.
Where the Collector fits
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  # Strip PII from spans before they leave your network.
  attributes:
    actions:
      - key: user.email
        action: delete
      - key: http.request.header.authorization
        action: delete
  # Tail-based sampling: see ALL spans of a trace, then decide.
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 2000
    policies:
      - name: errors-keep-100pct
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-keep-100pct
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline-1pct
        type: probabilistic
        probabilistic: { sampling_percentage: 1.0 }
exporters:
  otlphttp/tempo:
    endpoint: https://tempo:4318
  otlphttp/honeycomb:
    endpoint: https://api.honeycomb.io
    headers:
      # Current collector releases expect the env: prefix for environment variables.
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, tail_sampling, batch]
      exporters: [otlphttp/tempo, otlphttp/honeycomb]
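Run the collector as a sidecar (per-pod), a DaemonSet (per-node), or a central gateway. A DaemonSet agent is a common production default: one collector per node and low latency for app exporters. Tail sampling is the exception. It only works when every span of a trace reaches the same collector instance, so it belongs in a gateway tier that the per-node agents forward to, with routing keyed on trace_id (the contrib load-balancing exporter exists for this purpose).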
Sampling
Why You Can’t Sample at 100%
The Problem: A single trace can have hundreds of spans. At 10k requests per second, 100% sampling produces millions of spans per second — tens of TB per day. Your storage bill explodes; query latency on the backend collapses.
The Solution: Sample. The art is sampling in a way that keeps the traces you actually need — errors, slow tails, unusual users — while throwing away the boring 99%.
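The volume claim is easy to sanity-check with back-of-the-envelope arithmetic. In the sketch below only the request rate comes from the text; the spans-per-trace and bytes-per-span figures are assumptions chosen to be plausible, so substitute your own measurements.

```python
REQUESTS_PER_SEC = 10_000   # from the text
SPANS_PER_TRACE = 200       # assumption: "hundreds of spans" per trace
BYTES_PER_SPAN = 500        # assumption: a modest span with a few attributes
SECONDS_PER_DAY = 86_400


def daily_terabytes(sample_ratio: float) -> float:
    """Stored span volume per day, in decimal terabytes, at a given sampling ratio."""
    spans_per_sec = REQUESTS_PER_SEC * SPANS_PER_TRACE * sample_ratio
    return spans_per_sec * BYTES_PER_SPAN * SECONDS_PER_DAY / 1e12


for ratio in (1.0, 0.10, 0.01):
    print(f"{ratio:>5.0%} sampling: {daily_terabytes(ratio):6.2f} TB/day")
```

Under these assumptions, unsampled traffic is about 2 million spans per second and roughly 86 TB per day, while 1% sampling brings that under 1 TB per day.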
Head-based vs tail-based
| Property | Head-based | Tail-based |
|---|---|---|
| Decision point | At the root span, before the request runs | After all spans of the trace are collected |
| Decision based on | trace_id hash, request attributes | Outcome (error, latency, attributes) |
| Where it runs | SDK in the app | Collector or backend |
| Cost | Cheap — reject spans before creation | Expensive — ingest everything, then drop |
| Can it always keep errors? | No — the decision happens before the error is known | Yes — this is its main use case |
| Trace consistency | Guaranteed (parent decision propagates) | Guaranteed only when every span of a trace reaches the same collector instance |
Common sampling strategies
- Probabilistic (head): keep N% of traces by hashing trace_id. Simple, predictable. Default starting point: 1–10%.
- Rate-limited (head): keep at most N traces/sec per service. Caps cost; useful at the edge.
- Parent-based (head): honor the upstream sampling decision (the 01 in traceparent). Without this, half your trace is sampled and half isn’t, which is useless.
- Error-aware (tail): keep 100% of traces with status=ERROR.
- Latency-aware (tail): keep 100% of traces over P95 latency.
- Attribute-based (tail): keep 100% of traces from VIP customers, new releases, debug headers, paid tiers.
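The probabilistic and parent-based head strategies can be sketched in a few lines. This is similar in spirit to ratio-based samplers in tracing SDKs but is not any SDK's exact implementation; the function names and the choice of the low 64 bits are assumptions for illustration.

```python
from typing import Optional

_MASK_64 = (1 << 64) - 1


def ratio_sampled(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic keep/drop: every service given the same trace_id agrees."""
    bound = round(ratio * (1 << 64))
    # Treat the low 64 bits of the trace_id as a uniform random number.
    return (int(trace_id_hex, 16) & _MASK_64) < bound


def parent_based(
    trace_id_hex: str, parent_sampled: Optional[bool], ratio: float
) -> bool:
    """Honor the upstream decision; only root spans consult the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id_hex, ratio)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(parent_based(trace_id, None, 0.10))   # root span: ratio decides
print(parent_based(trace_id, True, 0.10))   # child span: parent decides
```

Because the decision is a pure function of the trace_id, two services that both act as roots for the same ID still agree, and the parent-based wrapper keeps a trace from being half sampled.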
Never run 100% sampling unmonitored in production
It works in dev and staging. The day a marketing campaign sends 10x normal traffic, your collector OOMs, your backend rejects writes, and you lose the very traces you were trying to capture. Set a probabilistic head sampler with a known cost plus tail-sampling rules that always keep errors and slow traces. That gives you the best of both.
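The tail-sampling half of that recommendation boils down to a per-trace decision made once all spans have arrived. Here is a minimal sketch of that decision, mirroring the error, latency, and baseline policies; the `FinishedSpan` type is hypothetical, the 2000 ms and 1% defaults are illustrative, and the function assumes a non-empty span list.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class FinishedSpan:
    name: str
    start_ms: int
    end_ms: int
    is_error: bool = False


def keep_trace(
    spans: Sequence[FinishedSpan],
    latency_threshold_ms: int = 2000,
    baseline_ratio: float = 0.01,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to keep a complete trace (spans must be non-empty)."""
    # Policy 1: always keep traces that contain an error.
    if any(s.is_error for s in spans):
        return True
    # Policy 2: always keep traces slower than the threshold.
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration > latency_threshold_ms:
        return True
    # Policy 3: keep a small random baseline of everything else.
    return rng() < baseline_ratio


fast_ok = [FinishedSpan("GET /health", 0, 12)]
slow_ok = [FinishedSpan("POST /checkout", 0, 5000)]
failed = [FinishedSpan("POST /pay", 0, 40, is_error=True)]
print(keep_trace(slow_ok), keep_trace(failed))
```

Passing `rng` as a parameter keeps the baseline policy testable; production samplers usually derive the baseline from the trace_id instead so that replicas agree.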
Reading Traces in Practice
Knowing What “Bad” Looks Like
The Problem: A flame graph with 200 spans is intimidating. New engineers stare at it and don’t know what to fix.
The Solution: Three or four shapes show up over and over in slow traces. Once you can name them, you can fix them.
The shapes that signal trouble
| Shape in the flame graph | What it usually means | Fix direction |
|---|---|---|
| Client span much wider than the matching server span | Network or queueing time at the boundary | Co-locate, batch, or warm the connection pool |
| Long sequential chain of DB spans (N+1) | ORM lazy-loading per row | Batch the query, eager-fetch, or denormalize |
| Wide fan-out, tail dominates total | Slowest-of-N replica problem | Hedge requests, raise replica count, fix the slow one |
| Big idle gap inside a parent span | Synchronous lock or unrelated wait | Profile inside the span, add events to find it |
| Trace stops mid-call graph | Context propagation broken at that hop | Audit instrumentation on that boundary |
| Same span appears 50x in one trace | Retry storm or per-item loop | Bulk operation, exponential backoff, dedupe |
Pay attention to the difference between span duration and self time (duration minus time spent in children). A parent span that is 5 s wide but spends 4.9 s in one child has 0.1 s of self time — the child is the problem, not the parent.
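Self time is straightforward to compute once you account for children that run in parallel: subtract the union of the child intervals, not their sum. The sketch below assumes spans are reduced to `(start_ms, end_ms)` tuples; the function name is illustrative.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ms, end_ms)


def self_time_ms(parent: Interval, children: List[Interval]) -> int:
    """Parent duration minus the union of child intervals, clipped to the parent."""
    p_start, p_end = parent
    covered = 0
    cursor = p_start  # everything before the cursor is already counted
    for start, end in sorted(children):
        start, end = max(start, cursor), min(end, p_end)
        if end > start:
            covered += end - start
            cursor = end
    return (p_end - p_start) - covered


# The example from the text: 5 s parent, one child covering 4.9 s.
print(self_time_ms((0, 5000), [(50, 4950)]))
# Two overlapping children must not be double-counted.
print(self_time_ms((0, 200), [(0, 100), (50, 150)]))
```

Summing child durations naively can produce negative self time under fan-out, which is why the interval union matters.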
The N+1 in a flame graph
An N+1 query is unmissable in a trace: one parent span (e.g. GET /orders) with 50 narrow DB child spans stacked horizontally. Logs would never show this; they’d just show 50 successful queries. Traces show it as a comb pattern that takes the whole request width.
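The comb pattern can also be detected mechanically by counting identically named child spans under one parent. This is a rough sketch: the `(parent_span_id, span_name)` representation and the threshold of 10 are assumptions, and real span names may need normalizing (for example, stripping literal values from SQL) before counting.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (parent_span_id, span_name); other span fields are omitted for brevity.
ChildSpan = Tuple[str, str]


def find_repeated_children(
    children: Iterable[ChildSpan], threshold: int = 10
) -> Dict[ChildSpan, int]:
    """Return (parent, name) pairs that occur at least `threshold` times."""
    counts = Counter(children)
    return {key: n for key, n in counts.items() if n >= threshold}


# One query for the order list, then one query per order (made-up data).
children = [("root", "SELECT orders")] + [("root", "SELECT order_items")] * 50
suspects = find_repeated_children(children)
print(suspects)
```

Running a check like this over sampled traces in CI or a nightly job can surface N+1 regressions before anyone has to spot them by eye.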
Storage Backends and Tools
The OTLP wire protocol means your apps don’t care which backend you pick. The backends differ in storage cost, query model, and how cleanly they integrate with metrics and logs.
| Backend | Type | Storage | Picks for |
|---|---|---|---|
| Jaeger | Self-hosted OSS | Cassandra, Elasticsearch, OpenSearch, Badger | Teams that already run a search cluster; CNCF-friendly stack |
| Grafana Tempo | Self-hosted OSS | Object storage (S3, GCS) only | Cheap, scalable trace-as-cold-storage; pairs with Loki + Prometheus |
| Zipkin | Self-hosted OSS (older) | Cassandra, MySQL, Elasticsearch | Existing Zipkin shops; B3 ecosystems |
| Honeycomb | SaaS | Proprietary columnar store | High-cardinality querying, BubbleUp, observability-2.0 workflows |
| Datadog APM | SaaS | Proprietary | Single pane with metrics + logs + RUM; large enterprise |
| AWS X-Ray | Managed (AWS) | AWS-managed | AWS-native shops; tight Lambda / API Gateway integration |
| Lightstep / ServiceNow Cloud Observability | SaaS | Proprietary | Tail sampling at scale, change-impact analysis |
How to choose
- If you already run Grafana + Prometheus + Loki: Tempo. Same UI, same auth, S3-cheap storage.
- If your engineers want to ask new questions interactively: Honeycomb. High-cardinality columnar query is its superpower.
- If you want one vendor for traces, logs, metrics, RUM, and synthetics: Datadog APM — expect the bill to scale with span volume.
- If you live in AWS and use Lambda heavily: X-Ray, at least for the Lambda parts. Bridge to a real backend for the rest.
- If you want OSS without operating a search cluster: Tempo. Jaeger is fine; just budget for Elasticsearch/Cassandra ops.
Real-World Examples
Google — Dapper (2010). The paper that started the field. Dapper described how Google instruments every RPC, propagates a trace ID through every layer of its stack, and samples aggressively (often well below 1%) because the absolute volume is enough to keep traces statistically useful. The key insight in the paper: tracing has to be always-on, low-overhead, and application-transparent, or engineers will turn it off.
Twitter — Zipkin (2012). Twitter open-sourced its Dapper-inspired tracer. Zipkin defined the B3 propagation headers that dominated the next decade, built the basic UI vocabulary (trace search, span detail, dependency graph) that every later tool copied, and proved that distributed tracing was a generally useful tool, not a Google-only luxury.
Uber — Jaeger (2015). Uber needed something that scaled to thousands of services and donated the result to the CNCF. Jaeger added adaptive sampling (per-service, per-operation rates that the backend computes and pushes down to clients), and remains the reference open-source tracer.
Honeycomb — observability 2.0. Honeycomb’s framing is that the fundamental unit shouldn’t be the metric or the log line; it should be the wide, structured event — which is, basically, a span with a lot of attributes. Their pitch: keep raw events with high cardinality and you can ask questions you didn’t know you’d need to ask. This is why “adding more attributes to your spans” is a core debugging technique now, not a quaint suggestion.
OpenTelemetry — the CNCF merger. OpenTracing (2016) and OpenCensus (2017) competed for years. In 2019 they merged into OpenTelemetry, which is now the second-most-active CNCF project after Kubernetes. Every major vendor — Datadog, New Relic, Honeycomb, Splunk, Dynatrace, AWS, Google, Azure — ships OTel-native ingestion. The format war is over.
Best Practices
The short list
- Instrument at the edge first. The API gateway or ingress is where every trace begins; if it’s instrumented, every downstream span hangs off it.
- Use auto-instrumentation, then add the missing 10% manually. Don’t hand-roll spans for your HTTP client when an instrumentation library does it correctly already.
- Set a service.name and version on every process. Without resource attributes, the trace UI shows you a tree of nameless boxes.
- Always use parent-based sampling. A trace where half the spans are sampled and half aren’t is worse than no trace at all.
- Combine head sampling for cost with tail sampling for signal. Errors and slow traces should be 100%; the boring baseline can be 1%.
- Add high-cardinality attributes to spans. user.id, tenant.id, feature.flag, build.sha. These are how you slice traces during incidents.
- Strip PII at the collector. Don’t leak emails, tokens, or PAN data to a third-party SaaS. Do it once in the collector, not in every app.
- Link logs and traces with the trace_id. Inject trace_id into your structured log formatter so a log line jumps to its trace in one click.
- Audit context propagation across queues. HTTP usually just works. Kafka, SQS, and cron jobs are where traces silently die.
- Treat the collector as production infrastructure. It is. Capacity-plan it, alert on its queue depth, and run it close to your apps.
Trace IDs in logs are useful only if sampling agrees
If your logs include trace_id but the trace was dropped by the head sampler, the log line links to a 404 in the tracing UI. The fix: log the trace_id regardless, but only emit a “jump to trace” link when the trace_flags show the trace was sampled. Better yet, do tail sampling so error logs always have a corresponding trace.
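One way to apply that fix is to derive the log fields from the traceparent value and add the link only when the sampled flag is set. A minimal sketch, assuming a well-formed four-field header; the `TRACE_UI` base URL and the field names are hypothetical.

```python
TRACE_UI = "https://tracing.example.internal/trace/"  # hypothetical URL


def log_fields(traceparent: str) -> dict:
    """Build structured-log fields from a well-formed traceparent value."""
    _version, trace_id, _span_id, flags = traceparent.split("-")
    fields = {"trace_id": trace_id}  # always logged, sampled or not
    if int(flags, 16) & 0x01:
        # Only advertise a link when the trace was actually recorded.
        fields["trace_url"] = TRACE_UI + trace_id
    return fields


sampled = log_fields("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
dropped = log_fields("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00")
print(sampled)
print(dropped)
```

With tail sampling the flag alone is not authoritative, since the collector may still drop a trace the SDK recorded, so treat the link as best-effort in that setup.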
The single most useful sentence about tracing
If your incident postmortem says “we eventually figured out it was Service C by reading logs on five hosts” — you needed a trace, and you didn’t have one (or had one and didn’t look). Distributed tracing is what turns “eventually” into “in two minutes.”