Distributed Tracing | LIZIU Microservices

Why Distributed Tracing Matters

The Problem: A user clicks “Checkout” and waits 5 seconds. The request touched the API gateway, auth, cart, pricing, inventory, payments, fraud, shipping, tax, notifications, and three databases. Logs from each service tell you that each one returned 200 OK. None of them tell you where the 5 seconds went.

The Solution: A distributed trace stitches every operation triggered by that one click into a single causally-linked tree, with timing on every span. The flame graph makes the slow service obvious in one glance.

Real Impact: Google’s Dapper paper described how a single search query at Google fans out to over a thousand servers. Without tracing, debugging a tail-latency regression in that system is hopeless. With tracing, an engineer finds the offender in minutes.

Real-World Analogy

Think about shipping a package across the country:

A logging system is like a delivery confirmation — you know it arrived (or didn’t), nothing more
A metrics dashboard is like average delivery times across all packages — useful, but useless when your package is late
A distributed trace is the full tracking page — warehouse scan, sort facility, two trucks, a plane, the local depot, the driver’s van — with a timestamp on every step

When something is late, the tracking page shows you exactly where it sat for six hours. That’s what a trace does for a request flowing through your services.

Logs are per-process. Metrics are aggregated. Neither tells you the causal story of one specific request as it crosses service boundaries. In a monolith you can scroll a single log file and follow a request from top to bottom. In microservices that file lives on twelve hosts, and the only thing that ties the entries together is whatever request-id you remembered to thread through every call. A trace is that thread, made first-class.

What logs and metrics can’t answer

Question	Logs	Metrics	Traces
Did this request succeed?	Yes	Aggregate only	Yes
How many errors per minute?	Possible, slow	Yes	Approximate (depends on sampling)
Which service caused this slow request?	No	No	Yes
What was the call graph for this request?	No	No	Yes
Why is P99 worse than P50?	No	Hints only	Yes
Did this request hit a cold cache?	Maybe	No	Yes (with attributes)

The pattern is clear: anything that asks “why was this specific request slow” or “what path did this specific request take” can only be answered by a trace.

The Trace Model

The Vocabulary You Have to Know

The Problem: Tracing tools throw around “trace,” “span,” “context,” and “baggage” as if they’re obvious. They aren’t.

The Solution: Three nouns — trace, span, context — do most of the work. Learn them once and every backend uses the same model.

Core terms

Trace — the whole story of one request through the system. Identified by a 128-bit trace_id.
Span — one unit of work in that story (an HTTP call, a DB query, a function). Identified by a 64-bit span_id, with timing, status, and attributes.
Parent / child — each span records the span_id of the span that triggered it (its parent). Together the spans of a trace form a tree; links can add cross-references on top of it, which makes the overall picture a DAG.
Span context — the minimal envelope (trace_id, span_id, trace_flags) carried between processes so the next service can attach its spans to the right trace.
Attributes — structured key/value tags on a span (http.method, db.system, user.id).
Events — timestamped log-style markers inside a span (“cache miss”, “retry started”).
Links — a reference from one span to another span, often in a different trace, used when a job processes batched messages from multiple sources.

A trace is a tree of spans. Each span has a start time, an end time, and a parent. The root span is the one with no parent — usually the one created at the edge (API gateway, ingress). Every later span carries the same trace_id, so the backend can reassemble the tree from spans that were emitted by twelve different processes.
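The reassembly step can be sketched in a few lines of Python. This is a minimal illustration, not how any particular backend stores spans: the Span class, its field names, and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int
    children: list = field(default_factory=list)


def build_tree(spans):
    """Link the spans of one trace by parent_id and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        elif s.parent_id in by_id:
            by_id[s.parent_id].children.append(s)
        # Otherwise the span is an orphan: its parent was never exported,
        # typically because context propagation broke at that hop.
    if root is None:
        raise ValueError("trace has no root span")
    for s in spans:
        s.children.sort(key=lambda c: c.start_ms)
    return root


def render(span, depth=0):
    """Return an indented outline of the tree, one line per span."""
    lines = [f"{'  ' * depth}{span.name} ({span.end_ms - span.start_ms} ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Spans arrive in arbitrary order from different processes (sample data).
spans = [
    Span("t1", "c", "b", "db.update stock", 30, 210),
    Span("t1", "a", None, "POST /checkout", 0, 250),
    Span("t1", "b", "a", "HTTP POST inventory.reserve", 10, 230),
]
outline = render(build_tree(spans))
```

The arrival order does not matter: the shared trace_id groups the spans, and each parent_id places a span under the right node.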

Suppose a trace shows 220 ms on the HTTP POST inventory.reserve client span and 180 ms on the db.update stock span inside the inventory service. The 40 ms gap is everything outside the database call: network, serialization, and the inventory handler's own overhead. That is a real signal you can act on, and it is only visible because you have spans on both sides of the call.

Context Propagation

Why Context Is the Whole Game

The Problem: Every service that handles a request needs to know which trace and which parent span it belongs to. Lose that envelope at any hop and the trace falls apart into orphan spans.

The Solution: Standardize on W3C Trace Context headers (traceparent / tracestate). Inject on outbound calls, extract on inbound. Same model for HTTP, gRPC, Kafka, SQS — only the wire format changes.

A trace without context propagation is just a sad list of logs. The tracing SDK on the calling service serializes the active span context into the outbound request; the SDK on the receiving service extracts it and uses it as the parent of its new server-side span. If any link in the chain forgets to propagate, you get two disconnected traces and the flame graph stops at that boundary.

HTTP — W3C Trace Context

# A traceparent header looks like this:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#             ^^ version
#                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ trace_id (16 bytes hex)
#                                                   ^^^^^^^^^^^^^^^^ parent span_id (8 bytes hex)
#                                                                    ^^ trace flags (01 = sampled)

# tracestate carries vendor-specific extras (Datadog, Honeycomb, etc.):
tracestate: dd=p:00f067aa0ba902b7;s:1,honeycomb=...

W3C Trace Context replaced a decade of incompatible header formats (B3 single-header, B3 multi-header, X-Cloud-Trace-Context, etc.). B3 is still common in the wild — if you have an Istio mesh or older Zipkin instrumentation, you’ll see x-b3-traceid, x-b3-spanid, x-b3-sampled. Modern OpenTelemetry SDKs default to W3C and can be configured to also accept B3.
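To make the header layout concrete, here is a small stdlib-only sketch that parses and re-emits a version-00 traceparent value. In practice the OpenTelemetry propagator does this for you; the SpanContext class and both function names below are illustrative assumptions, and the sketch deliberately rejects anything that is not version 00.

```python
import re
from dataclasses import dataclass
from typing import Optional

# version - trace_id (32 hex) - parent span_id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the span context, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid; this sketch only handles version 00.
    if version != "00" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: SpanContext, new_span_id: str) -> str:
    """Build the outbound header: same trace, the caller's span as parent."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{new_span_id}-{flags}"


incoming = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
outgoing = format_traceparent(incoming, "b7ad6b7169203331")
```

The trace_id and the sampled flag pass through unchanged; only the parent span_id is replaced at each hop.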

gRPC and async transports

Transport	Where context goes	Common gotcha
HTTP / REST	`traceparent`, `tracestate` headers	Reverse proxies stripping unknown headers
gRPC	gRPC metadata (binary or text)	Custom interceptors that drop metadata
Kafka	Record headers (since 0.11)	Old consumer libs that ignore headers
AWS SQS / SNS	Message attributes	10-attribute SQS limit; some SDKs eat the slot
RabbitMQ / AMQP	Message headers	Re-enqueue / DLQ flows that strip headers
Cron / scheduled jobs	New root trace per run	Treating a cron run as a continuation breaks the model

Manual injection into a Kafka producer

import json

from opentelemetry import trace, propagate
from confluent_kafka import Producer

tracer = trace.get_tracer("orders.publisher")
producer = Producer({"bootstrap.servers": "kafka:9092"})

def publish_order(order: dict) -> None:
    with tracer.start_as_current_span("kafka.publish orders") as span:
        span.set_attribute("messaging.system", "kafka")
        span.set_attribute("messaging.destination", "orders")
        span.set_attribute("order.id", order["id"])

        # Inject the active span context into a plain dict, then convert
        # to Kafka header format (list of (key, value-bytes) tuples).
        carrier: dict = {}
        propagate.inject(carrier)
        headers = [(k, v.encode("utf-8")) for k, v in carrier.items()]

        producer.produce(
            topic="orders",
            key=order["id"].encode(),
            value=json.dumps(order).encode(),
            headers=headers,   # traceparent rides here
        )
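        # flush() blocks until delivery; fine for a demo, but in a hot path
        # call producer.poll(0) per message and flush() once at shutdown.
        producer.flush()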

The consumer side does the mirror: read headers into a carrier dict, call propagate.extract(carrier), and use the result as the parent context when starting the consumer span. Done correctly, the producer span and the consumer span (running in a different process, possibly hours later) appear in the same trace as parent and child.
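The first half of that mirror step, turning Kafka record headers back into a carrier dict, might look like the sketch below. It assumes headers arrive as a list of (key, bytes) tuples or None, which is the shape the producer example sends; the function name headers_to_carrier is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

KafkaHeaders = Optional[List[Tuple[str, Optional[bytes]]]]


def headers_to_carrier(headers: KafkaHeaders) -> Dict[str, str]:
    """Convert Kafka record headers into a str-to-str carrier dict."""
    carrier: Dict[str, str] = {}
    for key, value in headers or []:
        if value is None:
            continue  # header present but empty
        try:
            carrier[key.lower()] = value.decode("utf-8")
        except UnicodeDecodeError:
            continue  # binary header that is not trace context
    return carrier


# Sample headers as a consumer might receive them.
headers = [
    ("traceparent", b"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"),
    ("content-type", b"application/json"),
    ("x-binary", b"\xff\xfe"),
    ("empty", None),
]
carrier = headers_to_carrier(headers)
```

With the OpenTelemetry API, the resulting carrier would then go through ctx = propagate.extract(carrier), and the consumer span would be started with tracer.start_as_current_span("kafka.consume orders", context=ctx) so that it attaches to the producer's trace.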

Treating async as sync breaks your traces

If an inbound HTTP request enqueues a message and returns immediately, the consumer span will run after the HTTP root span has already ended. Most backends still render this correctly — but only if you propagated the context. If you did not, the consumer creates a brand-new root trace and you lose the causal link forever. Audit every queue boundary in your system.

OpenTelemetry

Why OpenTelemetry Won

The Problem: The old world had OpenTracing, OpenCensus, vendor agents from Datadog and New Relic, plus Zipkin and Jaeger client libraries — all incompatible. Every team rewrote instrumentation when they switched backends.

The Solution: OpenTelemetry (OTel) is the CNCF’s merger of OpenTracing and OpenCensus. One vendor-neutral SDK, one wire protocol (OTLP), pluggable exporters. Instrument once, send anywhere.

OpenTelemetry has three pieces you need to understand:

SDK — runs in your app. Creates spans, adds attributes, manages sampling, batches export.
Auto-instrumentation — libraries that monkey-patch popular frameworks (FastAPI, Express, Spring, gRPC, requests, redis, psycopg, kafka-python) so you get spans without writing tracer code for every HTTP client.
Collector — a separate process that receives spans over OTLP, runs processors (batching, sampling, attribute scrubbing), and exports to one or more backends. Lets you change backends without touching apps.

Python — FastAPI auto-instrumentation

# pip install opentelemetry-distro opentelemetry-exporter-otlp \
#             opentelemetry-instrumentation-fastapi \
#             opentelemetry-instrumentation-requests

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# 1. Identify the service in every span (resource attributes).
resource = Resource.create({
    "service.name": "orders",
    "service.version": "2.4.1",
    "deployment.environment": "production",
})

# 2. 10% head sampler — defer to parent if there is one.
sampler = ParentBased(root=TraceIdRatioBased(0.10))

# 3. Provider + OTLP exporter pointing at the local collector.
provider = TracerProvider(resource=resource, sampler=sampler)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
))
trace.set_tracer_provider(provider)

# 4. Auto-instrument the framework and the HTTP client.
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()

tracer = trace.get_tracer("orders")

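# CheckoutReq and process() are the application's own request model and
# handler, defined elsewhere; they are not part of OpenTelemetry.
@app.post("/checkout")
async def checkout(req: CheckoutReq):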
    with tracer.start_as_current_span("checkout.handle") as span:
        span.set_attribute("order.id", req.order_id)
        span.set_attribute("cart.size", len(req.items))
        result = await process(req)
        span.set_attribute("order.total", result.total)
        return result

Node.js — Express auto-instrumentation

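// npm i @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
//        @opentelemetry/exporter-trace-otlp-grpc @opentelemetry/resources \
//        @opentelemetry/semantic-conventions @opentelemetry/sdk-trace-base
// Written against the 1.x JS SDK packages; 2.x replaces `new Resource(...)`
// with resourceFromAttributes(...).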
// File: tracing.js — require this BEFORE express, http, pg, etc.

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } =
    require('@opentelemetry/semantic-conventions');
const { OTLPTraceExporter } =
    require('@opentelemetry/exporter-trace-otlp-grpc');
const { getNodeAutoInstrumentations } =
    require('@opentelemetry/auto-instrumentations-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } =
    require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
    resource: new Resource({
        [SemanticResourceAttributes.SERVICE_NAME]: 'payments',
        [SemanticResourceAttributes.SERVICE_VERSION]: '1.8.0',
        [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'production',
    }),
    traceExporter: new OTLPTraceExporter({
        url: 'http://otel-collector:4317',
    }),
    sampler: new ParentBasedSampler({
        root: new TraceIdRatioBasedSampler(0.10),
    }),
    instrumentations: [getNodeAutoInstrumentations({
        // Skip noisy file-system spans in production
        '@opentelemetry/instrumentation-fs': { enabled: false },
    })],
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());

Once tracing.js is loaded first (e.g. node -r ./tracing.js server.js), the auto-instrumentations wrap Express, the http client, pg, redis, kafkajs, ioredis, and dozens more. You get a populated trace tree with attributes on day one, before you write a single manual span.

Where the Collector fits

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

  # Strip PII from spans before they leave your network.
  attributes:
    actions:
      - key: user.email
        action: delete
      - key: http.request.header.authorization
        action: delete

  # Tail-based sampling: see ALL spans of a trace, then decide.
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 2000
    policies:
      - name: errors-keep-100pct
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-keep-100pct
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline-1pct
        type: probabilistic
        probabilistic: { sampling_percentage: 1.0 }

exporters:
  otlphttp/tempo:
    endpoint: https://tempo:4318
  otlphttp/honeycomb:
    endpoint: https://api.honeycomb.io
    headers:
      x-honeycomb-team: ${HONEYCOMB_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, tail_sampling, batch]
      exporters: [otlphttp/tempo, otlphttp/honeycomb]

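Run the collector as a sidecar (per-pod), a DaemonSet (per-node), or a central gateway. The DaemonSet pattern is a common production default: one collector per node and low latency for app exporters. Tail sampling is the exception. The tail_sampling processor can only decide on the spans it receives, so every span of a trace has to reach the same collector instance. A per-node agent sees only the spans emitted on its own node, so run tail sampling in a gateway tier and have the agents route spans to it by trace_id (for example, with the collector's load-balancing exporter).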

Sampling

Why You Can’t Sample at 100%

The Problem: A single trace can have hundreds of spans. At 10k requests per second, 100% sampling produces millions of spans per second — tens of TB per day. Your storage bill explodes; query latency on the backend collapses.

The Solution: Sample. The art is sampling in a way that keeps the traces you actually need — errors, slow tails, unusual users — while throwing away the boring 99%.
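A back-of-the-envelope estimate shows where "tens of TB per day" comes from. The 200 spans per trace and 500 bytes per stored span below are assumptions chosen for illustration; substitute your own measurements.

```python
SECONDS_PER_DAY = 86_400


def daily_volume_tb(requests_per_sec, spans_per_trace, bytes_per_span,
                    sample_ratio=1.0):
    """Estimate stored span volume in terabytes (10**12 bytes) per day."""
    spans_per_sec = requests_per_sec * spans_per_trace * sample_ratio
    return spans_per_sec * bytes_per_span * SECONDS_PER_DAY / 1e12


# Assumed workload: 10k req/s, 200 spans per trace, ~500 bytes per span.
full = daily_volume_tb(10_000, 200, 500)           # 2M spans/s, about 86 TB/day
sampled = daily_volume_tb(10_000, 200, 500, 0.01)  # under 1 TB/day at 1%
```

Under these assumptions, dropping from 100% to 1% sampling moves the daily volume from roughly 86 TB to under 1 TB.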

Head-based vs tail-based

Property	Head-based	Tail-based
Decision point	At the root span, before the request runs	After all spans of the trace are collected
Decision based on	`trace_id` hash, request attributes	Outcome (error, latency, attributes)
Where it runs	SDK in the app	Collector or backend
Cost	Cheap — reject spans before creation	Expensive — ingest everything, then drop
Can it always keep errors?	No — the decision happens before the error is known	Yes — this is its main use case
Trace consistency	Holds when every service uses parent-based sampling	Holds when all spans of a trace reach the same collector instance

Common sampling strategies

Probabilistic (head) — keep N% of traces by hashing trace_id. Simple, predictable. Default starting point: 1–10%.
Rate-limited (head) — keep at most N traces/sec per service. Caps cost; useful at the edge.
Parent-based (head) — honor the upstream sampling decision (the 01 in traceparent). Without this, half your trace is sampled and half isn’t — useless.
Error-aware (tail) — keep 100% of traces with status=ERROR.
Latency-aware (tail) — keep 100% of traces slower than a fixed threshold (for example, your P95 latency).
Attribute-based (tail) — keep 100% of traces from VIP customers, new releases, debug headers, paid tiers.
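The two most common head strategies, probabilistic and parent-based, can be sketched as follows. This illustrates the idea rather than reproducing any SDK's exact algorithm: comparing the low 64 bits of the trace_id against a bound is an assumption of the sketch, and both function names are hypothetical.

```python
from typing import Optional

_LOW_64_MASK = (1 << 64) - 1


def ratio_sampled(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic decision: the same trace_id always gets the same answer."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << 64))
    return (int(trace_id_hex, 16) & _LOW_64_MASK) < bound


def parent_based(trace_id_hex: str, parent_sampled: Optional[bool],
                 ratio: float) -> bool:
    """Honor the upstream decision; fall back to the ratio for root spans."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id_hex, ratio)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
root_decision = parent_based(trace_id, None, 0.10)   # no parent: use the ratio
child_decision = parent_based(trace_id, True, 0.10)  # parent sampled: keep
```

Because the decision depends only on the trace_id, two services that sample independently at the same ratio still agree on which traces to keep.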

Never run 100% sampling unmonitored in production

It works in dev and staging. The day a marketing campaign sends 10x normal traffic, your collector OOMs, your backend rejects writes, and you lose the very traces you were trying to capture. Set a probabilistic head sampler with a known cost plus tail-sampling rules that always keep errors and slow traces. That gives you the best of both.
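The tail-side rules can be pictured as a single function applied once all spans of a trace have arrived. The sketch mirrors the three collector policies shown earlier (errors, a 2000 ms latency threshold, a 1% baseline); the FinishedSpan class and the modulo-based baseline bucket are assumptions made for the example, not the collector's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FinishedSpan:
    trace_id: str
    start_ms: int
    end_ms: int
    is_error: bool = False


def keep_trace(spans: List[FinishedSpan], latency_threshold_ms: int = 2000,
               baseline_ratio: float = 0.01) -> Tuple[bool, str]:
    """Decide for a whole trace; returns (keep, reason)."""
    if not spans:
        raise ValueError("a trace needs at least one span")
    if any(s.is_error for s in spans):
        return True, "error"
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration > latency_threshold_ms:
        return True, "slow"
    # Baseline: bucket the trace_id so the decision is repeatable.
    bucket = int(spans[0].trace_id, 16) % 10_000
    if bucket < baseline_ratio * 10_000:
        return True, "baseline"
    return False, "dropped"


boring_id = "00000000000000000000000000001388"  # bucket 5000, outside the 1%
failed = keep_trace([FinishedSpan(boring_id, 0, 120, is_error=True)])
slow = keep_trace([FinishedSpan(boring_id, 0, 2500)])
boring = keep_trace([FinishedSpan(boring_id, 0, 120)])
```

The order of the checks matters: errors and slow traces are kept before the baseline ratio is ever consulted, so the 1% only applies to the uneventful remainder.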

Reading Traces in Practice

Knowing What “Bad” Looks Like

The Problem: A flame graph with 200 spans is intimidating. New engineers stare at it and don’t know what to fix.

The Solution: Three or four shapes show up over and over in slow traces. Once you can name them, you can fix them.

The shapes that signal trouble

Shape in the flame graph	What it usually means	Fix direction
Child span much wider than the work it does	Network or queueing time at the boundary	Co-locate, batch, or warm the connection pool
Long sequential chain of DB spans (N+1)	ORM lazy-loading per row	Batch the query, eager-fetch, or denormalize
Wide fan-out, tail dominates total	Slowest-of-N replica problem	Hedge requests, raise replica count, fix the slow one
Big idle gap inside a parent span	Synchronous lock or unrelated wait	Profile inside the span, add events to find it
Trace stops mid-call graph	Context propagation broken at that hop	Audit instrumentation on that boundary
Same span appears 50x in one trace	Retry storm or per-item loop	Bulk operation, exponential backoff, dedupe

Pay attention to the difference between span duration and self time (duration minus time spent in children). A parent span that is 5 s wide but spends 4.9 s in one child has 0.1 s of self time — the child is the problem, not the parent.
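Self time is easy to compute once you account for children that run in parallel. The sketch below subtracts the union of the child intervals rather than their sum, so overlapping children are not double-counted; representing spans as (start_ms, end_ms) tuples is an assumption of the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ms, end_ms)


def self_time_ms(span: Interval, children: List[Interval]) -> int:
    """Span duration minus the time covered by at least one child."""
    start, end = span
    covered = 0
    cursor = start  # everything before cursor is already counted
    for child_start, child_end in sorted(children):
        child_start = max(child_start, cursor)
        child_end = min(child_end, end)
        if child_end > child_start:
            covered += child_end - child_start
            cursor = child_end
    return (end - start) - covered


# The case from the text: a 5 s parent with one 4.9 s child.
sequential = self_time_ms((0, 5000), [(50, 4950)])

# Two overlapping children cover 100..700, not 500 + 500.
parallel = self_time_ms((0, 1000), [(100, 600), (200, 700)])
```

Here sequential is 100 ms and parallel is 400 ms; summing child durations naively would report 0 ms for the second case.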

The N+1 in a flame graph

An N+1 query is unmissable in a trace: one parent span (e.g. GET /orders) with 50 narrow DB child spans stacked horizontally. Logs would never show this; they’d just show 50 successful queries. Traces show it as a comb pattern that takes the whole request width.
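The comb pattern is also easy to detect mechanically: count sibling spans that share a parent and a name. The sketch assumes spans are plain dicts with span_id, parent_id, and name keys and that repeated queries share a span name; the threshold of 10 repeats is arbitrary.

```python
from collections import Counter
from typing import Dict, List, Tuple


def find_repeated_children(spans: List[Dict], min_repeats: int = 10
                           ) -> List[Tuple[str, str, int]]:
    """Return (parent_id, child_name, count) for heavily repeated siblings."""
    counts = Counter(
        (s["parent_id"], s["name"])
        for s in spans
        if s["parent_id"] is not None
    )
    return sorted(
        (parent, name, n)
        for (parent, name), n in counts.items()
        if n >= min_repeats
    )


# Sample trace: one request, one list query, then one query per row.
spans = [
    {"span_id": "a", "parent_id": None, "name": "GET /orders"},
    {"span_id": "b", "parent_id": "a", "name": "SELECT orders"},
]
spans += [
    {"span_id": f"c{i}", "parent_id": "a", "name": "SELECT order_items"}
    for i in range(50)
]
suspects = find_repeated_children(spans)
```

For this sample, suspects contains a single entry: 50 SELECT order_items spans under the GET /orders span. The same check also flags retry storms and per-item loops.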

Storage Backends and Tools

The OTLP wire protocol means your apps don’t care which backend you pick. The backends differ in storage cost, query model, and how cleanly they integrate with metrics and logs.

Backend	Type	Storage	Picks for
Jaeger	Self-hosted OSS	Cassandra, Elasticsearch, OpenSearch, Badger	Teams that already run a search cluster; CNCF-friendly stack
Grafana Tempo	Self-hosted OSS	Object storage (S3, GCS) only	Cheap, scalable trace-as-cold-storage; pairs with Loki + Prometheus
Zipkin	Self-hosted OSS (older)	Cassandra, MySQL, Elasticsearch	Existing Zipkin shops; B3 ecosystems
Honeycomb	SaaS	Proprietary columnar store	High-cardinality querying, BubbleUp, observability-2.0 workflows
Datadog APM	SaaS	Proprietary	Single pane with metrics + logs + RUM; large enterprise
AWS X-Ray	Managed (AWS)	AWS-managed	AWS-native shops; tight Lambda / API Gateway integration
Lightstep / ServiceNow Cloud Observability	SaaS	Proprietary	Tail sampling at scale, change-impact analysis

How to choose

If you already run Grafana + Prometheus + Loki: Tempo. Same UI, same auth, S3-cheap storage.
If your engineers want to ask new questions interactively: Honeycomb. High-cardinality columnar query is its superpower.
If you want one vendor for traces, logs, metrics, RUM, and synthetics: Datadog APM — expect the bill to scale with span volume.
If you live in AWS and use Lambda heavily: X-Ray, at least for the Lambda parts. Bridge to a real backend for the rest.
If you want OSS without operating a search cluster: Tempo. Jaeger is fine; just budget for Elasticsearch/Cassandra ops.

Real-World Examples

Google — Dapper (2010). The paper that started the field. Dapper described how Google instruments every RPC, propagates a trace ID through every layer of its stack, and samples aggressively (often well below 1%) because the absolute volume is enough to keep traces statistically useful. The key insight in the paper: tracing has to be always-on, low-overhead, and application-transparent, or engineers will turn it off.

Twitter — Zipkin (2012). Twitter open-sourced its Dapper-inspired tracer. Zipkin defined the B3 propagation headers that dominated the next decade, built the basic UI vocabulary (trace search, span detail, dependency graph) that every later tool copied, and proved that distributed tracing was a generally useful tool, not a Google-only luxury.

Uber — Jaeger (2015). Uber needed something that scaled to thousands of services and donated the result to the CNCF. Jaeger added adaptive sampling (per-service, per-operation rates that the backend computes and pushes down to clients), and remains the reference open-source tracer.

Honeycomb — observability 2.0. Honeycomb’s framing is that the fundamental unit shouldn’t be the metric or the log line; it should be the wide, structured event — which is, basically, a span with a lot of attributes. Their pitch: keep raw events with high cardinality and you can ask questions you didn’t know you’d need to ask. This is why “adding more attributes to your spans” is a core debugging technique now, not a quaint suggestion.

OpenTelemetry — the CNCF merger. OpenTracing (2016) and OpenCensus (2017) competed for years. In 2019 they merged into OpenTelemetry, which is now the second-most-active CNCF project after Kubernetes. Every major vendor — Datadog, New Relic, Honeycomb, Splunk, Dynatrace, AWS, Google, Azure — ships OTel-native ingestion. The format war is over.

Best Practices

The short list

Instrument at the edge first. The API gateway or ingress is where every trace begins; if it’s instrumented, every downstream span hangs off it.
Use auto-instrumentation, then add the missing 10% manually. Don’t hand-roll spans for your HTTP client when an instrumentation library does it correctly already.
Set a service.name and version on every process. Without resource attributes, the trace UI shows you a tree of nameless boxes.
Always use parent-based sampling. A trace where half the spans are sampled and half aren’t is worse than no trace at all.
Combine head sampling for cost with tail sampling for signal. Errors and slow traces should be 100%; the boring baseline can be 1%.
Add high-cardinality attributes to spans. user.id, tenant.id, feature.flag, build.sha. These are how you slice traces during incidents.
Strip PII at the collector. Don’t leak emails, tokens, or PAN data to a third-party SaaS. Do it once in the collector, not in every app.
Link logs and traces with the trace_id. Inject trace_id into your structured log formatter so a log line jumps to its trace in one click.
Audit context propagation across queues. HTTP usually just works. Kafka, SQS, and cron jobs are where traces silently die.
Treat the collector as production infrastructure. It is. Capacity-plan it, alert on its queue depth, and run it close to your apps.

Trace IDs in logs are useful only if sampling agrees

If your logs include trace_id but the trace was dropped by the head sampler, the log line links to a 404 in the tracing UI. The fix: log the trace_id regardless, but only emit a “jump to trace” link when the trace_flags show the trace was sampled. Better yet, do tail sampling so error logs always have a corresponding trace.
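One way to implement that rule is a logging filter that always stamps the trace_id but only adds a link when the trace was sampled. This stdlib-only sketch keeps the current trace in a context variable; with OpenTelemetry you would read the same two values from trace.get_current_span().get_span_context(). The TRACE_UI URL, the current_trace variable, and the class name are hypothetical.

```python
import contextvars
import io
import logging

# (trace_id, sampled) for the request being handled; None outside a request.
current_trace = contextvars.ContextVar("current_trace", default=None)

TRACE_UI = "https://tracing.example.internal/trace/"  # hypothetical URL


class TraceContextFilter(logging.Filter):
    """Add trace_id to every record, and trace_url only for sampled traces."""

    def filter(self, record):
        ctx = current_trace.get()
        record.trace_id = ctx[0] if ctx else "-"
        record.trace_url = (TRACE_UI + ctx[0]) if (ctx and ctx[1]) else "-"
        return True


buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    "%(message)s trace_id=%(trace_id)s trace_url=%(trace_url)s"
))
logger = logging.getLogger("orders.demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
current_trace.set((trace_id, True))    # sampled: link is emitted
logger.info("payment failed")
current_trace.set((trace_id, False))   # dropped by the sampler: no link
logger.info("payment ok")
log_lines = buffer.getvalue().splitlines()
```

Both lines carry the trace_id, so they can still be correlated with each other; only the first one advertises a trace that actually exists in the backend.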

The single most useful sentence about tracing

If your incident postmortem says “we eventually figured out it was Service C by reading logs on five hosts” — you needed a trace, and you didn’t have one (or had one and didn’t look). Distributed tracing is what turns “eventually” into “in two minutes.”