04 — Distributed Tracing

Technical Overview

Distributed tracing is the practice of recording the path of a request as it traverses a system composed of multiple services, threads, and processes. A single user action, such as clicking "checkout", may fan out through a dozen microservices: API gateway, auth service, cart service, inventory service, payment service, fraud detection, notification service, order persistence. Each hop adds latency. Errors may originate in one service and surface as failures in another. Without distributed tracing, debugging cross-service latency or error chains means correlating logs by timestamp and guessing. With tracing, you see the full causal chain: which service contributed how much latency, where the error originated, and which instrumented operations ran along the way.

Prerequisites

Understanding of HTTP headers and request propagation
Familiarity with microservices architecture
Basic knowledge of observability concepts (see 01-observability-fundamentals.md)
Comfort reading structured data (JSON, protobuf)

Core Content

The Trace Data Model

A trace represents the end-to-end journey of a single request. It is composed of spans. Each span represents one unit of work: an HTTP call, a database query, a cache lookup, a message queue publish.

DISTRIBUTED TRACE: checkout request (trace_id: 4bf92f3577b34da6a3ce929d0e0e4736)

Timeline ──────────────────────────────────────────────────────────▶
         0ms          100ms       200ms       300ms       400ms  450ms

API Gateway   ┌──────────────────────────────────────────────────────┐
              │ span: handle_request                        450ms total│
              └──────────────────────┬───────────────────────────────┘
                                     │ HTTP call
Auth Service               ┌─────────┴──────────┐
                           │ span: verify_token  │ 40ms
                           └────────────────────┘
                                                  │ HTTP call
Cart Service                          ┌───────────┴───────────────┐
                                      │ span: get_cart   150ms    │
                                      └──┬──────────────┬─────────┘
                                         │              │
                                    ┌────┴────┐   ┌────┴────────┐
                                    │Redis GET│   │Postgres     │
                                    │ 5ms     │   │ SELECT 80ms │
                                    └─────────┘   └─────────────┘
                                                             │ HTTP call
Payment Service                                    ┌─────────┴──────────────┐
                                                   │ span: charge_card 150ms│
                                                   └────────────────────────┘

WATERFALL VIEW:
  api-gateway:handle_request          ████████████████████████ 450ms
    auth:verify_token                 ███ 40ms
    cart:get_cart                              █████████ 150ms
      redis:GET                               █ 5ms
      postgres:SELECT                            ████ 80ms
    payment:charge_card                                   █████ 150ms

Each span contains:
- trace_id: shared across all spans in the trace (128 bits, written as 32 hex characters)
- span_id: unique to this span (64 bits, written as 16 hex characters)
- parent_span_id: links to the calling span (absent for the root span)
- operation_name: human-readable name for the operation
- start_time, end_time: nanosecond-precision timestamps
- status: OK, ERROR, or UNSET
- attributes: key=value pairs (arbitrary metadata)
- events: timestamped log entries within the span
- links: references to other spans, possibly in other traces (e.g., async operations, batch jobs)
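As a rough illustration of the fields listed above, the span record can be sketched as a small Python dataclass. The class name, the field types, and the second span ID are assumptions made for this example and do not reflect any particular SDK's internal representation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str                   # 32 hex chars (128 bits), shared by the whole trace
    span_id: str                    # 16 hex chars (64 bits), unique to this span
    parent_span_id: Optional[str]   # None for the root span
    operation_name: str
    start_time_ns: int              # nanosecond timestamps
    end_time_ns: int
    status: str = "UNSET"           # "OK", "ERROR", or "UNSET"
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)   # (timestamp_ns, message) tuples
    links: list = field(default_factory=list)    # (trace_id, span_id) tuples

    @property
    def is_root(self) -> bool:
        return self.parent_span_id is None

    @property
    def duration_ms(self) -> float:
        return (self.end_time_ns - self.start_time_ns) / 1_000_000


# Two spans from the checkout example: the gateway root and the auth child.
TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"
root = Span(TRACE_ID, "00f067aa0ba902b7", None, "handle_request", 0, 450_000_000)
auth = Span(TRACE_ID, "53995c3f42cd8ad8", root.span_id, "verify_token",
            100_000_000, 140_000_000, status="OK",
            attributes={"service.name": "auth"})
print(root.is_root, root.duration_ms, auth.duration_ms)
```

The parent/child relationship lives entirely in parent_span_id, which is why a backend can rebuild the tree from spans that arrive in any order.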

Span Context Propagation

For spans across service boundaries to join the same trace, the context (trace ID + span ID) must be propagated in the request headers. This is called context propagation.

Service A ──HTTP GET /cart──────────────────────────────▶ Service B
          headers:
            traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
                         ^^ version
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ trace_id (128 bits)
                                                             ^^^^^^^^^^^^^^^^ parent span id (64 bits)
                                                                              ^^ flags (01 = sampled)

W3C Trace Context (W3C Recommendation, 2021) defines two HTTP headers:
- traceparent: {version}-{trace_id}-{parent_id}-{flags}
- tracestate: vendor-specific key=value pairs for additional propagation data (e.g., vendor-specific sampling state, vendor tags)
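To make the traceparent layout concrete, here is a minimal parsing sketch. The function and type names are hypothetical, and the parser deliberately accepts only the four-field layout shown above; a production propagator also has to handle future versions as the specification describes.

```python
import re
from typing import NamedTuple, Optional

# version - trace_id - parent_id - flags, all lowercase hex
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version == "ff":                       # reserved, always invalid
        return None
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None                           # all-zero IDs are invalid
    sampled = bool(int(flags, 16) & 0x01)     # lowest flag bit = sampled
    return TraceParent(version, trace_id, parent_id, sampled)


example = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(example))
```

Returning None for anything malformed mirrors the usual propagator behavior of ignoring a bad header and starting a new trace instead of failing the request.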

B3 Headers (Zipkin legacy): separate headers X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, X-B3-Sampled. Many legacy services still use B3; propagators can handle both formats simultaneously.

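Baggage: W3C Baggage (a separate specification; example header: baggage: user_id=u-12345,tenant=acme) carries arbitrary key-value pairs alongside the trace context to every downstream service. Use sparingly: baggage adds bytes to every outgoing request in the trace, and every service that receives it can read it.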

Dapper: The Foundational Paper

Google published "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" in 2010. Dapper defined the modern trace model (trace/span/annotation) and identified the key engineering challenges:
- Ubiquitous deployment: tracing must be always-on, not opt-in, to be useful for debugging production incidents
- Low overhead: must not perturb the system being observed (the paper reports the collection daemon using less than 0.3% of one CPU core and trace data accounting for less than 0.01% of production network traffic)
- Application transparency: instrumentation via library hooks, not application code changes
- Scalability: Google's system collected billions of spans/day

Dapper's architecture: each service wrote spans to local log files; Dapper daemons then collected them and wrote them to Bigtable. The sampling strategy was simple: 1 in 1024 root transactions was sampled, giving statistical coverage without the full overhead.

Key insight from Dapper: even with 1/1024 sampling, at Google's request rate, billions of traces per day were collected. This was sufficient for statistical performance analysis. For debugging specific incidents, sampling is less useful — you may not have a sample of the specific failing request.

Jaeger and Zipkin

Zipkin (Twitter, 2012): the first widely adopted open-source distributed tracing system. Based on Dapper's model. Uses B3 headers. Storage backends: Cassandra, Elasticsearch. Still widely deployed; OpenTelemetry data can reach it through a translation layer such as the OTel Collector's Zipkin exporter.

Jaeger (Uber, 2015 → CNCF 2017): built on concepts similar to Zipkin's but with a better scalability story. Originally used its own SDKs; now recommends OTel SDKs with OTLP export to Jaeger's collector. Storage backends: Cassandra, Elasticsearch, Badger (embedded). Jaeger v2 (2024) is built on the OTel Collector codebase.

Grafana Tempo (2021): designed explicitly for cost-efficient trace storage at scale. Stores traces in object storage (S3/GCS) indexed only by trace ID. There is no full-text index: in the original design a trace was retrieved by ID, typically by following trace ID links from metrics or logs in Grafana, and later versions added search through TraceQL (covered below). This design choice makes it dramatically cheaper than Elasticsearch-backed Jaeger for high-volume deployments.

Trace Sampling

The fundamental tension in tracing: 100% sampling provides complete visibility but is expensive (storage, export bandwidth, query overhead). Sampling reduces cost but may miss the interesting traces.

Head Sampling (probabilistic): the sampling decision is made at the root span (the entry point), before any downstream work. Simple to implement, low overhead. Problem: the decision is made before you know if the request will be slow or erroneous.

# Simple head sampling: sample 10% of all requests (OpenTelemetry Python SDK)
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
sampler = TraceIdRatioBased(0.10)
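A ratio-based head sampler is normally deterministic in the trace ID, so every service that sees the same trace reaches the same decision without coordination. The following is a minimal sketch of that idea; should_sample is a hypothetical helper written for illustration, not the OpenTelemetry SDK's implementation.

```python
import random


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide from the trace ID alone whether to keep the trace."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the trace ID as a uniform random number
    # and keep the trace if it falls below ratio * 2**64.
    low64 = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low64 < int(ratio * (1 << 64))


# The decision is stable for a given trace ID...
tid = "4bf92f3577b34da6a3ce929d0e0e4736"
print(should_sample(tid, 0.10), should_sample(tid, 0.10))

# ...and over many random IDs the kept fraction approaches the ratio.
rng = random.Random(42)
ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(should_sample(i, 0.10) for i in ids)
print(kept / len(ids))
```

Because the decision depends only on the trace ID and the ratio, it is made before anything is known about the request's latency or outcome, which is exactly the limitation described above.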

Tail Sampling: hold spans in memory until the trace completes, then apply sampling policies based on the full trace. Keep all errors, keep all slow traces, sample the rest. Requires a stateful component that receives every span of a trace, typically the OTel Collector with the tail_sampling processor. Jaeger's remote sampling endpoint is not a substitute: it hands head-sampling rates to SDKs and never sees completed traces.
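The policy described for tail sampling (keep errors, keep slow traces, sample the rest) can be sketched as a single decision function over a completed trace. The span dictionary shape, the thresholds, and the injectable rng parameter are assumptions made for this example.

```python
import random


def keep_trace(spans, latency_threshold_ms=500.0, baseline_ratio=0.01,
               rng=random.random):
    """Tail-sampling decision for one completed trace.

    spans: list of dicts with 'status', 'start_ms' and 'end_ms' keys.
    """
    # Policy 1: keep every trace that contains an error span.
    if any(s["status"] == "ERROR" for s in spans):
        return True
    # Policy 2: keep every trace slower than the threshold.
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration > latency_threshold_ms:
        return True
    # Policy 3: keep a small random share of everything else.
    return rng() < baseline_ratio


fast_ok = [{"status": "OK", "start_ms": 0, "end_ms": 120}]
slow_ok = [{"status": "OK", "start_ms": 0, "end_ms": 900}]
failed = [{"status": "OK", "start_ms": 0, "end_ms": 50},
          {"status": "ERROR", "start_ms": 10, "end_ms": 40}]
print(keep_trace(slow_ok), keep_trace(failed))
```

The function needs the whole trace as input, which is the reason tail sampling requires buffering spans in a stateful component until the trace is considered complete.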

Parent-based sampling: if an incoming request has the sampled flag set in its traceparent, honor it. If the flag is clear, don't record spans. This ensures consistent sampling across all services for a given trace and avoids partial traces where Service A sampled but Service B didn't.
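A parent-based decision can be sketched as a thin wrapper: follow the caller's sampled flag when a traceparent is present, and fall back to a root sampler only when this service starts the trace. The function name and the root_sampler callback are hypothetical.

```python
from typing import Callable, Optional


def parent_based_sample(traceparent: Optional[str],
                        root_sampler: Callable[[], bool]) -> bool:
    """Honor the caller's decision if there is one, else ask the root sampler."""
    if traceparent:
        parts = traceparent.strip().split("-")
        if len(parts) == 4:
            try:
                # Lowest bit of the flags field is the sampled flag.
                return bool(int(parts[3], 16) & 0x01)
            except ValueError:
                pass  # malformed flags: treat the request as having no parent
    return root_sampler()


sampled = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
unsampled = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00"
print(parent_based_sample(sampled, lambda: False))    # parent says keep
print(parent_based_sample(unsampled, lambda: True))   # parent says drop
print(parent_based_sample(None, lambda: True))        # no parent: root decides
```

The root sampler is only consulted at the entry point, so a single head-sampling decision is carried through the rest of the call chain.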

Rate limiting: sample at most N traces per second, regardless of total traffic. Ensures stable storage growth even during traffic spikes.
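One common way to implement a rate-limiting sampler is a token bucket. The sketch below assumes that approach; the class name and the injectable clock (used here so the example is deterministic) are choices made for illustration.

```python
import time


class RateLimitingSampler:
    """Keep at most max_traces_per_second traces, using a token bucket."""

    def __init__(self, max_traces_per_second: float, clock=time.monotonic):
        self.rate = max_traces_per_second
        self.capacity = max(1.0, max_traces_per_second)  # burst allowance
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    def should_sample(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Deterministic demo with a fake clock: 2 traces/second allowed.
fake_now = [0.0]
demo = RateLimitingSampler(2, clock=lambda: fake_now[0])
first_burst = [demo.should_sample() for _ in range(4)]
fake_now[0] = 1.0  # one second later the bucket has refilled
second_burst = [demo.should_sample() for _ in range(3)]
print(first_burst, second_burst)
```

However large the traffic spike, the number of kept traces per second stays bounded by the configured rate, at the cost of a lower effective sampling percentage during the spike.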

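Production recommendation: use head sampling at 10-100% for low-to-medium traffic services, and tail sampling with an OTel Collector for high-traffic services. Aim to keep 100% of error traces; only tail sampling can guarantee this, because a head sampler decides before the error has happened.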

Trace Analysis

Once traces are collected, the analysis capabilities are:

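Latency attribution: which span contributes what percentage of total request latency? Subtract the time covered by child spans from the parent's duration to find gaps (time in the parent not accounted for by children = computation or waiting in that service). When children run in parallel, use the union of their time intervals rather than the sum of their durations, or the covered time is overcounted.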
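The gap calculation for latency attribution can be sketched as an interval-union problem. The function below is an illustrative helper that takes (start_ms, end_ms) tuples; the timings in the examples are assumed values loosely based on the checkout diagram.

```python
def self_time_ms(parent, children):
    """Time inside the parent span that no child span covers.

    parent and each child are (start_ms, end_ms) tuples. Children that
    overlap in time (parallel calls) are merged so they count only once.
    """
    p_start, p_end = parent
    # Clip children to the parent's window and sort by start time.
    clipped = sorted(
        (max(s, p_start), min(e, p_end))
        for s, e in children
        if e > p_start and s < p_end
    )
    covered = 0.0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start   # close the previous interval
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)            # overlapping: extend it
    if cur_end is not None:
        covered += cur_end - cur_start
    return (p_end - p_start) - covered


# Sequential children: a 150ms get_cart with a 5ms and an 80ms child.
print(self_time_ms((100, 250), [(105, 110), (120, 200)]))
# Parallel children: summing durations (80 + 50) would exceed the parent.
print(self_time_ms((0, 100), [(0, 80), (10, 60)]))
```

In the parallel case a naive sum of child durations would yield a negative gap, while the union correctly attributes the uncovered remainder to the parent.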

Error root cause: if the root span has status ERROR and the payment span is the first ERROR span in the chain, the error originated in the payment service (even if the user sees an error from the API gateway).
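One simple heuristic for locating where an error originated is to look for ERROR spans that have no ERROR child: everything above them is likely just propagating the failure. The sketch below assumes spans are plain dictionaries with the keys shown; it is a heuristic, not a complete root-cause algorithm.

```python
def error_origins(spans):
    """Return ERROR spans that have no ERROR children.

    spans: list of dicts with 'span_id', 'parent_span_id', 'service'
    and 'status' keys.
    """
    error_spans = [s for s in spans if s["status"] == "ERROR"]
    # Any span that is the parent of an ERROR span is only passing it on.
    parents_of_errors = {s["parent_span_id"] for s in error_spans
                         if s["parent_span_id"] is not None}
    return [s for s in error_spans if s["span_id"] not in parents_of_errors]


checkout = [
    {"span_id": "g1", "parent_span_id": None, "service": "api-gateway", "status": "ERROR"},
    {"span_id": "a1", "parent_span_id": "g1", "service": "auth", "status": "OK"},
    {"span_id": "c1", "parent_span_id": "g1", "service": "cart", "status": "OK"},
    {"span_id": "p1", "parent_span_id": "g1", "service": "payment", "status": "ERROR"},
]
print([s["service"] for s in error_origins(checkout)])
```

Here the gateway span is marked ERROR as well, but only the payment span is reported, matching the reasoning in the paragraph above.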

Service dependency graph: extract from traces by counting unique (caller_service, callee_service) pairs. Build a runtime service topology map. This is more accurate than manually-maintained architecture diagrams.
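The dependency extraction can be sketched in a few lines: join each span to its parent and count the pairs where the service changes. The span dictionary shape and the helper names are assumptions for this example; the output is Graphviz DOT text.

```python
from collections import Counter


def service_edges(spans):
    """Count (caller_service, callee_service) pairs across a set of spans."""
    by_id = {s["span_id"]: s for s in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get(span.get("parent_span_id"))
        # Only count hops that cross a service boundary.
        if parent is not None and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return edges


def to_dot(edges):
    """Render the edge counts as a Graphviz DOT digraph."""
    lines = ["digraph services {"]
    for (caller, callee), count in sorted(edges.items()):
        lines.append(f'  "{caller}" -> "{callee}" [label="{count}"];')
    lines.append("}")
    return "\n".join(lines)


trace = [
    {"span_id": "g1", "parent_span_id": None, "service": "api-gateway"},
    {"span_id": "a1", "parent_span_id": "g1", "service": "auth"},
    {"span_id": "c1", "parent_span_id": "g1", "service": "cart"},
    {"span_id": "r1", "parent_span_id": "c1", "service": "cart"},  # Redis client span
    {"span_id": "p1", "parent_span_id": "g1", "service": "payment"},
]
print(to_dot(service_edges(trace)))
```

Running this over many traces and summing the counters gives edge weights that reflect actual call volume, which a hand-drawn diagram cannot show.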

Percentile analysis: aggregate latency across many traces to compute p50/p95/p99 per operation. This is available in Jaeger's Service Performance Monitoring (the "Monitor" tab) and in Tempo through span metrics and TraceQL metrics queries.
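For small data sets the same aggregation can be done by hand. The sketch below groups span durations by (service, operation) and uses the nearest-rank percentile definition; both the helper names and the choice of nearest-rank are assumptions, and tracing backends typically compute percentiles from histograms instead.

```python
import math
from collections import defaultdict


def nearest_rank(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_percentiles(spans, percentiles=(50, 95, 99)):
    """Group span durations by (service, operation) and compute percentiles."""
    by_operation = defaultdict(list)
    for span in spans:
        by_operation[(span["service"], span["operation"])].append(span["duration_ms"])
    return {
        op: {f"p{p}": nearest_rank(sorted(values), p) for p in percentiles}
        for op, values in by_operation.items()
    }


# 100 synthetic get_cart spans with durations 1..100 ms.
spans = [{"service": "cart", "operation": "get_cart", "duration_ms": d}
         for d in range(1, 101)]
print(latency_percentiles(spans))
```

With sampled traces these percentiles describe the sample, not the full traffic, so they are only trustworthy when the sampling is uniform or the metrics are computed before sampling.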

Flame Graphs for Traces vs CPU Flame Graphs

Both use the flame graph visualization format (Brendan Gregg, 2011), but they represent different things:

CPU Flame Graph: x-axis = the sample population, with stacks sorted alphabetically so identical frames merge (it is not a timeline); y-axis = call stack depth. Width of a frame = proportion of total CPU samples in which that function was on the stack.
Trace Flame Graph (Gantt chart / waterfall): x-axis = wall clock time, y-axis = span depth (nested calls). Width = span duration. This shows sequential vs parallel execution, gaps, and latency contributors.

The trace waterfall diagram above is the trace equivalent of a flame graph. Tools like Jaeger and Tempo render this natively.

TraceQL: Querying Traces

Grafana Tempo 2.0 introduced TraceQL, a query language for traces (analogous to PromQL for metrics, LogQL for logs):

# Find all traces with an error span in the payment service
{ resource.service.name = "payment-service" && status = error }

# Find slow (>500ms) PostgreSQL spans; duration here is the span's own duration, not the trace's
{ duration > 500ms && span.db.system = "postgresql" }

# Find traces where a specific user's request errored
{ resource.service.name = "api-gateway" && span.user.id = "u-12345" && status = error }

# Rate of matching spans per second, grouped by service and status (TraceQL metrics)
{ resource.service.name =~ ".*-service" } | rate() by (resource.service.name, status)

Historical Context

The distributed systems community at Google (Jeff Dean, Sanjay Ghemawat era) built Dapper as infrastructure for debugging production issues at scale — not as a product. The Dapper paper (2010) was the first systematic description of the challenges of distributed tracing at internet scale and remains required reading.

Twitter's Zipkin (2012) brought Dapper's ideas to the open-source ecosystem. Uber's massive microservices architecture (1000+ services by 2016) motivated Jaeger's development with better scalability than Zipkin. The OpenTracing project (Ben Sigelman, 2016) attempted to standardize tracing APIs, preventing vendor lock-in. When OpenCensus and OpenTracing merged into OpenTelemetry (2019), distributed tracing finally had a unified, vendor-neutral standard.

W3C TraceContext standardization (2021) solved the propagation format fragmentation — before standardization, every vendor used different header names, making cross-organization tracing impossible without translation.

Production Examples

# Query Jaeger for traces from the checkout service with errors in last 1 hour
curl "http://jaeger-query:16686/api/traces?service=checkout-service&tags=%7B%22error%22%3A%22true%22%7D&limit=20&lookback=1h" | jq '.data[0].spans[] | {operationName, duration, tags}'

# Search Tempo for slow traces via TraceQL
curl -G http://tempo:3200/api/search \
  --data-urlencode 'q={ duration > 1s && resource.service.name = "payment-service" }' \
  --data-urlencode 'limit=10'

# Get a specific trace by ID
curl http://tempo:3200/api/traces/4bf92f3577b34da6a3ce929d0e0e4736 | jq .

# Check Jaeger collector health via the admin port (this does not probe the storage backend directly)
curl http://jaeger-collector:14269/  # admin port

Debugging Notes

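Traces appear disconnected (spans from different services don't join):
1. Check that the traceparent header is being forwarded. curl does not add traceparent on its own, so either log the incoming request headers at Service B, or send a known value with curl -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01" http://service-a/endpoint and confirm that the spans from both services carry that trace ID.
2. Check that all services use the same propagator (W3C TraceContext). A service that emits only B3 headers will not join a trace with a service that reads only traceparent.
3. Verify clock synchronization. Clock skew does not break joining, but with >1s of skew between two services Jaeger may display spans out of order.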

Root span missing (see leaf spans but not the full trace):
- The root span (from the API gateway or frontend) may have been dropped by sampling. Increase the sampling rate for the entry-point service.
- The root span may still be in flight when you query. Tempo/Jaeger have ingestion delay; wait 10-30s after request completion.

Very long or "zombie" spans:
- A span with a 30+ minute duration usually means end_time was never recorded (context lost in a background goroutine). Ensure spans are always ended, preferably with defer span.End() in Go.

Security Implications

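Trace context headers (traceparent) from external clients must be treated as untrusted input. An attacker can send a well-formed traceparent to make malicious requests appear as continuations of legitimate traces, polluting trace data, or set the sampled flag to force recording on every request when services use parent-based sampling. A common mitigation is to discard incoming trace context at the public edge and start a new trace there.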
Span attributes may contain sensitive data: SQL queries (with parameters), HTTP request/response bodies, authorization tokens. Configure the OTel Collector's attributes processor to redact sensitive keys.
Trace storage backends (Jaeger, Tempo) contain detailed system topology information. This is a high-value target for reconnaissance. Apply network policies and access controls.
Sampling rates can be manipulated: if an attacker knows only error traces are sampled at 100%, they can craft requests that trigger errors to guarantee they appear in your trace store, potentially flooding it.
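The attribute redaction mentioned in the list above is normally configured in the Collector, but the underlying operation is simple enough to sketch in application code. The key list and function name below are illustrative assumptions; a real deployment should derive the list from the attributes its own instrumentation emits.

```python
# Attribute keys treated as sensitive in this example (an assumed list).
SENSITIVE_KEYS = frozenset({
    "db.statement",
    "http.request.body",
    "http.request.header.authorization",
})


def redact_attributes(attributes, sensitive_keys=SENSITIVE_KEYS, mask="[REDACTED]"):
    """Return a copy of the span attributes with sensitive values masked."""
    return {
        key: (mask if key.lower() in sensitive_keys else value)
        for key, value in attributes.items()
    }


attrs = {
    "http.method": "POST",
    "db.statement": "SELECT * FROM cards WHERE user_id = 'u-12345'",
    "http.request.header.authorization": "Bearer abc123",
}
print(redact_attributes(attrs))
```

Masking the value while keeping the key preserves the fact that the attribute existed, which is often useful when debugging, without storing the sensitive content.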

Performance Implications

Span creation: ~1-5 microseconds per span in Go, ~10-50 microseconds in Java.
Header propagation: the W3C traceparent header value is 55 bytes. Negligible compared to payload sizes.
Export via OTLP gRPC with batching: spans are queued and sent by a background goroutine, off the request path, so export typically adds <1ms to request latency.
Tail sampling memory: ~1-2KB per span in memory. 100k spans in memory = 100-200MB. Size your Collector accordingly.
Elasticsearch-backed Jaeger at 1 billion spans/day requires substantial infrastructure (multi-node ES cluster). Tempo's object storage model is 10-100x cheaper for the same volume.
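The tail sampling memory figure in the list above follows from simple arithmetic: the number of spans buffered is roughly the span arrival rate multiplied by how long traces are held before a decision. The helper below is a back-of-the-envelope sketch; the parameter names are hypothetical and the per-span size is the rough 1-2KB estimate from the list, not a measured value.

```python
def tail_sampling_memory_mb(spans_per_second, hold_seconds, bytes_per_span=1500):
    """Approximate memory needed to buffer spans while awaiting a decision."""
    buffered_spans = spans_per_second * hold_seconds
    return buffered_spans * bytes_per_span / 1_000_000


# 10,000 spans/s held for 10 s = 100,000 buffered spans.
low = tail_sampling_memory_mb(10_000, 10, bytes_per_span=1_000)
high = tail_sampling_memory_mb(10_000, 10, bytes_per_span=2_000)
print(f"{low:.0f}-{high:.0f} MB")
```

Doubling either the traffic or the hold time doubles the buffer, so the hold time is the main tuning knob when the Collector runs short of memory.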

Failure Modes and Real Incidents

Uber trace store saturation (2017 internal report): Jaeger's Cassandra cluster was overwhelmed during a traffic spike. Traces were dropped silently. The incident revealed that trace storage was not in the SLO for the tracing system itself — traces could be lost during exactly the incidents when they were most needed. Fix: dedicated, over-provisioned trace storage with capacity alerts.

Sampling bias in incident investigation: A team investigating a latency regression could not find the issue in traces because head sampling was set at 1%. Uniform sampling keeps slow requests in proportion, but the p99 tail is only 1% of traffic, so a 1% sample retains roughly 1 in 10,000 requests from the tail: too few to analyze. Fix: use adaptive sampling (higher rate for slow requests) or tail sampling.
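The arithmetic behind this incident is worth making explicit. The two helpers below are illustrative: one gives the expected number of sampled requests from the latency tail, the other the chance that at least one of k specific failing requests was captured under uniform head sampling.

```python
def expected_tail_samples(total_requests, tail_fraction, sample_rate):
    """Expected number of sampled requests that fall in the latency tail."""
    return total_requests * tail_fraction * sample_rate


def prob_at_least_one(k, sample_rate):
    """Probability that at least one of k independent requests was sampled."""
    return 1.0 - (1.0 - sample_rate) ** k


# 1,000,000 requests, slowest 1% (beyond p99), 1% head sampling.
print(expected_tail_samples(1_000_000, 0.01, 0.01))
# Chance of having a trace for a failure that occurred 5 times at 1% sampling.
print(round(prob_at_least_one(5, 0.01), 4))
```

A failure that happened only a handful of times is almost certainly absent from a 1% sample, which is why incident debugging tends to need either tail sampling or a temporarily raised rate.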

Missing traces due to context propagation through message queues: A Kafka consumer did not extract traceparent from the message headers. All message-processing traces were disconnected from the producer traces. Fix: OTel Kafka instrumentation injects traceparent into message headers on the producer side and extracts it on the consumer side; use the official instrumentation library.

Modern Usage

Tempo + Grafana is a widely used open-source trace storage + visualization stack in production Kubernetes deployments.
Jaeger v2 (2024): rebuilt on OTel Collector, adding OTLP-native ingestion and better scaling characteristics.
TraceQL (Grafana Tempo): enables rich programmatic trace search without full-text indexing, making trace analysis at scale affordable.
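Service Graph metrics from traces: RED (Rate, Errors, Duration) metrics can be derived from trace data, providing service-level metrics without additional instrumentation. Tempo does this in its metrics-generator; Jaeger's Service Performance Monitoring relies on the OTel Collector's spanmetrics connector.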
Exemplar integration: Prometheus histograms with exemplars link directly to Tempo traces, enabling metric-to-trace drill-down in Grafana.

Future Directions

Continuous tracing for database query plans: extending trace spans to include actual database execution plans, not just the SQL statement.
ML-based anomaly detection on traces: automatically identifying unusual span patterns (unexpected call sequences, new service dependencies) as potential indicators of incidents.
eBPF-based zero-instrumentation tracing: Pixie, Hubble, and similar tools extract spans directly from the kernel network stack without any application instrumentation, providing automatic service graph discovery.

Exercises

Trace waterfall analysis: Using Jaeger or Tempo, find the 10 slowest traces for a service over the last 24 hours. For each trace, identify: which span contributed the most latency? Was there any parallel fan-out that was efficient? Is there serial work that could be parallelized?
Propagation verification: Write a test that makes an HTTP request to Service A, which calls Service B. Assert that the traceparent header is present in Service B's incoming request and contains the same trace_id as Service A's outgoing request. Test what happens when you deliberately break context propagation.
Tail sampling configuration: Set up an OTel Collector with tail sampling. Configure it to keep 100% of error traces, 100% of traces >500ms, and 1% of all other traces. Generate 1000 traces (mix of fast/slow/error). Count how many traces appear in Jaeger and verify the sampling ratios match expectations.
Service graph extraction: Write a script that reads traces from Tempo or Jaeger API, extracts all unique (parent_service, child_service) pairs, and outputs a DOT graph. Render it with Graphviz. Compare with your architecture diagram — are there unexpected service dependencies?
Clock skew impact: Deliberately introduce a 5-second clock skew between two services (by adjusting system time in a container). Observe how Jaeger/Tempo handles out-of-order spans. Does the trace still visualize correctly? What is the maximum skew each system tolerates?

References

Sigelman, Ben et al. "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure." Google Technical Report, 2010. https://research.google.com/archive/papers/dapper-2010-1.pdf
W3C Trace Context Recommendation: https://www.w3.org/TR/trace-context/
Jaeger Documentation: https://www.jaegertracing.io/docs/
Grafana Tempo Documentation: https://grafana.com/docs/tempo/
OpenTelemetry Tracing Specification: https://opentelemetry.io/docs/specs/otel/trace/
Gregg, Brendan. "Flame Graphs." http://brendangregg.com/flamegraphs.html
Sridharan, Cindy. "Distributed Tracing — We've Been Doing It Wrong." Medium, 2019.
Sridharan, Cindy. Distributed Systems Observability. O'Reilly, 2018. Chapter 4.