01 — Observability Fundamentals

Technical Overview

Observability is the property of a system that allows you to understand its internal state from the signals it emits externally. The definition most cited in the SRE community comes from Charity Majors (Honeycomb): "Observability means you can understand what is happening inside your system from the outside, in arbitrary ways, without having to ship new code to answer new questions." The term is borrowed from control theory, where a system is observable if its internal state can be reconstructed from its outputs.

The key distinction from monitoring is directionality and flexibility. Monitoring is the practice of watching for known failure modes — you pre-define dashboards and alerts for anticipated problems. Observability is asking novel questions about unknown problems in real time. Monitoring answers "is this metric above threshold?" Observability answers "why is p99 latency for /checkout elevated only for users in region=eu-west-1 with payment_method=card on this specific pod?"
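To make the contrast concrete, here is a minimal sketch of the kind of ad hoc question observability tooling answers: grouping raw request events by arbitrary dimensions and computing p99 latency per group. The event fields (region, payment_method, duration_ms), the sample values, and the helper names are assumptions for illustration, not any vendor's API.

```python
from collections import defaultdict
from math import ceil


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    return ordered[ceil(0.99 * len(ordered)) - 1]


def p99_by(events, dimensions):
    """Group events by any combination of fields and return p99 latency per group."""
    groups = defaultdict(list)
    for event in events:
        key = tuple(event.get(name) for name in dimensions)
        groups[key].append(event["duration_ms"])
    return {key: p99(durations) for key, durations in groups.items()}


# Hypothetical wide events for /checkout; real events would carry many more fields.
EVENTS = [
    {"region": "eu-west-1", "payment_method": "card", "duration_ms": 850},
    {"region": "eu-west-1", "payment_method": "card", "duration_ms": 900},
    {"region": "eu-west-1", "payment_method": "card", "duration_ms": 880},
    {"region": "eu-west-1", "payment_method": "paypal", "duration_ms": 120},
    {"region": "eu-west-1", "payment_method": "paypal", "duration_ms": 110},
    {"region": "eu-west-1", "payment_method": "paypal", "duration_ms": 130},
    {"region": "us-east-1", "payment_method": "card", "duration_ms": 100},
    {"region": "us-east-1", "payment_method": "card", "duration_ms": 140},
    {"region": "us-east-1", "payment_method": "card", "duration_ms": 125},
]

by_region_and_method = p99_by(EVENTS, ["region", "payment_method"])
slowest_group = max(by_region_and_method, key=by_region_and_method.get)
print(slowest_group, by_region_and_method[slowest_group])
```

The point of the sketch is that the grouping dimensions are chosen at query time. A pre-aggregated metric would have needed those labels to be defined before the incident.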


Prerequisites

  • Familiarity with microservices architecture
  • Basic understanding of time-series data
  • Exposure to HTTP APIs and distributed request flows
  • Comfort reading system metrics (CPU, memory, request rates)

Core Content

The Three Pillars

The three-pillar model (metrics, logs, traces) has become the dominant framing for observability tooling. Each pillar captures a different slice of system behavior, and they are complementary rather than redundant.

OBSERVABILITY PYRAMID
                                    /\
                                   /  \
                                  / T  \       Traces: highest context,
                                 / r a  \      highest cost, sampled
                                /  c e  \
                               /________\
                              /          \
                             /   Logs     \    Logs: per-event detail,
                            /             \   searchable but expensive
                           /_______________\
                          /                 \
                         /     Metrics       \  Metrics: cheapest,
                        /                     \ aggregated, always-on
                       /_______________________\

  Low cardinality     <---------breadth---------->  High cardinality
  High aggregation    <---------detail----------->  Low aggregation
  Always on           <---------cost------------->  Sampled / selective

Metrics are numeric measurements aggregated over time. A counter of HTTP requests, from which a requests-per-second rate is derived, is a metric. Metrics are cheap: a single time-series stores one float64 per scrape interval regardless of request volume. The cost is loss of context: you know that p99 latency is 800ms, but not which user, which endpoint version, or which database query caused it. Metrics are essential for alerting (SLO burn rate) and dashboards. The cardinality problem bites when you add too many label dimensions: a metric with 5 labels, each having 100 unique values, has 100^5 = 10 billion possible series.

Logs are timestamped records of discrete events. A log entry can carry arbitrary context: user ID, request parameters, stack traces, downstream call results. Logs are the most familiar data type. The problem is cost at scale: a service handling 100,000 req/s producing 500-byte JSON log entries generates ~50 MB/s of raw log data, ~4 TB/day. Storage, indexing, and querying at that volume requires careful architecture. Unstructured logs (printf-style text) are a maintenance burden; structured logs (JSON with consistent field names) enable reliable query and extraction.
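The following sketch shows one way to emit structured logs with consistent field names using only the Python standard library, and reproduces the volume arithmetic above. The logger name, the field names, and the convention of passing context through extra={"context": ...} are assumptions made for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with consistent field names."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass context via extra={"context": {...}}.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def daily_log_volume_tb(requests_per_second: int, bytes_per_entry: int) -> float:
    """Raw log volume in terabytes per day (1 TB = 10**12 bytes)."""
    return requests_per_second * bytes_per_entry * 86_400 / 1e12


buffer = io.StringIO()
log = build_logger(buffer)
log.info(
    "payment declined",
    extra={"context": {"user_id": "u-123456", "trace_id": "4bf92f3577b34da6"}},
)
print(buffer.getvalue().strip())
print(daily_log_volume_tb(100_000, 500))  # the 100,000 req/s, 500-byte case
```

Because every entry is a JSON object with the same keys, a query such as "all entries where user_id is u-123456" becomes a field match rather than a regular expression over free text.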

Traces represent the journey of a single request across multiple services. A trace is composed of spans — one per service or significant operation — linked by a shared trace ID. Traces provide causality: you can see that the checkout service called the inventory service, which called the database, and the database query took 650ms of the 800ms total. Traces require propagation of context (trace ID, span ID) through all calls. At high request rates, tracing every request is impractical, so sampling is used.
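As an illustration of context propagation, the sketch below formats and parses a W3C Trace Context traceparent header (version 00: a 32-hex-character trace ID, a 16-hex-character parent span ID, and a flags byte). It is a simplified stand-in for what a tracing SDK does; the SpanContext type and function names are hypothetical, and only version 00 is handled.

```python
import re
import secrets
from typing import NamedTuple, Optional


class SpanContext(NamedTuple):
    trace_id: str   # 32 lowercase hex chars, shared by every span in the trace
    span_id: str    # 16 lowercase hex chars, unique per span
    sampled: bool


_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_root_context(sampled: bool = True) -> SpanContext:
    """Start a new trace with random IDs."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def child_of(parent: SpanContext) -> SpanContext:
    """A child span keeps the trace ID and gets a fresh span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def to_header(ctx: SpanContext) -> str:
    """Serialize the context for an outgoing request."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def from_header(value: str) -> Optional[SpanContext]:
    """Parse an incoming header; return None for malformed or all-zero IDs."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


checkout = new_root_context()
inventory = from_header(to_header(checkout))  # what the callee receives
db_span = child_of(inventory)
print(to_header(db_span))
```

Every span created downstream shares checkout's trace ID, which is what lets a backend reassemble the checkout, inventory, and database spans into one trace.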

Pillars Are Not Independent

The three pillars interact in ways that make them more powerful combined:

  • Logs → Metrics: You can derive metrics from log streams (log-based metrics). Counting ERROR log lines gives an error rate metric. Tools like Google's mtail (commonly paired with Prometheus), Grafana Loki's metric queries, and CloudWatch metric filters do this.
  • Metrics → Logs: When a metric alert fires, you want to drill into logs for that time window to understand why. This requires correlation by time range and service labels.
  • Traces ↔ Logs: A trace span can carry a log entry embedded as a span event. A log line can carry a trace ID, allowing you to pivot from a log query result to the full distributed trace.
  • Metrics ↔ Traces via Exemplars: An exemplar is a specific trace ID attached to a metric sample. Prometheus histograms support exemplars. When you observe a p99 spike on a histogram, the exemplar lets you jump directly to a representative trace that was sampled during that spike.
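The Logs → Metrics direction can be sketched in a few lines: count ERROR entries per minute to obtain an error-rate series. The log line layout assumed here (ISO timestamp, level, then free text) and the sample lines are hypothetical; real tools work on streams rather than in-memory lists.

```python
from collections import Counter
from datetime import datetime


def error_counts_per_minute(lines):
    """Derive a per-minute error count from lines shaped '<iso-ts> <LEVEL> <message>'."""
    counts = Counter()
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) < 3 or parts[1] != "ERROR":
            continue
        try:
            stamp = datetime.fromisoformat(parts[0])
        except ValueError:
            continue  # skip lines whose timestamp cannot be parsed
        minute = stamp.replace(second=0, microsecond=0)
        counts[minute.isoformat()] += 1
    return dict(counts)


SAMPLE_LINES = [
    "2024-05-01T14:32:07 ERROR payment gateway timeout trace_id=4bf92f3577b34da6",
    "2024-05-01T14:32:41 INFO order created",
    "2024-05-01T14:32:59 ERROR payment gateway timeout trace_id=8d4a2c9b12fe3a01",
    "2024-05-01T14:33:02 ERROR inventory lookup failed trace_id=9c3b1d8e56fa4b02",
    "garbage line",
]

print(error_counts_per_minute(SAMPLE_LINES))
```

The trace_id fields in the sample lines are what would support the Traces ↔ Logs pivot: a log query result carries enough information to open the corresponding trace.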

MELT Framework

MELT (Metrics, Events, Logs, Traces) is an extension of the three-pillar model popularized by New Relic. Events are discrete records with rich attributes that don't fit neatly into logs (e.g., a deployment event, a feature flag change, a database schema migration). Events act as change markers on time-series graphs and are essential for correlating "the p99 degraded after this deployment at 14:32 UTC."
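A minimal sketch of how events serve as change markers: given a list of change events and the time a metric degraded, list the changes that happened shortly before. The ChangeEvent type, the 30-minute lookback, and the sample events are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ChangeEvent:
    at: datetime
    kind: str      # e.g. "deployment", "feature_flag", "schema_migration"
    detail: str


def changes_before(events, degraded_at, lookback=timedelta(minutes=30)):
    """Return events inside the lookback window, most recent first."""
    window_start = degraded_at - lookback
    hits = [event for event in events if window_start <= event.at <= degraded_at]
    return sorted(hits, key=lambda event: event.at, reverse=True)


def utc(hour, minute):
    """Helper for the sample data: a fixed day in UTC."""
    return datetime(2024, 5, 1, hour, minute, tzinfo=timezone.utc)


EVENTS = [
    ChangeEvent(utc(13, 10), "feature_flag", "new_search enabled for 5% of users"),
    ChangeEvent(utc(14, 32), "deployment", "checkout service rollout"),
    ChangeEvent(utc(15, 5), "schema_migration", "orders table index added"),
]

suspects = changes_before(EVENTS, degraded_at=utc(14, 40))
for event in suspects:
    print(event.at.isoformat(), event.kind, event.detail)
```

Here only the 14:32 deployment falls inside the window, which mirrors the "p99 degraded after this deployment" correlation described above.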

Cardinality Problem

Cardinality is the number of unique values a dimension can take. Typical high-cardinality dimensions include user IDs, request IDs, URLs with path parameters, and IP addresses. In metric systems like Prometheus, each unique combination of label values creates a separate time-series, so high-cardinality labels make metric storage explode:

metric{service="api", region="us-east-1", user_id="u-123456"} 1
metric{service="api", region="us-east-1", user_id="u-789012"} 1
# 10 million users = 10 million series for one metric

This is why metrics are inherently low-cardinality aggregations. For high-cardinality analysis (debugging per-user, per-request behavior), traces and logs are the appropriate tools.
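The arithmetic behind the explosion is a simple product of per-label cardinalities. The sketch below computes that upper bound; the label names and counts in the second example are hypothetical.

```python
from math import prod


def possible_series(label_cardinalities: dict) -> int:
    """Upper bound on series for one metric: product of per-label cardinalities."""
    return prod(label_cardinalities.values())


# Five labels with 100 values each.
five_labels = {f"label_{i}": 100 for i in range(5)}

# Hypothetical bounded labels, then the same metric with a user_id label added.
bounded = {"service": 3, "region": 4, "status_class": 5}
with_user_id = {**bounded, "user_id": 10_000_000}

print(possible_series(five_labels))
print(possible_series(bounded))
print(possible_series(with_user_id))
```

The result is an upper bound: only label combinations that actually occur create series, but an unbounded label such as user_id tends to push the real count toward that bound.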

Observability-Driven Development

Observability-driven development (ODD) means building instrumentation as part of feature development, not as an afterthought. Before shipping a feature, you define what signals it should emit and what questions you need to be able to answer post-deployment. This parallels test-driven development but for production behavior. The practice requires:

  1. Adding structured log statements with relevant context at every branch
  2. Instrumenting key operations as trace spans
  3. Exposing metrics for SLO-relevant behaviors
  4. Defining SLO dashboards before going to production

Observability for Microservices

Traditional monitoring worked adequately for monolithic applications: you knew which host a request hit, you could look at its logs, you could correlate CPU and application metrics on one machine. In microservices, a single user request may traverse 10-30 services, each running multiple replicas across multiple availability zones. Failures are no longer isolated to one component — they are emergent properties of interactions. A downstream service's database connection exhaustion, combined with a misconfigured circuit breaker, combined with a retry storm, produces a symptom (elevated latency in the frontend) with no single root cause visible from any one service's metrics alone.

Observability addresses this by providing cross-service visibility through distributed tracing, correlated log queries across services, and service-graph analysis from trace data.

Exemplars: Linking Metrics to Traces

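Exemplars (defined in the OpenMetrics standard; Prometheus has supported them since 2.26, behind the exemplar-storage feature flag) attach a trace ID to a specific histogram observation. In the exposition format, a bucket sample can carry an exemplar after a # separator: a label set such as a trace ID, followed by the observed value: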

# HELP http_request_duration_seconds HTTP request duration histogram
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 24054 # {trace_id="4bf92f3577b34da6"} 0.054
http_request_duration_seconds_bucket{le="0.5"} 33444 # {trace_id="8d4a2c9b12fe3a01"} 0.387
http_request_duration_seconds_bucket{le="1.0"} 33521 # {trace_id="9c3b1d8e56fa4b02"} 0.922

In Grafana, when you visualize a histogram panel, exemplar dots appear on the graph. Clicking a dot opens the associated trace in Tempo/Jaeger, giving you the specific request that contributed to that latency bucket.
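To show what a consumer of this format does with exemplars, the sketch below parses histogram bucket lines, checks that bucket counts are cumulative, and picks the exemplar with the largest observed value. It is a simplified regex-based reader for this specific line shape, not a full OpenMetrics parser; the sample counts and the type and function names are assumptions.

```python
import re
from typing import NamedTuple, Optional


class BucketSample(NamedTuple):
    le: float
    cumulative_count: int
    trace_id: Optional[str]
    exemplar_value: Optional[float]


_BUCKET_LINE = re.compile(
    r'^\w+_bucket\{le="(?P<le>[^"]+)"\}\s+(?P<count>\d+)'
    r'(?:\s+#\s+\{trace_id="(?P<trace_id>[0-9a-f]+)"\}\s+(?P<value>\S+))?\s*$'
)


def parse_bucket(line: str) -> Optional[BucketSample]:
    """Parse one bucket line; return None for comments and other line types."""
    match = _BUCKET_LINE.match(line.strip())
    if match is None:
        return None
    value = match.group("value")
    return BucketSample(
        le=float(match.group("le")),
        cumulative_count=int(match.group("count")),
        trace_id=match.group("trace_id"),
        exemplar_value=float(value) if value is not None else None,
    )


def counts_are_cumulative(samples) -> bool:
    """Histogram buckets are cumulative, so counts must not decrease as le grows."""
    ordered = sorted(samples, key=lambda s: s.le)
    return all(a.cumulative_count <= b.cumulative_count
               for a, b in zip(ordered, ordered[1:]))


def slowest_exemplar(samples) -> Optional[BucketSample]:
    """The bucket whose exemplar has the largest observed value, if any."""
    with_exemplar = [s for s in samples if s.exemplar_value is not None]
    return max(with_exemplar, key=lambda s: s.exemplar_value, default=None)


EXPOSITION = '''
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 24054 # {trace_id="4bf92f3577b34da6"} 0.054
http_request_duration_seconds_bucket{le="0.5"} 33444 # {trace_id="8d4a2c9b12fe3a01"} 0.387
http_request_duration_seconds_bucket{le="1.0"} 33521 # {trace_id="9c3b1d8e56fa4b02"} 0.922
http_request_duration_seconds_bucket{le="+Inf"} 33530
'''

samples = [s for s in map(parse_bucket, EXPOSITION.splitlines()) if s is not None]
print(counts_are_cumulative(samples))
print(slowest_exemplar(samples).trace_id)
```

The trace ID returned by slowest_exemplar is the value a UI would use to open the corresponding trace in the tracing backend.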


Historical Context

The term "observable system" originates from control theory (Rudolf Kalman, 1960), where it described whether a system's state vector could be determined from its output. Application of the concept to distributed systems gained traction around 2016-2018 as microservice architectures became mainstream at mid-to-large-scale companies.

The "three pillars" framing became industry standard through the work of Ben Sigelman (LightStep), Peter Bourgon's blog posts, and the book Distributed Tracing in Practice (2020). Charity Majors and Liz Fong-Jones at Honeycomb pushed the observability-vs-monitoring distinction hard, arguing that the industry was investing in better monitoring when it needed a fundamentally different approach for complex distributed systems.

The OpenTelemetry project, formed in 2019 by merging OpenCensus and OpenTracing, was the first serious attempt to standardize instrumentation APIs across all three pillars.


Production Examples

Shopify restructured their observability stack around high-cardinality events after their 2020 Black Friday incidents. Their Snowplow-based event pipeline allows arbitrary queries over 100+ dimensions per request without pre-aggregation.

Cloudflare uses a metrics-first approach for global network observability (their network sits in front of roughly 20% of websites), supplemented by eBPF-based packet-level tracing for specific investigations. Their blog post "How we scaled our observability" (2023) details the Prometheus federation and Thanos setup they use for global metric aggregation.

Netflix developed Atlas, a dimensional time-series platform built around in-memory storage, for near-real-time operational metrics at their scale.


Debugging Notes

When an alert fires and you start investigating:

  1. Start with metrics: Identify the SLI that degraded, the time window, and which services/regions are affected. Use Prometheus or your metric store.
  2. Pivot to traces: Use trace search to find slow or erroring traces in the affected window. Filter by service, error status, or latency percentile.
  3. Drill into logs: From a specific trace, follow span IDs to correlated logs for detailed context (SQL queries, external API responses, error messages).
  4. Cross-reference with events: Check for deployments, config changes, or infrastructure events in the same time window.

Common mistake: starting with logs when the problem spans multiple services. Logs are service-local; traces are cross-service. For distributed problems, traces should be step two, not step four.


Security Implications

  • PII in observability data: Traces and logs frequently carry user IDs, IP addresses, email addresses, and request parameters. Regulations such as GDPR (personal data) and HIPAA (protected health information) restrict how long such data may be retained and who may access it. Log scrubbing (redacting card numbers, SSNs, emails) must happen at ingestion, not at query time.
  • Observability platform as attack surface: The observability stack (Prometheus, Grafana, Jaeger) often has weaker authentication than production services. Compromised observability tooling gives an attacker full visibility into system topology, traffic patterns, and internal API shapes.
  • Trace context injection: Malicious HTTP clients can inject forged traceparent headers. Systems that trust incoming trace IDs without validation can be manipulated into attributing malicious traffic to legitimate users.
  • Metric cardinality attacks: An attacker who can control a label value (e.g., a URL path parameter that becomes a metric label) can cause cardinality explosion, exhausting Prometheus memory and causing a DoS.
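As a sketch of ingestion-time scrubbing, the function below redacts email addresses, US SSN-shaped strings, and card numbers that pass a Luhn check before a log message is stored. The patterns and placeholder tokens are illustrative assumptions and are deliberately simple; a production scrubber would need broader patterns and testing against real traffic.

```python
import re

_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
_CARD = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum, used to avoid redacting arbitrary long numbers."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def _redact_card(match: re.Match) -> str:
    digits = re.sub(r"\D", "", match.group(0))
    return "[REDACTED_CARD]" if luhn_ok(digits) else match.group(0)


def scrub(message: str) -> str:
    """Redact PII from a log message before it is written to storage."""
    message = _EMAIL.sub("[REDACTED_EMAIL]", message)
    message = _SSN.sub("[REDACTED_SSN]", message)
    return _CARD.sub(_redact_card, message)


print(scrub("user a.b@example.com paid with 4111 1111 1111 1111, ssn 123-45-6789"))
print(scrub("order 1234567890123456 shipped"))
```

The Luhn check is a design choice to reduce false positives: a 16-digit order number that fails the checksum is left intact, while a well-formed card number is removed.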

Performance Implications

  • Metrics scraping adds <1ms overhead per scrape endpoint in typical Prometheus setups; negligible.
  • Structured logging with JSON serialization adds ~2-5 microseconds per log statement in Go; more in Java due to GC pressure.
  • Distributed tracing with 100% sampling adds ~50-200 microseconds of latency per span in typical gRPC-based systems (header propagation + span creation + export).
  • Cardinality explosion in Prometheus can cause OOM. Each time-series consumes ~3-5 KB of RAM in the Prometheus head block. 10 million series = 30-50 GB RAM.
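The memory figure in the last bullet is straightforward arithmetic, sketched below together with the inverse question of how many series fit in a given RAM budget. The per-series cost is the rough 3-5 KB estimate quoted above, decimal units are assumed (1 KB = 1000 bytes), and the 16 GB budget is a hypothetical example.

```python
def head_memory_gb(series_count: int, kb_per_series: float) -> float:
    """Estimated head-block RAM in GB for a given number of active series."""
    return series_count * kb_per_series * 1_000 / 1e9


def max_series_for_budget(ram_gb: float, kb_per_series: float) -> int:
    """How many active series fit in a RAM budget at a given per-series cost."""
    return int(ram_gb * 1e9 // (kb_per_series * 1_000))


low = head_memory_gb(10_000_000, 3)
high = head_memory_gb(10_000_000, 5)
print(f"10 million series: {low:.0f}-{high:.0f} GB")
print(max_series_for_budget(16, 5))  # hypothetical 16 GB budget, pessimistic cost
```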

Failure Modes and Real Incidents

GitLab database incident (2017): An engineer accidentally removed the data directory on the primary PostgreSQL server instead of the replica. The incident revealed gaps in monitoring: backup jobs were failing silently, and no alert existed for backup age. This is a classic monitoring-vs-observability failure: the metric "backup count" would not have caught it, while "time since last successful backup validation" would have.
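A freshness signal of the kind described can be sketched as a check on the age of the last successful backup validation, where "never validated" counts as stale. The function name, the 26-hour threshold, and the sample timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_validation_is_stale(
    last_validated_at: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=26),
) -> bool:
    """True when the last successful validation is missing or older than max_age."""
    if last_validated_at is None:
        return True  # a backup that was never validated should alert too
    return now - last_validated_at > max_age


now = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2024, 5, 2, 3, 0, tzinfo=timezone.utc)
old = datetime(2024, 4, 30, 3, 0, tzinfo=timezone.utc)

print(backup_validation_is_stale(fresh, now))
print(backup_validation_is_stale(old, now))
print(backup_validation_is_stale(None, now))
```

Alerting on age rather than on job failures means a job that silently stops running still trips the alert, because the timestamp stops advancing.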

Facebook October 2021 outage: The BGP route withdrawal that took down Facebook for 6 hours also took down their internal observability infrastructure (it was deployed in the same network fabric). This illustrates a critical design principle: observability infrastructure must be in a separate failure domain from the systems it observes.

Honeycomb war story (Charity Majors, 2018): A p99 latency spike that metrics showed was in "database queries" turned out to be caused by a specific combination of user_type=trial, feature_flag=new_search, and db_shard=shard-07. No pre-aggregated metric would have revealed this combination. Only high-cardinality event analysis (Honeycomb's model) made it debuggable in minutes rather than hours.


Modern Usage

  • OpenTelemetry is now the standard instrumentation API; all major vendors (Datadog, Dynatrace, New Relic, Honeycomb) accept OTLP.
  • Grafana stack (Prometheus + Loki + Tempo + Grafana) provides a fully open-source, tightly integrated three-pillar solution. Grafana's "Explore" view enables metric-to-trace-to-log correlation workflows.
  • Continuous profiling is emerging as the fourth pillar (see 08-continuous-profiling.md), providing always-on CPU and memory profiles correlated with traces.
  • OpenFeature + observability: Feature flag change events are increasingly first-class observability events, enabling correlation of flag rollouts with metric changes.

Future Directions

  • AI-assisted root cause analysis: Tools like Datadog Watchdog and Honeycomb's AI Query Assistant attempt to automatically correlate anomalies across pillars. The challenge is false positives and explainability.
  • Causality inference: Moving from correlation (two metrics spike together) to causality (service A's latency caused service B's error rate) using causal graph learning on trace data.
  • eBPF as the universal collector: eBPF promises zero-code instrumentation for all three pillars — metrics, logs, and traces — extracted directly from kernel execution without any application code changes.
  • Observability for ML systems: Standard metrics (latency, error rate) don't capture ML-specific failure modes (data drift, model degradation, training/serving skew). ML observability is an active research area.

Exercises

  1. Three-pillar audit: Take an existing service you operate. For each of the three pillars, identify: what data you are currently collecting, what questions you cannot currently answer, and what you would need to add to answer them.

  2. Cardinality calculation: Given a metric with labels service (10 values), region (5 values), status_code (50 values), user_tier (3 values), endpoint (200 values): calculate the total number of possible time-series. Now add a request_id label (millions of values) and recalculate. Explain why request_id is never an appropriate metric label.

  3. Exemplar chain: Set up a local environment with Prometheus (with exemplars enabled), an app instrumented with OpenTelemetry (exporting traces to Tempo), and Grafana. Generate load, trigger a slow request, and trace the path from a histogram spike in Grafana → exemplar dot → specific trace in Tempo → correlated log lines in Loki.

  4. Observability failure mode: Design an observability architecture for a payments service where the observability infrastructure itself must remain operational if the primary datacenter fails. Specify: where metrics, logs, and traces are stored, how they survive a network partition, and what is sacrificed (completeness, latency, cost).

  5. ODD practice: For a new feature ("user can export their data to CSV"), write the observability requirements before any code: what metrics to expose, what log fields every log statement must include, what spans to create in the trace, and what SLO to define for the feature.


References

  • Majors, Charity, Fong-Jones, Liz, and Miranda, George. Observability Engineering. O'Reilly, 2022.
  • Beyer, Betsy et al. Site Reliability Engineering. O'Reilly, 2016. Chapter 6 (Monitoring Distributed Systems).
  • Sigelman, Ben et al. "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure." Google Technical Report, 2010.
  • OpenTelemetry Project: https://opentelemetry.io/docs/
  • Bourgon, Peter. "Metrics, tracing, and logging." https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
  • Gregg, Brendan. Systems Performance: Enterprise and the Cloud. 2nd ed. Addison-Wesley, 2020.
  • OpenMetrics specification (exemplars): https://openmetrics.io/