Section 23: Observability — Overview
Section Purpose and Scope
This section covers observability engineering at systems depth: the internal mechanics of metrics collection and storage, the data models underlying distributed tracing, log aggregation pipelines at scale, and the emerging eBPF-based approaches that make instrumentation optional. Observability is distinguished from monitoring by its ability to answer novel questions about system state without requiring new instrumentation to be deployed. This section builds that capability by explaining how each signal type works, where it comes from, and how to design alerting and dashboarding that reduces cognitive load rather than increasing it.
Prerequisites
- Section 15: Networking (TCP/IP, for understanding network-level observability)
- Section 17: Distributed Systems (causality, clock synchronization, trace propagation)
- Section 20: Containers (container labeling, cgroup metrics)
- Section 22: Kubernetes Internals (pod annotations, service discovery for scraping)
Learning Objectives
- Explain the three pillars model and its limitations; articulate why correlation between signals matters.
- Describe Prometheus's pull model, TSDB storage format, and PromQL evaluation semantics.
- Explain OpenTelemetry's architecture: SDK, collector pipeline, exporters, and context propagation.
- Trace a distributed request through span propagation, parent-child relationships, and trace assembly.
- Design a log aggregation pipeline using Loki or ELK with appropriate retention and index strategy.
- Articulate how eBPF-based tools provide observability without application code changes.
- Write SLI/SLO definitions, calculate error budget burn rates, and design appropriate alerts.
- Identify anti-patterns in dashboard and alerting design.
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│                        Signal Generation                        │
│                                                                 │
│  Application Code      Kernel / eBPF           Agent/Sidecar    │
│  ┌─────────────────┐   ┌──────────────────┐    ┌─────────────┐  │
│  │ OTel SDK        │   │ kprobes, uprobes │    │ Node agent  │  │
│  │ Manual spans    │   │ tracepoints      │    │ (Datadog,   │  │
│  │ Auto-instrument │   │ XDP, tc hooks    │    │  Vector,    │  │
│  │ Metrics API     │   │ perf_events      │    │  Fluent Bit)│  │
│  └────────┬────────┘   └────────┬─────────┘    └──────┬──────┘  │
└───────────┼─────────────────────┼─────────────────────┼─────────┘
            │                     │                     │
            ▼                     ▼                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                     OpenTelemetry Collector                     │
│               Receivers → Processors → Exporters                │
│          (OTLP, Jaeger, Prometheus, Zipkin receivers)           │
│      (batch, filter, attribute, tail-sampling processors)       │
│     (Prometheus remote write, Loki, Tempo, OTLP exporters)      │
└──────────────────────────┬──────────────────────────────────────┘
                           │
       ┌───────────────────┼───────────────────┐
       ▼                   ▼                   ▼
┌─────────────┐    ┌────────────────┐   ┌──────────────┐
│ Prometheus  │    │ Loki / ELK     │   │ Tempo /      │
│ + Thanos    │    │ (Logs)         │   │ Jaeger /     │
│ (Metrics)   │    │                │   │ Zipkin       │
│ TSDB        │    │ Chunks +       │   │ (Traces)     │
│ WAL + head  │    │ inverted index │   │              │
└──────┬──────┘    └───────┬────────┘   └──────┬───────┘
       │                   │                   │
       └───────────────────▼───────────────────┘
                  ┌─────────────────┐
                  │     Grafana     │
                  │  (unified UI)   │
                  │  Dashboards,    │
                  │  Alerts,        │
                  │  Correlations   │
                  └─────────────────┘
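The tail-sampling processor named in the collector stage can be illustrated with a minimal Python sketch. This is not the OpenTelemetry Collector's implementation. The `Span` fields, the `tail_sample` function, the 500 ms latency threshold, and the 10% baseline rate are all hypothetical choices for illustration. The point it shows is that the keep/drop decision is made per trace after its spans have been buffered, so error and slow traces can always be retained.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical, simplified span record (real spans carry far more fields).
    trace_id: str
    name: str
    duration_ms: float
    is_error: bool = False


def tail_sample(spans, latency_threshold_ms=500.0, baseline_rate=0.1, rng=None):
    """Return the set of trace IDs to keep, deciding once per buffered trace.

    Assumed policy: keep every trace that contains an error or a slow span,
    and keep the remaining traces with probability `baseline_rate`.
    """
    rng = rng or random.Random()

    # Buffer spans by trace ID so the decision sees the whole trace.
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span.trace_id].append(span)

    kept = set()
    for trace_id, trace_spans in by_trace.items():
        has_error = any(s.is_error for s in trace_spans)
        is_slow = any(s.duration_ms >= latency_threshold_ms for s in trace_spans)
        if has_error or is_slow or rng.random() < baseline_rate:
            kept.add(trace_id)
    return kept
```

The buffering step is the cost of this approach: every span must be held in memory until its trace is judged, which is why tail sampling is usually run in a collector tier rather than inside the application.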
Distributed Trace Structure:
Request ──────────────────────────────────────────────────── time
  │
  └── Span: frontend [TraceID=abc, SpanID=001]
        │
        ├── Span: api-gateway [SpanID=002, ParentID=001]
        │     │
        │     ├── Span: auth-service [SpanID=003, ParentID=002]
        │     │
        │     └── Span: user-service [SpanID=004, ParentID=002]
        │           │
        │           └── Span: postgres [SpanID=005, ParentID=004]
        │
        └── Span: cdn-lookup [SpanID=006, ParentID=001]
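Trace assembly from the parent-child links above can be sketched in a few lines of Python. The span IDs and names are taken from the diagram. The tuple layout and the `assemble` and `render` helpers are assumptions made for illustration, not part of any tracing backend's API. Spans are deliberately listed out of order, since a backend generally cannot assume they arrive in causal order.

```python
from collections import defaultdict

# Spans from the diagram as (span_id, parent_id, name); parent_id None marks the root.
# They are listed out of order on purpose.
SPANS = [
    ("005", "004", "postgres"),
    ("002", "001", "api-gateway"),
    ("001", None, "frontend"),
    ("004", "002", "user-service"),
    ("003", "002", "auth-service"),
    ("006", "001", "cdn-lookup"),
]


def assemble(spans):
    """Index spans by parent ID and return (root_id, children, names)."""
    children = defaultdict(list)
    names = {}
    root = None
    for span_id, parent_id, name in spans:
        names[span_id] = name
        if parent_id is None:
            root = span_id
        else:
            children[parent_id].append(span_id)
    for kids in children.values():
        kids.sort()  # deterministic sibling order for display
    return root, children, names


def render(root, children, names, depth=0):
    """Return one indented line per span, in depth-first order."""
    lines = ["  " * depth + names[root]]
    for child in children.get(root, []):
        lines.extend(render(child, children, names, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(render(*assemble(SPANS))))
```

A real backend also has to handle spans whose parent never arrives (orphans) and traces with more than one root. This sketch assumes a single complete trace.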
Key Concepts
- Three Pillars of Observability: Metrics (aggregated numeric measurements over time), Logs (discrete events with context), Traces (causally linked spans across service boundaries). No single pillar is sufficient; correlation between them enables diagnosis.
- Prometheus Data Model: Time series identified by metric name + label set (key-value pairs). Four metric types: Counter (monotonically increasing), Gauge (arbitrary value), Histogram (bucketed distribution), Summary (client-side quantiles). Pull model: Prometheus scrapes /metrics endpoints.
- Prometheus TSDB: Head block in memory (WAL-backed), compacted to immutable blocks on disk. Series indexed by label inverted index. Samples stored in chunks (delta/XOR encoding). High cardinality label values are the primary scaling constraint.
- PromQL: Functional query language. Instant vectors, range vectors, aggregation operators. rate(), histogram_quantile(), topk(), recording rules for expensive queries.
- OpenTelemetry (OTel): CNCF project unifying metrics, logs, and traces under a single API/SDK and wire protocol (OTLP). Replaces fragmented vendor SDKs. Context propagation via W3C TraceContext headers.
- Distributed Tracing: A trace is a DAG of spans. Each span records operation name, timestamps, attributes, events, and links. Propagation: TraceID + SpanID passed in HTTP headers (W3C TraceContext) or gRPC metadata.
- Tail Sampling: Sampling decision made after the trace is complete (vs head sampling at request start). Allows keeping 100% of traces with errors or high latency while sampling normal traffic. Requires buffering.
- Loki: Log aggregation system that indexes only labels (like Prometheus), not log content. Log content stored compressed in object storage (chunks). LogQL for querying. Low cost at scale vs Elasticsearch.
- ELK Stack: Elasticsearch (inverted full-text index, JSON documents), Logstash (pipeline), Kibana (UI). High cardinality full-text search at cost of more storage and operational complexity.
- eBPF Observability: Programs attached to kernel tracepoints, kprobes, uprobes extract metrics and traces without modifying application code. Pixie, Hubble, Parca use eBPF for automatic service map generation and continuous profiling.
- Continuous Profiling: Always-on low-overhead CPU/memory profiling. pprof format (Go). Pyroscope, Parca aggregate flame graphs over time. Enables regression detection without manual profiling.
- SLI (Service Level Indicator): Quantitative measure of service behavior. Availability SLI = successful requests / total requests.
- SLO (Service Level Objective): Target for an SLI. "99.9% of requests < 200ms over 30 days."
- Error Budget: 100% - SLO. Remaining budget drives deployment risk decisions. Burn rate alerts detect budget exhaustion before it occurs.
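- Multiwindow/Multi-burn-rate Alerting: Alert only when the error budget is burning fast over both a long window and a much shorter confirmation window, with a burn-rate threshold chosen per pair (the Google SRE Workbook, for example, pairs 1h with 5m and 6h with 30m). The short window lets the alert clear quickly once the burn stops. Reduces alert fatigue vs threshold-only alerting.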
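The SLI, error budget, and burn-rate definitions above reduce to a few lines of arithmetic. The following Python sketch assumes request counts are already available per window. The function names are hypothetical, and the 14.4 threshold used in the usage note is an example value from the Google SRE Workbook, not something this section prescribes.

```python
def error_budget(slo):
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def burn_rate(failed, total, slo):
    """Observed error ratio divided by the error budget.

    A burn rate of 1.0 spends the budget exactly over the SLO window;
    a burn rate of 2.0 spends it in half that time.
    """
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo)


def should_page(long_window, short_window, slo, threshold):
    """Multiwindow check: both windows must meet the burn-rate threshold.

    Each window is a (failed, total) tuple of request counts.
    """
    return (
        burn_rate(*long_window, slo) >= threshold
        and burn_rate(*short_window, slo) >= threshold
    )
```

For a 99.9% SLO, 20 failures in 1,000 requests is a 2% error ratio against a 0.1% budget, a burn rate of about 20. With a 30-day window, a sustained burn rate of 14.4 would exhaust the budget in 720 / 14.4 = 50 hours. Calling `should_page((20, 1000), (1, 1000), 0.999, 14.4)` returns `False`, because the short window shows the burn has already subsided.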
Major Historical Milestones
| Year | Event |
|---|---|
| 2003 | Google Dapper tracing system (internal) — distributed tracing concept |
| 2010 | Graphite + StatsD — push-model metrics for web applications |
| 2012 | Twitter open-sources Zipkin — first widely used open-source tracer |
| 2012 | Elasticsearch / Kibana 0.x — log search gains traction |
| 2015 | Prometheus publicly announced by SoundCloud |
| 2016 | Prometheus joins CNCF; Prometheus 1.0 released |
| 2016 | OpenTracing specification created (CNCF) |
| 2017 | Jaeger (Uber's tracer) joins CNCF; v1.0 released |
| 2019 | OpenTelemetry announced (merger of OpenTracing and OpenCensus) |
| 2019 | Thanos and Cortex enable Prometheus long-term storage at scale |
| 2019 | Grafana Loki 1.0 GA (low-cost log aggregation) |
| 2020 | Google publishes "Monarch" monitoring system paper |
| 2020 | Grafana Tempo released (distributed tracing backend) |
| 2021 | OpenTelemetry tracing reaches stability; Collector widely adopted |
| 2022 | OTel metrics specification stable; eBPF observability tools mature |
| 2023 | OTel logs stable; profiling signal added to OTel spec |
| 2024 | Continuous profiling integration in major observability platforms |
Modern Relevance
Observability is the engineering discipline that enables operating complex distributed systems. As systems decompose into microservices, the blast radius of any individual failure shrinks but the diagnostic complexity grows — a 500ms p99 latency regression can originate in any of dozens of services. OpenTelemetry has effectively won the instrumentation standards war; building new systems without OTel integration is architectural debt.
eBPF-based observability (no instrumentation required, kernel-level visibility) is transforming the baseline: teams can now have production CPU flame graphs, network flow maps, and auto-generated service dependency maps without any application code changes. This changes the economics of observability significantly.
SLO-based alerting is replacing threshold alerting in mature organizations because it ties operational burden directly to user experience impact. Error budget burn rate alerts are now considered best practice.
File Map
23-observability/
├── 00-overview.md ← this file
├── 01-observability-theory.md ← pillars, cardinality, correlation, maturity model
├── 02-opentelemetry.md ← SDK, Collector, OTLP, context propagation
├── 03-prometheus-internals.md ← TSDB, scraping, PromQL, Thanos/Cortex
├── 04-grafana-stack.md ← Loki, Tempo, Mimir, alerting, dashboards
├── 05-log-aggregation.md ← Loki vs ELK, pipeline design, retention
├── 06-distributed-tracing.md ← span model, Jaeger/Zipkin/Tempo, sampling
├── 07-ebpf-observability.md ← Pixie, Hubble, Parca, continuous profiling
├── 08-continuous-profiling.md ← pprof, pyroscope, flame graphs over time
├── 09-sli-slo-sla.md ← definitions, error budgets, burn rate math
├── 10-alerting-design.md ← multiwindow alerting, symptom vs cause, fatigue
└── 11-observability-at-scale.md ← cardinality management, sampling, aggregation
Cross-References
- Section 17 (Distributed Systems): Logical clocks, causal ordering — trace propagation requires causality
- Section 22 (Kubernetes Internals): ServiceMonitor, PodMonitor CRDs; OTel Operator; service discovery
- Section 24 (Debugging): strace, perf, eBPF — low-level complement to observability
- Section 25 (Performance Engineering): Profiling integration; latency histograms; flame graph interpretation
- Section 28 (Reliability Engineering): SLO definitions, error budgets, incident response using observability data