02 — Prometheus Internals

Technical Overview

Prometheus is an open-source monitoring system and time-series database (TSDB). Development began at SoundCloud in 2012 as an open-source project, it was publicly announced in 2015, and in 2018 it became the second project to graduate from the CNCF, after Kubernetes. It follows a pull-based (scrape) model where Prometheus actively fetches metrics from instrumented targets rather than having targets push data. Its data model (a metric name plus an arbitrary set of key=value labels) enables flexible multi-dimensional querying via PromQL. Prometheus is the de facto standard metric collection and alerting system for Kubernetes-native workloads.

Prerequisites

Understanding of time-series data concepts
Familiarity with HTTP endpoints and service discovery
Basic exposure to Linux systems administration
Comfort with regular expressions for label matching

Core Content

Data Model

A Prometheus time-series is uniquely identified by:
- Metric name: e.g., http_requests_total
- Label set: e.g., {method="POST", handler="/api/v1/checkout", status="200"}

Each data point in a series is a sample, which adds:
- Timestamp: millisecond precision, int64
- Value: float64

A time-series is therefore all samples for one unique metric-name + label-set combination over time.

http_requests_total{method="POST", handler="/checkout", status="200"} 1027 1715869200000
http_requests_total{method="GET",  handler="/health",   status="200"}  542 1715869200000
http_requests_total{method="POST", handler="/checkout", status="500"}    3 1715869200000

Labels beginning with __ are reserved (internal Prometheus use). The __name__ label holds the metric name internally — up{job="api-server"} is stored as {__name__="up", job="api-server"}.

Metric Types

Prometheus client libraries expose four metric types, though the TSDB stores everything as flat time-series:

Counter: monotonically increasing value. Resets to zero on restart. Use rate() or increase() to make it meaningful. Example: http_requests_total. Never use a counter for a value that can go down.

Gauge: value that can go up or down. Current state snapshot. Example: memory_bytes_used, active_connections, queue_depth.

Histogram: samples observations into configurable buckets and exposes _bucket, _sum, and _count series. Use histogram_quantile() for approximate percentiles. Bucket boundaries are defined client-side. Buckets are cumulative: http_request_duration_seconds_bucket{le="0.1"} is the count of requests that completed in 100ms or less.

Summary: like a histogram, but quantiles (phi-quantiles) are calculated client-side over a sliding time window. Precomputed quantiles cannot be meaningfully aggregated across instances. Prefer histograms unless you need quantiles that do not depend on bucket boundaries and only have one instance.

# Histogram expands to:
http_request_duration_seconds_bucket{le="0.005"} 0
http_request_duration_seconds_bucket{le="0.01"}  0
http_request_duration_seconds_bucket{le="0.025"} 5021
http_request_duration_seconds_bucket{le="0.05"}  18032
http_request_duration_seconds_bucket{le="0.1"}   24054
http_request_duration_seconds_bucket{le="+Inf"}  24054  # total count
http_request_duration_seconds_sum  1204.3
http_request_duration_seconds_count 24054
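
The following is a minimal Python sketch of how a quantile can be estimated from cumulative buckets like the ones above, using linear interpolation inside the bucket that contains the target rank. It is a simplification for illustration, not Prometheus's implementation: it works on raw cumulative counts (in PromQL the input is normally a rate() over the _bucket series), and the function name and argument layout are assumptions.

```python
import math

def histogram_quantile(phi, buckets):
    """Estimate the phi-quantile from cumulative (upper_bound, count) buckets.

    Assumes buckets are sorted by upper bound, counts are cumulative,
    and the last bucket is +Inf.
    """
    if not 0 <= phi <= 1:
        raise ValueError("phi must be in [0, 1]")
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = phi * buckets[-1][1]
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    if count == below:
        return upper
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)

# Bucket values copied from the histogram example above.
buckets = [(0.005, 0), (0.01, 0), (0.025, 5021), (0.05, 18032),
           (0.1, 24054), (math.inf, 24054)]
p50 = histogram_quantile(0.5, buckets)   # lands inside the 0.025-0.05 bucket
p99 = histogram_quantile(0.99, buckets)  # lands inside the 0.05-0.1 bucket
```

Because the estimate interpolates within one bucket, its error is bounded by that bucket's width, which is why bucket boundaries should be chosen around the latencies you care about.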

Scrape Mechanism

PROMETHEUS SCRAPE → TSDB → PROMQL FLOW

  Target (service)            Prometheus Server            Querier
  ┌─────────────┐            ┌──────────────────────────┐
  │  /metrics   │ ←HTTP GET  │  Scrape Manager           │
  │  (text/open │ ─────────→ │  (per target, goroutine)  │
  │   metrics)  │            │          │                 │
  └─────────────┘            │          ▼                 │
                             │  Scrape Parser             │
                             │  (text → samples)          │
                             │          │                 │
                             │          ▼                 │
                             │  Head Block (in-memory)    │
                             │  ┌───────────────────────┐ │
                             │  │ memSeries[] + WAL     │ │  ←─ writes go here
                             │  └───────────┬───────────┘ │
                             │              │ compact      │
                             │              ▼ every 2h    │
                             │  Persistent Block          │  ←─ reads via
                             │  ┌─────────────────────┐   │     PromQL engine
                             │  │ chunks/ index  meta │   │
  PromQL Query ────────────────→│ tombstones          │◄──┘
                             │  └─────────────────────┘   │
                             └──────────────────────────┘

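The scrape loop for each target runs as an independent goroutine. On each scrape_interval tick (default 1m; 15s is a common setting, configurable globally and per job):
1. HTTP GET target_address/metrics
2. Parse the response (text/plain or OpenMetrics format)
3. Apply metric relabeling rules (metric_relabel_configs); target relabeling (relabel_configs) has already run during service discovery, before the scrape
4. Append the samples to the WAL (Write-Ahead Log)
5. Commit the samples to the in-memory head block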
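
To make the parse step concrete, here is a small Python sketch that turns simple text-format lines into (labels, value, timestamp) samples and attaches target labels. It only handles the simple line shape used in the Data Model examples (no escaped quotes, no "}" inside label values, no OpenMetrics extras), and the names parse_line and scrape_once are hypothetical. Letting target labels overwrite scraped labels is a simplification of what Prometheus does on conflicts.

```python
import re

# Matches simple lines such as: name{a="b", c="d"} 1027 1715869200000
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*"([^"]*)"')

def parse_line(line):
    """Parse one exposition line into (labels, value, timestamp_ms or None)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"cannot parse: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    labels["__name__"] = m.group("name")  # the metric name is just a label
    ts = int(m.group("ts")) if m.group("ts") else None
    return labels, float(m.group("value")), ts

def scrape_once(body, target_labels):
    """Parse a scrape body and attach the target's labels to every sample."""
    samples = []
    for line in body.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comment/metadata lines
        labels, value, ts = parse_line(line)
        samples.append(({**labels, **target_labels}, value, ts))
    return samples

body = '''# HELP http_requests_total Total HTTP requests.
http_requests_total{method="POST", handler="/checkout", status="200"} 1027 1715869200000
http_requests_total{method="GET",  handler="/health",   status="200"}  542 1715869200000
'''
samples = scrape_once(body, {"job": "api-server", "instance": "10.0.0.7:8080"})
```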

Service Discovery

Prometheus does not require static target lists. Service discovery (SD) mechanisms dynamically populate the target list:

Kubernetes SD (kubernetes_sd_configs): watches the K8s API for pods, endpoints, nodes, services. It discovers every object of the configured role; restricting scraping to pods with specific annotations is done with relabel_configs.
EC2 SD (ec2_sd_configs): queries AWS EC2 API, discovers instances by tags.
File SD (file_sd_configs): reads JSON/YAML files on disk. Other systems (Consul, Ansible) write target files; Prometheus watches for changes.
Consul SD: queries Consul service catalog.
DNS SD: SRV record lookups.

Standard pattern for Kubernetes scraping via pod annotations:

# On the target pod:
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "9090"
  prometheus.io/path: "/metrics"
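
These annotations are a convention, not something Prometheus interprets by itself: relabeling rules read them from __meta_kubernetes_pod_annotation_* labels and rewrite the target. The Python sketch below emulates a single "replace" relabel rule of the kind commonly used to rewrite __address__ from the port annotation. The function name is hypothetical, the replacement uses Python's \1 syntax rather than Prometheus's $1, and the regex shown does not handle IPv6 addresses.

```python
import re

def relabel_replace(labels, source_labels, regex, target_label,
                    replacement, separator=";"):
    """Emulate one relabel rule with action 'replace' (simplified)."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    m = re.fullmatch(regex, joined)  # relabel regexes are fully anchored
    if m is None:
        return dict(labels)          # no match: labels stay unchanged
    out = dict(labels)
    out[target_label] = m.expand(replacement)
    return out

discovered = {
    "__address__": "10.0.0.7:8080",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "9090",
}
rewritten = relabel_replace(
    discovered,
    source_labels=["__address__",
                   "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    regex=r"([^:]+)(?::\d+)?;(\d+)",
    target_label="__address__",
    replacement=r"\1:\2",
)

# A pod without the port annotation does not match and keeps its address.
unannotated = relabel_replace(
    {"__address__": "10.0.0.8:8080"},
    source_labels=["__address__",
                   "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    regex=r"([^:]+)(?::\d+)?;(\d+)",
    target_label="__address__",
    replacement=r"\1:\2",
)
```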

TSDB Internals

Prometheus's TSDB (originally Fabian Reinartz's tsdb library, now embedded) is a purpose-built time-series store that keeps each series' samples together in compressed chunks.

Head Block: The most recent ~2 hours of data lives in memory in the head block. It contains memSeries structs — one per active time-series — each holding a compressed chunk of recent samples. The head block is also backed by a WAL (Write-Ahead Log) on disk for crash recovery.

WAL: Samples are appended to the WAL before being committed to the in-memory head block. On restart, Prometheus replays the WAL to reconstruct the head block. WAL segments are fixed-size (128MB default), written sequentially for performance.
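
The write-ahead idea can be sketched in a few lines of Python: log the sample first, then update memory, and rebuild memory from the log on restart. This is a toy model only. The real WAL is a binary, segmented format, and the class name, JSON line format, and file path here are all assumptions made for illustration.

```python
import json
import os
import tempfile

class TinyHead:
    """Toy head block: append to a log file first, then update memory."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.series = {}  # series key -> list of (timestamp_ms, value)
        self._replay()
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self):
        """Rebuild in-memory state from the log after a restart."""
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                try:
                    key, ts, value = json.loads(line)
                except ValueError:
                    break  # torn final record: stop replaying here
                self.series.setdefault(key, []).append((ts, value))

    def append(self, key, ts, value):
        # 1. Durably log the sample, 2. then make it visible in memory.
        self._wal.write(json.dumps([key, ts, value]) + "\n")
        self._wal.flush()
        self.series.setdefault(key, []).append((ts, value))

    def close(self):
        self._wal.close()

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
head = TinyHead(wal_path)
head.append('up{job="api-server"}', 1715869200000, 1.0)
head.append('up{job="api-server"}', 1715869215000, 0.0)
head.close()

recovered = TinyHead(wal_path)  # simulated restart: state comes from the log
recovered.close()
```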

Chunks: Within a memSeries, samples are compressed with the scheme from Facebook's Gorilla paper: delta-of-delta encoding for timestamps and XOR encoding for values. The paper reports an average of 1.37 bytes per sample, and the Prometheus documentation cites 1 to 2 bytes per sample on average, down from 16 bytes uncompressed (8-byte timestamp + 8-byte float64).
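
A short Python sketch shows why this style of compression works so well on scraped data. Following the Gorilla paper, timestamps are encoded as delta-of-deltas and values as the XOR with the previous value. The sketch only computes those intermediate numbers and does not implement the bit-level encoding; the helper names are hypothetical.

```python
import struct

def delta_of_deltas(timestamps):
    """Second-order differences: zero whenever the interval is constant."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

def float_bits(x):
    """Raw IEEE 754 bit pattern of a float64 as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_with_previous(values):
    """XOR of each value's bits with its predecessor's bits."""
    bits = [float_bits(v) for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]

# Four scrapes exactly 15s apart: every delta-of-delta is zero.
timestamps = [1715869200000, 1715869215000, 1715869230000, 1715869245000]
dod = delta_of_deltas(timestamps)

# An unchanged value XORs to zero; a small change flips only a few bits.
xors = xor_with_previous([1027.0, 1027.0, 1028.0])
changed_bits = bin(xors[1]).count("1")
```

A zero delta-of-delta or a zero XOR can be stored in a single bit, and a small XOR needs only its few meaningful bits plus a short header, which is how the average drops far below 16 bytes per sample.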

Persistent Blocks: Every ~2 hours, the head block is compacted into a persistent block on disk. A block is a directory containing:

./data/01ABCDEF1234567890ABCDEF/
  chunks/  000001  # actual sample data
  index    # series labels → chunk file offsets
  meta.json  # block metadata (time range, stats)
  tombstones  # records deleted series

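Compaction: Prometheus periodically merges smaller blocks into larger ones to improve query performance and reduce file count. Block time ranges grow by a factor of 3 (2h, 6h, 18h, and so on), and adjacent blocks are merged once together they fill the next larger range. Compacted blocks span at most 10% of the retention time or 31 days, whichever is smaller.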

Retention: Default 15-day retention. Blocks outside the retention window are deleted wholesale. TSDB does not support partial block deletion (hence tombstones for deleted series).

TSDB Block Layout on Disk
.
├── 01GZTB...  (2h block, newest)
│   ├── chunks/
│   │   └── 000001
│   ├── index
│   ├── meta.json
│   └── tombstones
├── 01GZTA...  (2h block)
├── 01GZSX...  (6h compacted block)
├── 01GZQ0...  (18h compacted block)
├── 01GZM0...  (54h compacted block)
└── wal/
    ├── 00000001
    ├── 00000002
    └── checkpoint.000001/
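
Each block's meta.json records its time range and compaction level, so the layout above can be inspected programmatically. The Python sketch below summarises one parsed meta.json. The field names (ulid, minTime, maxTime, compaction.level) follow the block metadata format; the sample values, including the ULID, are made up for illustration.

```python
import json
from datetime import datetime, timezone

def describe_block(meta):
    """Summarise a block from its parsed meta.json (times are in ms)."""
    start = datetime.fromtimestamp(meta["minTime"] / 1000, tz=timezone.utc)
    end = datetime.fromtimestamp(meta["maxTime"] / 1000, tz=timezone.utc)
    return {
        "ulid": meta["ulid"],
        "start": start.isoformat(),
        "end": end.isoformat(),
        "hours": (meta["maxTime"] - meta["minTime"]) / 3_600_000,
        "level": meta["compaction"]["level"],
    }

# Hypothetical meta.json content for a freshly written 2h block.
raw = json.dumps({
    "ulid": "01HXEXAMPLE000000000000000",
    "minTime": 1715868000000,
    "maxTime": 1715875200000,
    "stats": {"numSamples": 1200000, "numSeries": 10000, "numChunks": 10000},
    "compaction": {"level": 1, "sources": ["01HXEXAMPLE000000000000000"]},
    "version": 1,
})
summary = describe_block(json.loads(raw))
```

A level above 1 or a span longer than two hours indicates that the block is the result of compaction.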

PromQL

PromQL is a functional query language for time-series. Key concepts:

Instant vector: a set of time-series with exactly one sample each (the most recent one), all evaluated at the same point in time.

http_requests_total{job="api-server"}

Range vector: all values within a time window for matching series. Used as input to rate functions.

http_requests_total{job="api-server"}[5m]

Key functions:

rate(counter[window]): per-second average rate of increase over the window. Handles resets automatically. Use for all counter metrics.

rate(http_requests_total{status=~"5.."}[5m])
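
The reset handling can be illustrated with a simplified Python sketch: any drop in a counter is treated as a restart from zero, so the post-reset value counts in full. Unlike the real rate(), this version does not extrapolate to the window boundaries, and the function names are hypothetical.

```python
def simple_increase(samples):
    """Total counter increase across (timestamp_s, value) samples.

    A drop in value is treated as a reset to zero, so the value seen
    after the reset is counted as new increase.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total

def simple_rate(samples):
    """Per-second average increase between the first and last sample."""
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    elapsed = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / elapsed

# The counter restarts between t=15 and t=30 (130 -> 10).
samples = [(0, 100.0), (15, 130.0), (30, 10.0), (45, 40.0)]
increase = simple_increase(samples)  # 30 + 10 + 30
rate = simple_rate(samples)
```

The sketch also shows why a window must contain at least two samples: with a 15s scrape interval, a [5m] window leaves comfortable room for a few failed scrapes.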

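increase(counter[window]): total increase over the window, equivalent to rate() multiplied by the window length in seconds. Because the result is extrapolated to the window boundaries, it can be non-integer even for integer counters.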

increase(http_requests_total[1h])

histogram_quantile(phi, le_histogram): approximate quantile from histogram buckets.

histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

irate(): instantaneous rate using last two samples only. More responsive to spikes, but noisier. Avoid for alerting; use rate().

absent(): returns 1 if no samples match. Used in "no data" alerts.

absent(up{job="critical-service"})

Aggregation operators:

# Sum across all instances, keeping job label
sum by (job) (rate(http_requests_total[5m]))

# 99th percentile latency by service
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

Recording Rules and Alerting Rules

Recording rules pre-compute expensive PromQL expressions and store results as new time-series. Essential for dashboard performance and for computing SLO metrics efficiently:

groups:
  - name: api_server_slo
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job, status) (rate(http_requests_total[5m]))
      - record: job:http_request_errors:rate5m_ratio
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

Alerting rules evaluate PromQL expressions and fire alerts to Alertmanager:

  - alert: HighErrorRate
    expr: job:http_request_errors:rate5m_ratio > 0.01
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.job }}"

The for clause requires the condition to be true for the specified duration before firing. This reduces false positives from transient spikes.
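
The effect of the for clause can be modelled as a small state machine: an alert is pending while the condition holds but the hold time has not elapsed, firing once it has, and inactive again as soon as one evaluation fails. The Python sketch below is a simplified model under those assumptions, with a hypothetical function name.

```python
def alert_states(evaluations, hold_seconds):
    """Map (timestamp_s, condition_true) evaluations to alert states."""
    states = []
    active_since = None
    for ts, condition in evaluations:
        if not condition:
            active_since = None  # one failed evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append("firing" if held >= hold_seconds else "pending")
    return states

# Evaluations every 60s with "for: 5m". The dip at t=120 resets the timer,
# so the alert only fires at t=480, five minutes after t=180.
evaluations = [(0, True), (60, True), (120, False), (180, True), (240, True),
               (300, True), (360, True), (420, True), (480, True)]
states = alert_states(evaluations, hold_seconds=300)
```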

Alertmanager

Alertmanager receives alerts from Prometheus (and other sources) and handles:
- Deduplication: identical alerts firing from multiple Prometheus replicas are deduplicated
- Grouping: related alerts are grouped into single notifications (group_by: [alertname, cluster])
- Routing: different alert groups go to different receivers (PagerDuty for critical, Slack for warning)
- Inhibition: if a cluster-level alert fires, inhibit all service-level alerts in that cluster
- Silencing: manually suppress alerts during maintenance windows
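
Deduplication and grouping both come down to comparing label sets. The Python sketch below is a simplified model, not Alertmanager's code: alerts are plain label dictionaries, duplicates are alerts with identical labels, and groups are keyed by the group_by labels. Note the assumption that both replicas send identical labels; a replica-specific external label would defeat deduplication unless it is dropped before the alert is sent.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Deduplicate alerts by full label set, then bucket them by group_by."""
    unique = {tuple(sorted(a.items())): a for a in alerts}
    groups = defaultdict(list)
    for alert in unique.values():
        key = tuple(alert.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    # The same alert, sent by two Prometheus replicas.
    {"alertname": "HighErrorRate", "cluster": "eu", "job": "api"},
    {"alertname": "HighErrorRate", "cluster": "eu", "job": "api"},
    # A related alert that should share a notification with the first.
    {"alertname": "HighErrorRate", "cluster": "eu", "job": "checkout"},
    # Same alert name in another cluster: separate group.
    {"alertname": "HighErrorRate", "cluster": "us", "job": "api"},
]
groups = group_alerts(alerts, group_by=["alertname", "cluster"])
```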

Remote Storage for Long-Term Retention

Prometheus's local TSDB is not designed for long-term storage (>2 weeks) or global aggregation. Remote storage solutions:

Thanos: adds sidecar to Prometheus, ships blocks to object storage (S3/GCS/Azure). Provides global query across multiple Prometheus instances.
Cortex: horizontally scalable, multi-tenant Prometheus. Each Prometheus remote-writes to Cortex's ingest path.
Mimir (Grafana Labs): fork/successor of Cortex with architectural improvements. Preferred for new deployments.
VictoriaMetrics: Prometheus-compatible storage and query backend that advertises better compression and performance.

Prometheus (remote_write) ──→ Thanos Receive / Cortex / Mimir
                                      │
                                      ▼
                            Object Storage (S3)
                                      │
                                      ▼
                            Thanos Querier / Mimir Query-frontend
                                      │
                              Global PromQL queries

Historical Context

Prometheus was inspired by Google's Borgmon monitoring system (described in the Google SRE book). Matt Proud and Julius Volz built the initial version at SoundCloud in 2012. It was designed from the start around Go's concurrency model and the pull-based scrape architecture. The choice of pull over push was deliberate: pull makes it easy to detect dead targets (up metric goes to 0), avoids having targets know the monitoring endpoint, and allows rate-limiting naturally.

The TSDB was rewritten by Fabian Reinartz for Prometheus 2.0 (2017). The original storage kept one file per time-series, which coped poorly with series churn and caused significant write amplification; the new TSDB groups data into blocks per time period. The 2.0 release announcement reported disk space usage falling to roughly 33-50% of Prometheus 1.8, along with much lower CPU and disk I/O. Reinartz published the design in "Writing a Time Series Database from Scratch" (2017), which remains the canonical reference.

Production Examples

# Check what Prometheus is currently scraping
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastScrape: .lastScrape}'

# Query the top 5 jobs by request rate
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=topk(5, sum by (job) (rate(http_requests_total[5m])))' | jq .

# Inspect TSDB statistics
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.headStats'

# Find series with highest cardinality
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName | sort_by(.value) | reverse | .[0:10]'

# Check WAL size on disk
du -sh /var/lib/prometheus/wal/
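
When these API calls move from the shell into a script, the response needs to be unpacked. The Python sketch below builds an instant-query URL and parses a response body of the shape returned by /api/v1/query. The helper names are hypothetical, the canned body stands in for a live server, and only the "vector" result type is handled.

```python
import json
from urllib.parse import urlencode

def build_query_url(base_url, promql):
    """URL for an instant query against the HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

def parse_instant_vector(body):
    """Return [(labels, value)] from an instant-query JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    # Each result carries its labels and a [timestamp, "value"] pair;
    # the value is a string so that NaN and Inf survive JSON encoding.
    return [(r["metric"], float(r["value"][1])) for r in data["result"]]

url = build_query_url("http://localhost:9090", "up")

# Canned response in the documented shape, used instead of a live server.
canned = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "api-server"},
             "value": [1715869200.0, "1"]},
            {"metric": {"__name__": "up", "job": "worker"},
             "value": [1715869200.0, "0"]},
        ],
    },
})
down_jobs = [m["job"] for m, v in parse_instant_vector(canned) if v == 0]
```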

A production recording rule for SLO error budget consumption:

# 5-minute error rate (pre-computed for dashboard and alerting)
- record: slo:http_errors:rate5m
  expr: |
    sum by (service, env) (rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum by (service, env) (rate(http_requests_total[5m]))

# 1-hour error ratio for the fast-burn alert (burn rate = this ratio / error budget)
- record: slo:http_errors:burnrate1h
  expr: |
    sum by (service, env) (rate(http_requests_total{status=~"5.."}[1h]))
    /
    sum by (service, env) (rate(http_requests_total[1h]))
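
Burn rate is conventionally the observed error ratio divided by the error budget (1 minus the SLO target), so the ratios recorded above still need that division before they are compared with a burn-rate threshold. The Python sketch below works through the arithmetic; the 99.9% SLO and 30-day period are assumed values, and the function names are hypothetical.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo
    return error_ratio / budget

def budget_consumed(rate, window_hours, period_days=30):
    """Fraction of the period's error budget spent during the window."""
    return rate * window_hours / (period_days * 24)

# With a 99.9% SLO the budget is 0.1% of requests.
fast_burn = burn_rate(error_ratio=0.0144, slo=0.999)   # about 14.4x
spent_in_1h = budget_consumed(fast_burn, window_hours=1)  # about 2% of budget
```

At a sustained burn rate of 1 the budget lasts exactly the full period, which is why thresholds well above 1 are used for paging alerts.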

Debugging Notes

"No data" on a panel: First check up{job="your-job"}. If it returns 0, the target was discovered but the scrape is failing; if it returns nothing, the target was never discovered. Check /targets in the Prometheus UI for the last error message. Common causes: wrong port, missing service discovery annotation, network policy blocking the target's metrics port.

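Staleness vs. gaps: After a target disappears or a series stops being exposed, Prometheus appends a stale marker (a special NaN value) to the affected series, so instant queries stop returning it immediately. PromQL functions like rate() handle stale markers correctly. Without a marker, a query returns the most recent sample within the lookback window (5 minutes by default); gaps appear in graphs when no sample falls inside that window, for example after several failed scrapes.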
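
The stale marker is distinguishable from an ordinary NaN only by its bit pattern, which the Python sketch below makes visible. The constant is the pattern Prometheus reserves for staleness; the helper names are hypothetical, and the comparison is done on raw bits because NaN values never compare equal as floats.

```python
import math
import struct

# Bit pattern reserved for staleness markers (a NaN with a specific payload).
STALE_NAN_BITS = 0x7FF0000000000002

def float_to_bits(x):
    """Raw IEEE 754 bit pattern of a float64 as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def is_nan_bits(bits):
    """True for any NaN: exponent all ones and a non-zero mantissa."""
    return (bits >> 52) & 0x7FF == 0x7FF and bits & ((1 << 52) - 1) != 0

def is_stale_bits(bits):
    """True only for the exact staleness pattern, not for other NaNs."""
    return bits == STALE_NAN_BITS

ordinary_nan_bits = float_to_bits(math.nan)
```

This is why a target that legitimately exports NaN as a value is not treated as stale.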

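Counter resets: rate() and increase() treat any decrease in a counter as a reset (for example a pod restart) and compensate for it, so they never return negative values. The flip side is that a counter that decreases for any other reason, such as two backends alternating behind one scrape address, is seen as a series of resets and produces large spurious spikes. If a counter-based graph shows unexplained spikes, inspect the raw samples for decreases.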

High cardinality diagnosis:

# Find the top cardinality offenders
curl -s http://localhost:9090/api/v1/status/tsdb | \
  jq '.data.seriesCountByLabelValuePair | sort_by(.value) | reverse | .[0:20]'

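Alertmanager routing not working: Validate the configuration with amtool check-config alertmanager.yml, then check which receiver a given label set reaches with amtool config routes test --config.file=alertmanager.yml severity=critical. To send a test alert through a running Alertmanager: amtool alert add alertname="TestAlert" severity="critical".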

Security Implications

Prometheus /metrics endpoints expose internal application state. They should not be publicly accessible. Use network policies or NGINX auth proxies.
Prometheus had no built-in authentication before v2.24; later versions support TLS and basic auth through --web.config.file. The --web.enable-admin-api flag exposes delete endpoints; it is off by default, so leave it off in production unless needed.
Alertmanager webhook receivers can receive arbitrary JSON. Validate webhook payloads; don't execute shell commands from them.
Remote write to Thanos/Mimir should use TLS and token authentication. Prometheus remote write supports bearer_token, tls_config, and basic_auth.

Performance Implications

Prometheus is designed for single-machine operation. Memory scales with number of active time-series: ~3KB per series in the head block. 2M series ≈ 6GB RAM.
Query performance on the head block is fast (in-memory). Queries spanning historical blocks require disk I/O — use recording rules for queries that run frequently.
rate() over a long window (e.g., [24h]) on many series is expensive. Pre-compute with recording rules.
histogram_quantile() is expensive at high cardinality. Use sum by (le) to aggregate before passing to histogram_quantile().
Scraping 10,000 targets at a 15s interval generates ~667 scrapes/second. Each target has its own scrape goroutine, and every scrape costs an HTTP request plus a parse. Treat this as a rough ceiling for a single server; beyond it, shard or federate.
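
The sizing figures above are simple arithmetic, sketched here in Python so they can be re-run with your own numbers. The ~3KB per series figure is the rule of thumb quoted in this section, not a measured constant, and the function names are hypothetical.

```python
def head_memory_gb(active_series, bytes_per_series=3_000):
    """Rough head-block memory estimate in GB (decimal)."""
    return active_series * bytes_per_series / 1e9

def scrapes_per_second(targets, scrape_interval_s):
    """Average scrape rate when scrapes are spread over the interval."""
    return targets / scrape_interval_s

memory_gb = head_memory_gb(2_000_000)          # 2M series at ~3KB each
scrape_rate = scrapes_per_second(10_000, 15)   # 10,000 targets every 15s
```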

Failure Modes and Real Incidents

Cardinality explosion from a bad deployment: A team deployed code that added a user_id label to HTTP metrics. Within minutes, Prometheus was OOM-killed as it tried to create millions of new series. Fix: cap the number of samples accepted per scrape (sample_limit in the scrape config, or enforcedSampleLimit when using the Prometheus Operator), and add CI checks for new label dimensions.
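
The reason one label can take down a server is multiplicative: the worst-case series count for a metric is the product of the distinct values of each label. The Python sketch below shows the arithmetic with made-up label counts; real metrics usually stay below this upper bound because not every combination occurs.

```python
from math import prod

def max_series(label_value_counts):
    """Upper bound on series for one metric: product of label cardinalities."""
    return prod(label_value_counts.values())

# Hypothetical label cardinalities before and after the bad deployment.
before = {"method": 5, "handler": 20, "status": 10}
after = {**before, "user_id": 100_000}

series_before = max_series(before)
series_after = max_series(after)
```

A CI check along these lines, fed with expected label cardinalities, is one way to catch an unbounded label before it reaches production.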

Split-brain in Prometheus HA: Running two Prometheus replicas in HA mode (scraping the same targets) produces duplicate alerts. Without Alertmanager deduplication correctly configured, on-call engineers receive double pages. Always configure --cluster.* flags on Alertmanager for HA, and set group_wait + repeat_interval carefully.

WAL corruption on unclean shutdown: If a Prometheus instance is killed with SIGKILL (not SIGTERM), the last WAL record may be partially written. WAL records are checksummed, so on startup Prometheus detects the damage during replay and repairs the WAL by discarding everything from the first corrupt record onward. WAL compression (--storage.tsdb.wal-compression, snappy, enabled by default since v2.20) reduces WAL size; it is not a corruption safeguard.

Modern Usage

Prometheus Operator (Kubernetes): manages Prometheus and Alertmanager deployments declaratively via CRDs (ServiceMonitor, PrometheusRule). Standard in production Kubernetes.
Prometheus Agent Mode (v2.32+): scrape-and-forward-only mode. No queryable local TSDB, only a WAL that buffers samples for remote write. Designed for edge/sidecar deployments where you only need to forward metrics to a central store.
Native histograms (experimental since v2.40): a new histogram type that uses exponentially spaced buckets at a configurable resolution instead of fixed user-defined buckets, and stores the whole histogram as a single series. Better quantile accuracy without bucket configuration, and fewer series than classic histograms.

Future Directions

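Prometheus 3.0 (released November 2024): major cleanup, a new web UI, UTF-8 metric and label names, and an improved OTLP ingestion endpoint. Native histograms were still opt-in at the 3.0 release rather than the default.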
OTel metrics replacing Prometheus SDK: The OpenTelemetry metrics SDK can export in Prometheus format; increasingly, teams instrument once with OTel and scrape with Prometheus.
Remote write 2.0: more efficient binary framing, support for native histograms, better compression.

Exercises

Cardinality audit: Deploy a Prometheus instance scraping a test application. Use the /api/v1/status/tsdb endpoint to identify the top-5 highest-cardinality metrics. Write a PromQL query that counts unique series per job.
TSDB inspection: On a running Prometheus instance, inspect the block directory structure at /var/lib/prometheus/. Identify which block covers which time range by reading meta.json. Explain why some blocks are larger than 2 hours.
Recording rule performance: Create a recording rule for histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))). Measure query time before and after the recording rule via the /api/v1/query API and compare.
Multi-window burn rate alert: Implement a multi-window, multi-burn-rate alerting rule for a 99.9% SLO on http_requests_total. Use the two-window approach from the Google SRE Workbook: 1h window with 14.4x burn rate AND 5m window with 14.4x burn rate for fast burn; 6h window with 6x burn rate AND 30m window with 6x burn rate for slow burn.
Service discovery: Configure Prometheus to discover targets from a local JSON file using file_sd_configs. Dynamically add and remove a target from the JSON file and observe Prometheus's scrape_duration_seconds and up metrics update accordingly.

References

Prometheus documentation: https://prometheus.io/docs/
Reinartz, Fabian. "Writing a Time Series Database from Scratch." 2017. https://fabxc.org/tsdb/
Beyer, Betsy et al. Site Reliability Engineering. Chapter 10 (Practical Alerting).
Brazil, Brian. "Prometheus: Up & Running." O'Reilly, 2018.
Slott, Brian. "Prometheus Storage: Technical Terms for the TSDB." PromCon 2017.
Google SRE Workbook. Chapter 5 (Alerting on SLOs). https://sre.google/workbook/alerting-on-slos/