05 — Logging Systems

Technical Overview

Logging is the oldest and most universal form of application observability: writing a timestamped record of events to a persistent stream. Despite being the most familiar tool, logging systems at scale involve complex infrastructure challenges — throughput (terabytes per day), structured ingestion, cost management, compliance, and real-time search. The progression from grep-ing individual log files to distributed log aggregation pipelines with real-time search represents one of the more significant infrastructure evolutions in production engineering.

Prerequisites

Familiarity with syslog and basic log formats
Understanding of message queues at a conceptual level
Basic Linux systems administration
Familiarity with Elasticsearch or similar search systems is helpful but not required

Core Content

Structured Logging

The fundamental divide in logging is structured vs unstructured. Unstructured logs are human-composed strings:

2024-01-15 14:32:01 INFO  Processing order 12345 for user john@example.com: total $99.99
2024-01-15 14:32:02 ERROR Failed to charge card for order 12345: card declined

These are readable by humans but require regex parsing to extract fields. The regex breaks when message formats change. Structured logs are machine-native records:

{"ts":"2024-01-15T14:32:01Z","level":"INFO","msg":"order.processing","order_id":"12345","user_id":"u-789","amount":99.99,"currency":"USD"}
{"ts":"2024-01-15T14:32:02Z","level":"ERROR","msg":"payment.failed","order_id":"12345","reason":"card_declined","gateway":"stripe","span_id":"a3ce929d0e0e4736"}

Structured logs enable: exact-match field queries (find all logs where order_id="12345"), aggregations (count errors by reason), metric extraction (derive payment_failure_rate from log fields), and correlation via span_id or trace_id fields.
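The query patterns above can be sketched in a few lines of Python, using the two sample records from this section. The helper names (parse, by_field, count_errors_by_reason) are hypothetical and only illustrate the idea; a real log store runs the same operations against an index rather than a list.

```python
import json
from collections import Counter

# The two structured sample lines from this section.
LOG_LINES = [
    '{"ts":"2024-01-15T14:32:01Z","level":"INFO","msg":"order.processing",'
    '"order_id":"12345","user_id":"u-789","amount":99.99,"currency":"USD"}',
    '{"ts":"2024-01-15T14:32:02Z","level":"ERROR","msg":"payment.failed",'
    '"order_id":"12345","reason":"card_declined","gateway":"stripe",'
    '"span_id":"a3ce929d0e0e4736"}',
]


def parse(lines):
    """Decode one JSON object per line; no regex is involved."""
    return [json.loads(line) for line in lines]


def by_field(records, field, value):
    """Exact-match field query, e.g. order_id == "12345"."""
    return [r for r in records if r.get(field) == value]


def count_errors_by_reason(records):
    """Aggregation: number of ERROR records per 'reason' value."""
    return Counter(
        r.get("reason", "unknown") for r in records if r.get("level") == "ERROR"
    )


if __name__ == "__main__":
    records = parse(LOG_LINES)
    print(len(by_field(records, "order_id", "12345")))
    print(dict(count_errors_by_reason(records)))
```

Because every field is addressed by name, adding a new field to the log line does not break either query, which is the main advantage over regex extraction.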

Best practices for structured logging:
- Include trace_id and span_id in every log entry for trace correlation
- Use consistent field names across all services (always user_id, not userId in one service and uid in another)
- Never interpolate user data into the msg field; put it in dedicated fields
- Choose a standard format, JSON (universal) or logfmt (key=value pairs), and emit it through a structured logging library (Go's slog, Java's Logback with JSON layout, Python's structlog)
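As a minimal sketch of these practices, the following uses only Python's standard logging module to emit one JSON object per line. The JsonFormatter class, the make_logger helper, and the CONTEXT_FIELDS allow-list are assumptions made for illustration; libraries such as structlog provide a more complete version of the same idea.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single-line JSON object."""

    # Hypothetical allow-list of context fields copied from the record.
    CONTEXT_FIELDS = ("trace_id", "span_id", "order_id", "user_id", "reason")

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "ts": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "msg": record.getMessage(),  # a stable event name, no user data
        }
        # Context goes into dedicated fields, never into msg.
        for name in self.CONTEXT_FIELDS:
            if hasattr(record, name):
                entry[name] = getattr(record, name)
        return json.dumps(entry)


def make_logger(name: str) -> logging.Logger:
    """Build a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger("checkout")
    log.error(
        "payment.failed",
        extra={"order_id": "12345", "reason": "card_declined",
               "span_id": "a3ce929d0e0e4736"},
    )
```

Passing context through extra keeps msg constant per event type, so queries and dashboards can group on it reliably.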

Log Levels

Log levels indicate severity and intended audience:

DEBUG: diagnostic information for developers. Should be disabled in production or dynamically enabled only during active debugging. High volume. Example: {"msg":"cache.lookup","key":"session:u-789","result":"miss"}
INFO: normal operational events. Key business actions, lifecycle events. Should be low-to-medium volume. Example: {"msg":"order.created","order_id":"12345"}
WARN: unexpected condition that doesn't require immediate action but indicates potential problems. Example: {"msg":"retry.attempt","attempt":3,"max":5,"reason":"timeout"}
ERROR: an operation failed. Requires attention. Should always include error details and context. Example: {"msg":"payment.failed","error":"card_declined","order_id":"12345"}
FATAL: process cannot continue, about to exit. Use sparingly. Triggers immediate alerting. Example: {"msg":"db.connection_pool_exhausted","max_connections":100}

Production rule: INFO should be the minimum level in production for most services. DEBUG should be dynamically toggleable without restart (via an HTTP endpoint or configuration reload). Never leave DEBUG enabled on hot paths; the extra volume will saturate your pipeline, especially during incidents.
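A runtime level toggle can be quite small. The sketch below assumes a hypothetical set_level helper that an admin HTTP handler or a configuration-reload hook would call; the transport is left out, and the level names accepted are an assumption for this example.

```python
import logging

# Levels this sketch allows callers to select (an assumption, not a standard).
ALLOWED_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def set_level(logger: logging.Logger, level_name: str) -> str:
    """Change the logger's level at runtime and return the previous level name.

    Intended to be called from an admin endpoint or config reload,
    so no process restart is needed.
    """
    key = level_name.strip().upper()
    if key not in ALLOWED_LEVELS:
        raise ValueError(f"unsupported log level: {level_name!r}")
    previous = logging.getLevelName(logger.getEffectiveLevel())
    logger.setLevel(ALLOWED_LEVELS[key])
    return previous


if __name__ == "__main__":
    log = logging.getLogger("checkout")
    log.setLevel(logging.INFO)
    old = set_level(log, "debug")   # enable DEBUG while investigating
    set_level(log, old)             # restore the previous level afterwards
```

Returning the previous level makes it easy to restore the original setting once debugging ends, which helps avoid DEBUG being left on by accident.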

Log Aggregation Pipeline

LOG AGGREGATION PIPELINE

  Application Pods                Node/Collector Layer          Aggregation/Storage

  ┌─────────────────┐            ┌─────────────────┐
  │ app container   │ stdout/    │                 │
  │ (JSON to stdout)│─stderr───→ │ Fluent Bit      │
  └─────────────────┘            │ (DaemonSet)     │
  ┌─────────────────┐            │                 │         ┌──────────────────┐
  │ app container   │ stdout ──→ │ - tail /var/log │──────→  │ Kafka / Kinesis  │
  └─────────────────┘            │ - parse JSON    │         │ (buffer, replay) │
  ┌─────────────────┐            │ - add k8s meta  │         └────────┬─────────┘
  │ app container   │ stdout ──→ │ - forward       │                  │
  └─────────────────┘            └─────────────────┘                  │
                                                           ┌───────────▼──────────┐
  System Logs                    ┌─────────────────┐       │  Fluentd / Logstash  │
  ┌─────────────────┐            │                 │       │  (enrichment,        │
  │ /var/log/syslog │──────────→ │ Vector / Fluent │──────→│   filtering,         │
  │ /var/log/kernel │            │ Bit (systemd)   │       │   routing)           │
  └─────────────────┘            └─────────────────┘       └───────────┬──────────┘
                                                                        │
                                                           ┌────────────▼─────────┐
                                                           │   Storage & Index    │
                                                           │  ┌─────────────────┐ │
                                                           │  │ Elasticsearch   │ │
                                                           │  │ (full-text +    │ │
                                                           │  │  field search)  │ │
                                                           │  └─────────────────┘ │
                                                           │  ┌─────────────────┐ │
                                                           │  │ Grafana Loki    │ │
                                                           │  │ (label-indexed, │ │
                                                           │  │  S3 storage)    │ │
                                                           │  └─────────────────┘ │
                                                           │  ┌─────────────────┐ │
                                                           │  │ S3/GCS (cold)   │ │
                                                           │  │ (Parquet/ORC    │ │
                                                           │  │  for Athena/BQ) │ │
                                                           │  └─────────────────┘ │
                                                           └──────────────────────┘

ELK Stack

The ELK stack (Elasticsearch + Logstash + Kibana) was the dominant open-source log aggregation solution from roughly 2014 to 2020.

Elasticsearch: distributed search and analytics engine. Inverted index on all fields by default. Supports full-text search, field queries, aggregations, and near-real-time indexing. High cost: the on-disk index is often comparable to or larger than the raw log volume, and each replica multiplies it again. At 1 TB/day of logs, even with short retention, an ES cluster requires 10+ TB of storage and 8-16 nodes.
Logstash: JVM-based log processing pipeline (input → filter → output). Can parse, enrich, and route logs. Heavy: ~1-2GB heap. Being replaced by lighter alternatives.
Kibana: web UI for Elasticsearch. Dashboard, query, visualization. Also provides machine learning (anomaly detection), APM (lightweight tracing), and SIEM features.

The "B" in variant names such as "BELK" refers to Beats (Elastic's lightweight shippers: Filebeat for logs, Metricbeat for metrics), which can replace Logstash for collection.

EFK Stack

EFK (Elasticsearch + Fluentd + Kibana) replaces Logstash with Fluentd. Fluentd is lighter (Ruby-based, ~40MB RAM), has a broad plugin ecosystem for cloud-native environments, and is a CNCF project. Many Kubernetes production deployments use Fluentd or Fluent Bit in the EFK pattern.

Grafana Loki

Loki (released 2018, Grafana Labs) is designed to be "Prometheus but for logs." Its key design choices make it dramatically cheaper than Elasticsearch:

Label-indexed only: Loki indexes only a small set of metadata labels (not the log content). This is analogous to Prometheus's label model. Labels are low-cardinality: {app="checkout", env="production", region="us-east-1"}.
Chunk storage: log content is stored in compressed chunks in object storage (S3/GCS). No full-text index on content.
LogQL for queries: functional query language (like PromQL). Queries filter by labels first, then apply content-based filters:

# All lines containing "ERROR" from the checkout service (the time range, e.g. last 5 minutes, is set by the client)
{app="checkout", env="production"} |= "ERROR"

# Parse JSON and filter by field
{app="checkout"} | json | order_status="failed" | line_format "{{.order_id}} {{.reason}}"

# Count error rate per service
sum by (app) (rate({env="production"} |= "ERROR" [1m]))

# Extract a metric from logs (p99 latency per service)
quantile_over_time(0.99, {app="checkout"} | json | unwrap response_time_ms [5m]) by (app)
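The "labels first, then content filter" evaluation order can be illustrated with a toy in-memory model. This is only a sketch of the idea, not Loki's implementation; ToyLabelIndex and its methods are hypothetical names.

```python
from collections import defaultdict


class ToyLabelIndex:
    """Index streams by their label set; log content itself is not indexed."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of raw log lines
        self._streams = defaultdict(list)

    def push(self, labels: dict, line: str) -> None:
        """Append a line to the stream identified by its full label set."""
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector: dict, contains: str = "") -> list:
        """Roughly {selector} |= "contains": prune by labels, then scan."""
        wanted = set(selector.items())
        results = []
        for stream_labels, lines in self._streams.items():
            if not wanted <= stream_labels:
                continue  # the label index skips this whole stream
            # Content filtering is a brute-force scan of the stream's lines.
            results.extend(line for line in lines if contains in line)
        return results


if __name__ == "__main__":
    idx = ToyLabelIndex()
    idx.push({"app": "checkout", "env": "production"}, "ERROR payment failed")
    idx.push({"app": "search", "env": "production"}, "ERROR timeout")
    print(idx.query({"app": "checkout"}, "ERROR"))
```

The sketch makes the cost trade-off visible: a narrow label selector skips most streams cheaply, while a broad selector forces a scan over every matching line.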

Cost comparison (rough): Elasticsearch at 1 TB/day of ingestion requires ~$5,000-15,000/month in hardware or cloud costs. Loki for the same volume using S3 storage costs ~$500-1,500/month (S3 + query CPU). Loki is 5-10x cheaper because it has no full-text inverted index.

Loki architectural components:
- Distributor: receives inbound log streams, validates, fans out to ingesters
- Ingester: buffers recent logs in memory (chunks), flushes to object storage
- Querier: executes LogQL queries, reads from ingesters (recent) and object storage (historical)
- Compactor: compacts small object storage files into larger ones for query efficiency
- Ruler: evaluates LogQL alerting rules

Fluentd vs Fluent Bit

Both are log shippers from the CNCF Fluentd project, which originated at Treasure Data:

Fluent Bit: written in C. ~1MB binary, ~20-50MB RAM. Designed for edge/container/embedded. Ships logs from source to destination with basic filtering and parsing. Runs as DaemonSet in Kubernetes. Excellent for log collection from pods.

Fluentd: written in Ruby. ~40-100MB RAM. Full-featured log aggregation server. Rich plugin ecosystem (500+ plugins). Suitable as a central aggregation tier: receive from multiple Fluent Bit agents, enrich/transform, route to multiple backends.

Production pattern: Fluent Bit DaemonSet collects from pods → forwards to central Fluentd deployment → Fluentd routes to Elasticsearch + Loki + S3 cold storage simultaneously.

# Fluent Bit DaemonSet config (kubernetes.conf)
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
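    # The docker parser expects Docker's JSON log format; on containerd or
    # CRI-O nodes, use the cri parser instead.
    Parser            docker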
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     50MB
    Skip_Long_Lines   On
    Refresh_Interval  10

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log           On
    Keep_Log            Off
    K8S-Logging.Parser  On

[OUTPUT]
    Name  forward
    Match *
    Host  fluentd.logging.svc.cluster.local
    Port  24224

Log Retention and Cost Management

Log retention is a significant cost driver. Strategies:

Tiered storage: hot (7-30 days in Elasticsearch/Loki for fast queries), warm (30-90 days in slower storage or S3 + Athena), cold (1-7 years in S3 Glacier or equivalent for compliance).
Log sampling: for very high-volume, low-value logs (health check endpoints, static asset requests), sample 1-10%. Emit a metric for the dropped count.
Aggregation before storage: for repetitive logs (same error message repeated 1000 times/second), aggregate client-side and log once with a count field.
Log levels in production: disable DEBUG in production. INFO-only reduces volume significantly.

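Cost example at 500 GB/day of raw logs (rough figures; actual costs depend on compression, replicas, and pricing):
- Elasticsearch (hot 30 days): ~15 TB storage before replicas, roughly $2,500-7,500/month (cloud), scaling down the 1 TB/day estimate given earlier
- Loki (hot 7 days + S3 cold): ~500 GB hot + 14 TB S3, ~$300/month total
- S3 + Athena (cold only): ~15 TB S3; about $345/month at roughly $0.023/GB-month uncompressed, or about $35-70/month with 5-10x compression, plus query costs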
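The storage arithmetic behind estimates like these can be written out explicitly. The sketch below is a back-of-the-envelope model only: the helper names are hypothetical, and the price constant is an assumed object-storage list price of about $0.023 per GB-month that varies by region and storage class. Compute, replica, and query costs are not modeled.

```python
# Assumption: approximate object-storage list price in USD per GB-month.
ASSUMED_PRICE_PER_GB_MONTH = 0.023


def stored_gb(daily_gb: float, retention_days: int,
              compression_ratio: float = 1.0) -> float:
    """Steady-state data volume: daily ingest x retention / compression."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return daily_gb * retention_days / compression_ratio


def monthly_storage_cost(gb: float, price_per_gb_month: float) -> float:
    """Storage-only monthly cost for a steady-state volume."""
    return gb * price_per_gb_month


if __name__ == "__main__":
    raw = stored_gb(500, 30)                             # 15,000 GB uncompressed
    packed = stored_gb(500, 30, compression_ratio=10)    # 1,500 GB at 10x
    for label, gb in (("raw", raw), ("10x compressed", packed)):
        cost = monthly_storage_cost(gb, ASSUMED_PRICE_PER_GB_MONTH)
        print(f"{label}: {gb:,.0f} GB, about ${cost:,.2f}/month")
```

The compression ratio dominates the result, so it is worth measuring on real log samples before committing to a budget.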

Log Sampling

Sampling reduces the volume of logs written while preserving statistical meaning. Implementations:
- Rate limiting: log at most N events per second for a given log site
- Probabilistic sampling: log each event with probability 1/N (a deterministic variant logs every N-th occurrence)
- Adaptive sampling: raise the sampling rate when the error rate increases, lower it during normal operation
- Head-of-line deduplication: if the same log message appears 100 times in 1 second, log once with count=100 and first_seen/last_seen timestamps
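Two of these techniques, probabilistic sampling and deduplication, can be sketched as follows. The class names and the flush-per-window design are assumptions for illustration; production loggers usually build this into the logging library or the collector.

```python
import random


class ProbabilisticSampler:
    """Keep each event with probability `rate` and count what was dropped."""

    def __init__(self, rate: float, rng: random.Random = None):
        if not 0.0 < rate <= 1.0:
            raise ValueError("rate must be in (0, 1]")
        self.rate = rate
        self.rng = rng or random.Random()
        self.dropped = 0  # export this as a metric so the volume stays known

    def should_log(self) -> bool:
        if self.rng.random() < self.rate:
            return True
        self.dropped += 1
        return False


class Deduplicator:
    """Collapse repeats of a message within one window into a single entry."""

    def __init__(self):
        self._seen = {}

    def add(self, msg: str, ts: float) -> None:
        entry = self._seen.get(msg)
        if entry is None:
            self._seen[msg] = {"msg": msg, "count": 1,
                               "first_seen": ts, "last_seen": ts}
        else:
            entry["count"] += 1
            entry["last_seen"] = ts

    def flush(self) -> list:
        """Return one entry per distinct message and start a new window."""
        out = list(self._seen.values())
        self._seen = {}
        return out


if __name__ == "__main__":
    sampler = ProbabilisticSampler(rate=0.1)
    kept = sum(sampler.should_log() for _ in range(1000))
    print(kept, sampler.dropped)
```

Calling flush on a timer (for example once per second) gives the count, first_seen, and last_seen fields described above.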

PII in Logs

Personally Identifiable Information (PII) in logs creates compliance and security risks:

GDPR (EU): personal data must be processed lawfully, stored only as long as necessary, and accessible only to authorized personnel. Right to erasure ("right to be forgotten") requires the ability to delete a user's data from logs — nearly impossible without log-line-level deletion capability.

HIPAA (US healthcare): Protected Health Information (PHI) in logs requires encryption at rest, access controls, audit logging, and breach notification procedures.

PCI-DSS: credit card numbers (PANs) and CVVs must never appear in logs. Tokenize before logging.

Practical controls:
1. Redaction at source: scrub PII in the application before logging. Use library-level redaction (e.g., a custom JSON marshaler that masks email fields).
2. Collector-level scrubbing: Fluent Bit/Fluentd can apply regex replacements to redact patterns matching email addresses, credit card numbers, SSNs.
3. Audit logging isolation: separate audit logs (user actions with PII) from operational logs (infrastructure and application events). Apply stricter retention and access policies to audit logs.
4. Pseudonymization: replace user IDs with deterministic keyed hashes (e.g., HMAC) in operational logs. Keep the key or mapping table separately under strict access control.
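Redaction at source and pseudonymization might look like the sketch below. The regular expressions are deliberately simple assumptions and will miss some formats; the placeholder strings and function names are hypothetical. Card-like digit runs are only redacted when they pass the Luhn check, to reduce false positives on ordinary numbers.

```python
import hashlib
import hmac
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13-19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit counting from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def _mask_card(match: re.Match) -> str:
    digits = re.sub(r"\D", "", match.group())
    return "[REDACTED_CARD]" if luhn_valid(digits) else match.group()


def redact(text: str) -> str:
    """Replace emails, US SSNs, and Luhn-valid card numbers with placeholders."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    text = SSN_RE.sub("[REDACTED_SSN]", text)
    return CARD_RE.sub(_mask_card, text)


def pseudonymize(user_id: str, key: bytes) -> str:
    """Deterministic keyed hash: stable per user, not reversible without the key."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


if __name__ == "__main__":
    print(redact("john@example.com paid with 4111 1111 1111 1111"))
    print(pseudonymize("u-789", key=b"example-key-from-a-secret-store"))
```

Using a keyed hash rather than a plain hash matters because user IDs are often low-entropy and could otherwise be recovered by hashing every candidate value.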

Historical Context

Log management evolved through several phases: syslog (created in the 1980s as part of Sendmail, documented in RFC 3164 in 2001, and standardized in RFC 5424 in 2009) defined the first widely used format for remote log transmission. The UNIX philosophy of "everything is text" meant logs were human-readable strings with no schema. Log aggregation at scale became a problem in the 2000s as companies ran hundreds of servers; tools like Splunk (founded 2003) built the first commercial search-over-logs platforms.

The Logstash + Elasticsearch + Kibana (ELK) stack emerged 2012-2014 as an open-source alternative to Splunk, democratizing log search for smaller companies. But Elasticsearch's cost at scale (full-text indexing is expensive) drove the development of cheaper alternatives.

Grafana Loki (2018) applied Prometheus's label model to logs, dramatically reducing storage costs by forgoing full-text indexing. This triggered a shift toward "index the metadata, store the content in object storage" as the dominant architecture pattern for high-volume log systems.

Production Examples

# Query Loki for errors in the last hour from checkout service
logcli query '{app="checkout", env="production"} |= "ERROR"' --limit 100 --since 1h

# Find logs with a specific trace ID
logcli query '{namespace="payments"} | json | trace_id="4bf92f3577b34da6"'

# Count errors per service in last 5 minutes
logcli query 'sum by (app) (rate({env="production"} |= "ERROR" [5m]))'

# Elasticsearch: find all logs for a specific order
curl -X POST http://elasticsearch:9200/logs-*/_search -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must": [
        {"term": {"order_id": "12345"}},
        {"range": {"@timestamp": {"gte": "now-1h"}}}
      ]
    }
  },
  "sort": [{"@timestamp": {"order": "asc"}}],
  "size": 100
}'

Debugging Notes

Logs not appearing in Loki: Check Fluent Bit DaemonSet logs: kubectl logs -n logging -l app=fluent-bit (adjust the label selector to match your deployment). Look for [error] in output. Common causes: Loki endpoint unreachable (check service/DNS), invalid labels (Loki rejects label names containing characters such as dots or hyphens), log parsing failure (malformed JSON).

High log volume from a specific service: Use logcli query 'sum by (pod) (rate({namespace="payments"}[1m]))' to identify the highest-volume pod. Then inspect what log level it's set to — often a DEBUG flag accidentally left on in production.

Log gaps during incidents: Fluent Bit buffers in memory by default. When a tail input reaches Mem_Buf_Limit it pauses ingestion, and lines can be lost if the files are rotated away before it resumes. Raise Mem_Buf_Limit where node memory allows, and enable disk-backed buffering (storage.type filesystem) for critical logs.

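Elasticsearch out of disk: at the default flood-stage watermark (95% disk used), Elasticsearch marks indices with shards on the affected node read-only; shard allocation is already restricted at the low (85%) and high (90%) watermarks. Set an alert at 80% disk usage. Use Index Lifecycle Management (ILM) to automatically delete old indices.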

Security Implications

Log aggregation pipelines often receive logs from all services, making them a high-value target. The aggregation service (Fluentd, Vector) should have minimal network access — only able to write to the storage backend, not read from production databases.
Kibana and Grafana (Loki) dashboards should have RBAC. Different teams should only see logs from their own services/namespaces.
Log lines containing exception stack traces may include internal hostnames, IP addresses, database credentials (if logged carelessly), or filesystem paths. Treat log data as potentially sensitive.
Log injection attacks: if user-controlled data is included in log messages (e.g., a username containing a newline), an attacker can inject fake log entries. Structured JSON logging mitigates this because the encoder escapes newlines in field values, provided each line is built by a JSON encoder rather than by string concatenation.
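The log injection point can be demonstrated directly. In this sketch the attacker-controlled username and both formatting functions are hypothetical; the only library behavior relied on is that json.dumps escapes newline characters inside strings.

```python
import json

# Hypothetical attacker input: a username carrying an embedded fake log line.
MALICIOUS_USERNAME = "bob\n2024-01-15 14:32:02 INFO login succeeded for user admin"


def unstructured_line(username: str) -> str:
    """String interpolation: the newline splits the entry into two lines."""
    return f"2024-01-15 14:32:01 INFO login failed for user {username}"


def structured_line(username: str) -> str:
    """JSON encoding: the newline is escaped as \\n inside the field value."""
    return json.dumps({
        "ts": "2024-01-15T14:32:01Z",
        "level": "INFO",
        "msg": "login.failed",
        "username": username,
    })


if __name__ == "__main__":
    print(len(unstructured_line(MALICIOUS_USERNAME).splitlines()))  # forged second line
    print(len(structured_line(MALICIOUS_USERNAME).splitlines()))    # stays one line
```

The structured record still contains the attacker's text, but only as the value of the username field, so a parser cannot mistake it for a separate event.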

Performance Implications

Synchronous log writes on the hot path add latency. Always use async logging (buffered queue + background writer thread/goroutine).
JSON serialization cost in Go: ~500ns per log entry with slog. In Java (Logback with Jackson): ~2-5 microseconds. This is acceptable for most services.
Fluent Bit CPU on a node with high log volume: typically 0.5-2% CPU. Memory: 50-200MB with default settings.
Loki query performance: queries that narrow the stream set with label selectors are fast. Broad content scans ({app="checkout"} |= "some string" over weeks of data) are slow and expensive because every chunk in the matching streams and time range must be read.
Elasticsearch indexing at high ingest rate requires careful shard sizing. Rule of thumb: shards 20-50GB. Too many small shards degrade query performance.
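The async logging pattern from the first point above (buffered queue plus background writer) can be sketched with Python's standard QueueHandler and QueueListener. The make_async_logger helper and the max_queued default are assumptions for this example.

```python
import logging
import logging.handlers
import queue
import sys


def make_async_logger(name: str, target_handler: logging.Handler,
                      max_queued: int = 10000):
    """Return (logger, listener).

    The request path only enqueues records; a background thread owned by
    the listener performs the actual (possibly slow) write.
    """
    log_queue = queue.Queue(maxsize=max_queued)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # QueueHandler enqueues without blocking; if the bounded queue is full,
    # the record is dropped through the handler's error path.
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.propagate = False
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    listener.start()
    return logger, listener


if __name__ == "__main__":
    log, listener = make_async_logger("checkout", logging.StreamHandler(sys.stdout))
    log.info("order.created")
    listener.stop()  # drains queued records, then joins the writer thread
```

Bounding the queue trades completeness for stability: under a log flood the service drops records instead of growing memory without limit, so the drop count is worth tracking as a metric.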

Failure Modes and Real Incidents

GitLab accidental database deletion and log evidence (2017): When GitLab's database was accidentally deleted, logs were essential for reconstructing what happened, when, and to whom. However, some logs had been lost due to log shipping failures. This incident motivated many companies to implement log archiving to write-once storage (S3 with Object Lock) for audit purposes.

Log flood causing Elasticsearch OOM (production incident pattern): A misconfigured service started logging at DEBUG level in production during a traffic spike, generating 50GB/hour instead of the expected 500MB/hour. Elasticsearch fell behind on indexing, JVM heap exhausted, cluster went read-only. All services depending on log-based alerting went blind. Fix: set per-service log rate limits in Fluentd; alert on log ingest rate per namespace.

Silent PII leak in structured log fields: A team added user_email to their structured log format "temporarily for debugging" and forgot to remove it. GDPR audit found email addresses in 14 months of log archives, triggering a compliance incident. Fix: automated scanning of log schema changes in CI; PII detection (regex patterns for email, SSN, credit card) in the log pipeline.

Modern Usage

Grafana Loki 3.x: now supports bloom filters for faster content search, TSDB index backend, and native OTel log ingestion via OTLP.
Vector (Datadog): Rust-based, extremely fast log/metric/trace router. Replacing Fluentd in some high-throughput deployments for its performance characteristics.
OpenTelemetry Logs: OTLP can now carry log records. Auto-instrumentation can capture application logs and ship them via the same OTLP pipeline as traces and metrics, enabling tight correlation.
Log-based metrics: deriving Prometheus metrics from log patterns (Loki's LogQL metric queries, Datadog log-based metrics) reduces the need for separate metric instrumentation for some signals.

Future Directions

S3 + Iceberg/Delta Lake for log analytics: storing raw logs in S3 in columnar formats with Iceberg/Delta table metadata enables efficient analytical queries (via Trino, Athena, BigQuery) at a fraction of Elasticsearch cost. The trade-off is giving up real-time indexing in exchange for batch-oriented querying.
OTel unified pipeline: as OTel matures for logs, a single OTLP pipeline (one Collector, one protocol) carries metrics, traces, and logs together, simplifying the aggregation stack.
AI log summarization: LLM-based tools that automatically summarize log streams during incidents, highlighting anomalies and generating natural-language explanations of error patterns.

Exercises

Structured log migration: Take an existing service with printf-style logs. Migrate it to structured JSON logging using the appropriate library (slog for Go, structlog for Python, Logback JSON layout for Java). Ensure every log entry includes trace_id, span_id, service, level, msg, and relevant context fields.
Loki vs Elasticsearch cost model: Calculate the monthly storage and infrastructure cost for 1 TB/day of log ingestion with a 30-day retention policy using Elasticsearch (self-hosted on EC2) vs Loki (with S3 backend on EKS). What are the query performance trade-offs?
PII audit: Write a Fluent Bit or Fluentd filter that detects and redacts: email addresses (regex), credit card numbers (Luhn-validated regex), and US Social Security Numbers. Test with sample log lines containing these patterns. Verify the redacted output.
Log pipeline resilience: Design a log pipeline that survives: (a) the Loki/Elasticsearch cluster going down for 30 minutes, (b) the Fluent Bit DaemonSet pod being evicted, (c) a network partition between the logging namespace and storage. For each failure, specify what happens to logs generated during the outage.
Log-based alerting: Using Loki LogQL alerting rules, create alerts for: (a) error rate > 1% in the checkout service for 5 consecutive minutes, (b) any log line containing "panic:" or "fatal error:" from any service, (c) no logs received from the payment service for 5 minutes (absence alert). Test each alert fires correctly.

References

Grafana Loki Documentation: https://grafana.com/docs/loki/
Elasticsearch Guide: https://www.elastic.co/guide/
Fluent Bit Documentation: https://docs.fluentbit.io/
Fluentd Documentation: https://docs.fluentd.org/
Majors, Charity et al. Observability Engineering. O'Reilly, 2022. Chapter 4 (Logs).
Richardson, Chris. "Microservices Patterns." Manning, 2018. Chapter 11 (Observability).
GDPR compliance for logs: https://gdpr.eu/what-is-gdpr/
Gregg, Brendan. Systems Performance. 2nd ed. Addison-Wesley, 2020. Chapter 12 (Benchmarking).