Message Queues and Event-Driven Architecture

Overview

Message queues and event-driven architectures solve the coupling problem in distributed systems: when Service A calls Service B synchronously, both must be available simultaneously, both must agree on the interface, and slow B responses slow A. Message queues decouple producers from consumers — the producer writes a message and continues without waiting; the consumer reads and processes at its own pace. This simple idea enables some of the most scalable and resilient distributed systems ever built.

Beyond basic decoupling, the event-driven model enables fundamentally different architectural patterns: event sourcing (state as a log of events), CQRS (separate read and write models), and saga choreography (distributed transactions via events). Understanding when and how to apply these patterns — and their tradeoffs — is essential for senior system design.

Prerequisites

Understanding of distributed systems basics: network partitions, at-least-once delivery
Familiarity with TCP and basic networking
Understanding of database transactions and ACID properties
Knowledge of consumer groups and parallel processing concepts
Basic familiarity with Apache Kafka or RabbitMQ

Historical Context

Message queuing predates the internet. IBM's Information Management System (IMS) in 1968 included message queuing for batch processing. The formalization came with MQ Series (now IBM MQ) in 1993, which standardized message persistence, delivery guarantees, and pub/sub patterns for enterprise middleware.

The internet era brought new challenges: millions of concurrent users, global distribution, and requirements for real-time processing. AMQP (Advanced Message Queuing Protocol) was developed in 2003 by JPMorgan Chase to standardize financial messaging, leading to RabbitMQ's implementation.

Apache Kafka (2011, LinkedIn) represented a paradigm shift. Rather than a traditional message queue with delete-on-consume semantics, Kafka is a distributed commit log: messages are retained for a configurable period, any consumer can re-read from any offset, and write throughput is orders of magnitude higher than traditional queues. LinkedIn built Kafka to handle 1 trillion messages/day across their pipeline.

AWS SQS (2004) was one of AWS's original services — predating EC2. Its simplicity (no schema, no protocol, just HTTP) made it the default choice for serverless and cloud-native architectures.

KEDA (2019) and cloud-native event bridges (EventBridge, Pub/Sub) represent the latest generation: event-driven compute that scales to zero.

Message Semantics

The delivery guarantee is the most critical design decision for any messaging system:

  Delivery Semantics Spectrum:

  At-Most-Once (fire-and-forget):
  ┌──────────┐    network may drop    ┌──────────┐
  │ Producer │ ─────────────────────> │ Consumer │
  └──────────┘  no retry, no ack     └──────────┘

  Behavior: message may be lost, never delivered twice
  Use case: metrics/telemetry where occasional loss acceptable
  Example: UDP syslog, statsd counters, live game position updates

  At-Least-Once (retry until ack):
  ┌──────────┐    send + ack     ┌──────────┐
  │ Producer │ ────────────────> │ Consumer │
  │          │ <── ACK ─────────┤          │
  │          │                  │          │
  │          │  retry on timeout│          │
  │          │ ───────────────> │          │
  └──────────┘  possible dup    └──────────┘

  Behavior: message delivered at least once, may be duplicated
  Requirement: CONSUMERS MUST BE IDEMPOTENT
  (processing same message twice produces same result)
  Example: email delivery, order processing with idempotency keys

  Exactly-Once (hardest):
  Requires: distributed coordination (2PC or Kafka transactions)

  Kafka exactly-once (producer side):
    - Producer gets unique producerID + epochID from broker
    - Each message has a sequence number
    - Broker deduplicates: rejects duplicate sequence numbers
    - BUT: only within a single topic-partition!

  End-to-end exactly-once (consumer side):
    - Consumer commits offset AND processes message in same transaction
    - Requires transactional consumer + idempotent processing
    - Only feasible when sink is a transactional system (DB)

  Real-world guidance:
    Design for at-least-once + idempotent consumers.
    True exactly-once is expensive and brittle; avoid if possible.
    Use idempotency keys (UUID per message) to deduplicate cheaply.

Apache Kafka Architecture

Kafka is the dominant event streaming platform for high-throughput use cases:

  Kafka Topic/Partition/Consumer-Group Model:

  Topic: "orders" (logical stream of events)

  Partitions (ordered, immutable log):

  Partition 0: [msg0, msg1, msg4, msg7, ...]
                 ^                        ^
                 offset 0             latest offset

  Partition 1: [msg2, msg5, msg8, ...]

  Partition 2: [msg3, msg6, msg9, ...]

  Consumer Group "order-processor":
  ┌─────────────────────────────────────────────────┐
  │                 Consumer Group                   │
  │  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
  │  │Consumer 1│  │Consumer 2│  │Consumer 3│      │
  │  │          │  │          │  │          │      │
  │  │Part. 0 ──┘  │Part. 1 ──┘  │Part. 2 ──┘      │
  └─────────────────────────────────────────────────┘

  Rules:
  - Each partition assigned to exactly ONE consumer per group
  - One consumer can handle multiple partitions
  - Cannot have more consumers than partitions (excess idle)
  - Different consumer groups read independently (own offset tracking)

  Message retention:
  - Messages NOT deleted on consume (unlike traditional queues)
  - Retained for configured period (default 7 days) or size limit
  - Consumers track own offset: can re-read, replay, or skip

  Consumer Group "analytics" reads same "orders" topic
  independently from "order-processor" group

Kafka Replication (ISR)

  ISR (In-Sync Replicas) for partition durability:

  Partition 0:
  ┌─────────────┐   replicate   ┌─────────────┐   replicate   ┌─────────────┐
  │ Broker 1    │ ─────────────>│ Broker 2    │ ─────────────>│ Broker 3    │
  │ (Leader)    │               │ (Follower)  │               │ (Follower)  │
  │ offset: 100 │               │ offset: 100 │               │ offset: 99  │
  └─────────────┘               └─────────────┘               └─────────────┘

  ISR = {Broker1, Broker2}  (Broker3 lagging → not in ISR)

  Producer acks:
  - acks=0: fire-and-forget (fastest, data loss possible)
  - acks=1: leader acknowledges (data lost if leader crashes before replication)
  - acks=all (or -1): ALL ISR members acknowledge (durability, higher latency)

  min.insync.replicas=2 + acks=all:
    Write rejected if fewer than 2 replicas available
    Protects against data loss at cost of availability

Kafka Use Cases

  1. Activity Tracking (LinkedIn origin use case):
     User clicks, page views, search queries → Kafka → Hadoop/Spark
     Scale: billions of events/day, multiple consumers for analytics

  2. Metrics Pipeline:
     App metrics → Kafka → Prometheus/Druid/InfluxDB
     Decouples metric producers from metric stores

  3. Event Streaming / Stream Processing:
     Orders → Kafka → Kafka Streams / Flink → real-time fraud detection
     State stores, windowed aggregations, joins

  4. CDC (Change Data Capture):
     Database → Debezium (reads binlog/WAL) → Kafka → 
     downstream consumers (search index, cache invalidation, analytics)
     Keeps derived systems in sync without application-level dual writes

  5. Log Aggregation:
     Service logs → Filebeat → Kafka → Elasticsearch/S3
     Replaces rsyslog; adds replay capability

Consumer Lag Monitoring

Consumer lag is the difference between the latest message offset and the last committed offset by a consumer group. High lag means the consumer is falling behind producers:

# Check consumer group lag:
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group order-processor

# Output:
# GROUP          TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# order-processor orders  0          1000423         1000500         77
# order-processor orders  1          999100          999150          50
# order-processor orders  2          1001000         1001000         0

# Total lag = 77 + 50 + 0 = 127 messages behind

# Alert thresholds (Prometheus):
# kafka_consumergroup_lag > 1000 for 5 minutes → page on-call
# kafka_consumergroup_lag_sum > 100000 → critical

# Common causes of high lag:
# - Consumer too slow (CPU-bound processing, slow DB writes)
# - Consumer crash / rebalancing (group coordinator reassigning partitions)
# - Producer burst exceeded consumer capacity
# - Downstream dependency (DB, external API) slow or down

RabbitMQ: AMQP Model

RabbitMQ uses a different model than Kafka: messages flow through exchanges and queues, and are deleted on acknowledgment.

  RabbitMQ Architecture:

  Producer → Exchange → Binding → Queue → Consumer

  Exchange types:

  Direct:    routing_key == binding_key (exact match)
             Use: specific worker pools

  Topic:     routing_key matches pattern (wildcards: * one word, # zero+)
             e.g., "order.us.*" matches "order.us.created", "order.us.failed"
             Use: filtered subscriptions

  Fanout:    broadcast to ALL bound queues (ignores routing_key)
             Use: cache invalidation, event notification

  Headers:   route by message headers (flexible, less common)

  Dead Letter Exchange (DLX):

  Queue → Consumer fails → message rejected/expired → 
  Dead Letter Exchange → Dead Letter Queue

  Use: inspect failed messages without losing them.

  Message flow:

  Publisher                   RabbitMQ                    Consumer
  ────────                   ─────────                   ────────
  publish("order.created")   
  → exchange "events"        
  → topic exchange           
  → queue "orders-svc" (bound to "order.*")
  → queue "analytics" (bound to "#")
                                                          receive msg
                                                          process msg
                                                          channel.basicAck(deliveryTag)
                                              ←── ACK ───
                             delete from queue

RabbitMQ vs Kafka decision guide:

  Choose RabbitMQ when:
  - Complex routing (topic/headers exchange patterns)
  - Task queue semantics (each task processed once, then gone)
  - Per-message TTL or priority queues needed
  - Small-medium scale (<100k msg/s)
  - Need push-based delivery to consumers

  Choose Kafka when:
  - High throughput (>1M msg/s)
  - Message replay needed (audit log, reprocessing)
  - Multiple independent consumer groups (each reads full stream)
  - Event sourcing / CDC / stream processing
  - Long retention (days to forever)
  - Ordered processing within a partition is required

AWS SQS and SNS

  SQS (Simple Queue Service):

  - Pull-based: consumers poll for messages
  - Visibility timeout: message hidden for N seconds while processed
    (prevents duplicate processing by concurrent consumers)
  - If consumer crashes before ack: message becomes visible again
    (at-least-once delivery)
  - Dead-letter queue (DLQ): after maxReceiveCount failures, move to DLQ

  SQS Standard: at-least-once, best-effort ordering
  SQS FIFO:     exactly-once, strict ordering, 3,000 msg/s limit

  SNS (Simple Notification Service):

  - Push-based pub/sub: SNS pushes to subscribers
  - Fan-out: one SNS topic → multiple SQS queues + HTTP endpoints + Lambda
  - Use: broadcast an event to multiple independent processors

  SNS → SQS fan-out pattern:

  Order Created Event
        │
        ▼
  SNS topic "order-events"
        │
  ┌─────┼─────┬─────────┐
  ▼           ▼         ▼
  SQS:        SQS:      SQS:
  "analytics" "email"   "inventory"
  (Redshift)  (SendGrid)(warehouse)

Event-Driven Patterns

Event Sourcing

Instead of storing current state, store the sequence of events that produced that state:

  Traditional (state-based):
  Orders table: {id: 123, status: "shipped", total: 99.99}

  Event Sourcing:
  Event log for order 123:
  1. OrderCreated   {items: [...], total: 99.99}   t=10:00
  2. PaymentApplied {amount: 99.99, method: "cc"} t=10:01
  3. ItemPicked     {warehouseId: "NYC-1"}         t=10:30
  4. Shipped        {trackingId: "1Z..."}          t=11:00

  Current state = "fold" (reduce) event log

  Benefits:
  - Full audit trail (every state change recorded)
  - Time travel: replay events to any point in time
  - Event stream is the canonical record (for Kafka CDC)

  Drawbacks:
  - Querying current state requires replaying events (use snapshots)
  - Schema evolution is hard (old events must stay valid)
  - Event ordering complexity (concurrent commands)

CQRS (Command Query Responsibility Segregation)

  CQRS separates write model from read model:

  Write side (Command):
    Command: PlaceOrder(items, customerId)
    → Validate → Event: OrderPlaced
    → Append to event store (append-only)

  Read side (Query):
    Event: OrderPlaced → update read models:
    - orders_by_customer materialized view (for "my orders" page)
    - inventory_summary view (for stock levels)
    - analytics_cube (for business intelligence)

  Client queries hit read models (denormalized, fast)
  Client commands hit write model (consistent, normalized)

  Tradeoff: eventual consistency between write and read sides
  Read models updated asynchronously via event handlers

Saga Pattern for Distributed Transactions

  Problem: Order service needs to:
  1. Reserve inventory
  2. Charge payment  
  3. Dispatch shipping

  Each in a different service. No distributed 2PC.

  Saga (choreography via events):

  OrderService        InventoryService     PaymentService     ShippingService
       │                    │                   │                   │
  OrderCreated ────────────>│                   │                   │
       │              InventoryReserved ────────>│                   │
       │                    │             PaymentCharged ────────────>
       │                    │                   │          ShipmentDispatched
       │ ◄─────────────────────────────────────────────── OrderFulfilled

  Compensating transactions (rollback on failure):

  If PaymentFailed:
       │                    │              PaymentFailed             │
       │ ◄─────────────────── InventoryReleased (compensation)      │
  OrderCancelled             │                                       │

  Orchestration saga (alternative):
    SagaOrchestrator calls each service in sequence,
    tracks state, issues compensations if any step fails.
    More explicit control flow, but orchestrator is a bottleneck.

Debugging Notes

# Kafka debugging:

# Check broker health:
kafka-broker-api-versions.sh --bootstrap-server kafka:9092

# List topics and partition counts:
kafka-topics.sh --bootstrap-server kafka:9092 --list
kafka-topics.sh --bootstrap-server kafka:9092 --describe --topic orders

# Consume messages from a topic (from beginning):
kafka-console-consumer.sh --bootstrap-server kafka:9092 \
  --topic orders --from-beginning --max-messages 10

# Check ISR status (under-replicated partitions = broker issue):
kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --under-replicated-partitions

# Reset consumer group offset (to re-process):
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group order-processor --topic orders \
  --reset-offsets --to-earliest --execute

# RabbitMQ debugging:
# Check queue depths:
rabbitmqctl list_queues name messages consumers
# Look for queues with high message count and 0 consumers

# Check for unacked messages (consumers connected but not acking):
rabbitmqctl list_queues name messages_unacknowledged messages_ready

# SQS debugging:
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789/orders \
  --attribute-names ApproximateNumberOfMessages,\
    ApproximateNumberOfMessagesNotVisible

Security Implications

Kafka topics containing PII (Personally Identifiable Information) should use encryption at rest (Kafka supports SCRAM + TLS for transport, but disk encryption is OS-level).
Kafka ACLs (access control lists) restrict which client IDs can produce/consume which topics. Always enable in multi-tenant clusters.
Message queues create attack surfaces: a compromised producer can inject malicious messages that exploit deserialization vulnerabilities in consumers. Use schema registries (Avro/Protobuf) to validate message shape.
RabbitMQ management UI (port 15672) should never be publicly exposed. VPN-only access.
Dead-letter queues can accumulate sensitive data indefinitely. Apply retention policies and access controls to DLQs.

Performance Implications

Kafka throughput is dominated by disk sequential write speed. Batch writes (linger.ms, batch.size) dramatically improve throughput at cost of added latency.
Consumer throughput: number of partitions = parallelism ceiling. Under-partitioned topics cannot be scaled by adding consumers.
RabbitMQ performance degrades sharply when queue depth grows large (in-memory queue structures). Design for steady-state consumption; use lazy queues (disk-backed) for bursty producers.
Message size matters: Kafka is optimized for many small messages, not few large ones. Split large payloads (>1MB) into chunks or store in object storage and put the reference in the message.

Failure Modes

Symptom	Cause	Fix
Consumer lag grows unbounded	Consumer too slow	Scale consumers (add partitions first)
Messages lost on broker restart	acks=1, no min.insync.replicas	Use acks=all + min.insync.replicas=2
Duplicate messages processed	Consumer crash before ack, restart redelivers	Idempotent consumer with dedup by message ID
Kafka partition assignment stuck	Consumer group rebalancing in progress	Check for stuck consumers with long processing times
RabbitMQ memory alarm	Queue backed up, consumers down	Check consumers, restart, clear with shovel plugin
Poison pill message	Bad message causes consumer to crash + restart	Dead-letter after N failures; inspect DLQ

Modern Usage

Kafka Streams and Apache Flink enable stateful stream processing directly on Kafka topics without a separate batch layer. Used for real-time fraud detection, personalization, and operational analytics at companies like LinkedIn, Uber, and Lyft.
Outbox pattern: Ensures database writes and message publishes are atomic without distributed transactions. Write to an "outbox" table in the same DB transaction; a separate process reads the outbox and publishes to Kafka/SQS.
CloudEvents: CNCF standard for event schema and metadata, enabling interoperability between event systems (Kafka, EventBridge, Pub/Sub) and serverless runtimes.

Future Directions

Pulsar vs Kafka: Apache Pulsar separates compute (brokers) from storage (BookKeeper), enabling independent scaling. Growing adoption for multi-tenancy and tiered storage use cases.
NATS JetStream: Lightweight, high-performance messaging with persistence, designed for edge/IoT and low-latency microservices.
WarpStream / Confluent Shared Kafka: Kafka-compatible APIs backed by object storage (S3) — eliminates disk management, enables infinite retention at S3 prices.

Exercises

Deploy a local Kafka (via Docker Compose) with 3 brokers and create a topic with 6 partitions. Write a producer sending 10,000 messages. Run 3 consumers in the same group. Verify each partition is owned by exactly one consumer. Kill one consumer and observe partition rebalancing.
Implement the outbox pattern: write a service that inserts a record into a database AND publishes an event, using the outbox table to guarantee atomicity. Test behavior when the message broker is unavailable.
Implement an idempotent consumer that uses a Redis set to track processed message IDs. Simulate duplicate delivery by sending the same message twice. Verify the business logic runs exactly once.
Set up consumer lag alerting: configure a Prometheus alert that fires when consumer group lag exceeds 10,000 messages for 5 minutes. Trigger the alert by pausing a consumer, then verify recovery.
Implement a saga for a simplified order flow (order → payment → inventory). Use three separate services communicating via events. Test the compensating transaction path: simulate payment failure and verify inventory is released.

References

Jay Kreps, "The Log: What every software engineer should know about real-time data's unifying abstraction" — LinkedIn Engineering Blog (2013, foundational Kafka essay)
"Kafka: The Definitive Guide" — Gwen Shapira, Neha Narkhede, Todd Palino, O'Reilly
"Designing Data-Intensive Applications" — Martin Kleppmann, Chapter 11 (stream processing)
"Enterprise Integration Patterns" — Gregor Hohpe & Bobby Woolf (canonical messaging patterns reference)
Chris Richardson, "Microservices Patterns" — Chapter 4 (Saga pattern)
AMQP 0-9-1 specification: rabbitmq.com/protocol.html
Confluent Kafka documentation: docs.confluent.io
AWS SQS Developer Guide: docs.aws.amazon.com/sqs
CloudEvents specification: cloudevents.io