03 — Kafka Internals
Overview
Apache Kafka is a distributed event streaming platform built on the abstraction of a persistent, replicated, ordered log. Created at LinkedIn by Jay Kreps, Neha Narkhede, and Jun Rao and open-sourced in 2011, Kafka was designed to solve a problem that traditional message queues could not: how do you build a unified data pipeline that serves real-time consumers at low latency, lets batch consumers replay historical data, and retains that data on disk for weeks? The answer was to take the commit log, one of the most fundamental data structures in databases, and make it the primary abstraction for a message broker.
Historical Context
Before Kafka, LinkedIn's data pipeline was a custom mess of point-to-point connections: Oracle GoldenGate for database replication, a custom batch pipeline for analytics, separate systems for activity data (clicks, impressions). Adding a new downstream consumer required a new pipeline, a new integration, a new operational burden. Data was siloed and stale.
Jay Kreps had worked on Voldemort (LinkedIn's distributed key-value store) and was influenced by the log-centric architecture of databases. The insight: a database's write-ahead log is the most reliable, most performant way to record and replay state transitions. What if you exposed that log as a service?
Kafka was open-sourced in 2011 and became an Apache top-level project in 2012. In 2014, Kafka's original authors founded Confluent to commercialize it. Today, Kafka is the de facto standard for real-time data pipelines and event-driven architectures.
Core Concepts
Topic, Partition, Offset
Kafka Topic: "user-events"
┌─────────────────────────────────────────────────────────────────┐
│ Partition 0: │
│ [offset 0][offset 1][offset 2][offset 3]...[offset N] → tail │
├─────────────────────────────────────────────────────────────────┤
│ Partition 1: │
│ [offset 0][offset 1][offset 2]...[offset M] → tail │
├─────────────────────────────────────────────────────────────────┤
│ Partition 2: │
│ [offset 0][offset 1]...[offset K] → tail │
└─────────────────────────────────────────────────────────────────┘
A topic is a named category of events. Topics are divided into partitions — ordered, immutable, append-only sequences of records. Each record within a partition has a monotonically increasing offset — its unique position in that partition.
Partitions are the unit of parallelism in Kafka:
- Producer parallelism: Multiple producer threads write to different partitions in parallel.
- Consumer parallelism: Within a consumer group, each partition is assigned to exactly one consumer. With 12 partitions and 12 consumers, each consumer reads one partition, so consumption is fully parallel.
- Replication: Each partition is replicated independently across brokers.
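To illustrate how records spread across partitions, here is a minimal sketch of key-based partition selection. It assumes keyed records and uses CRC32 as a stand-in hash; Kafka's Java client uses murmur2 for keyed records, so the partition numbers below will not match a real cluster. The function name choose_partition is hypothetical.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index in [0, num_partitions)."""
    # Stand-in hash (assumption): real Kafka clients hash keys with murmur2.
    return zlib.crc32(key) % num_partitions


keys = [b"user-1", b"user-2", b"user-1"]
partitions = [choose_partition(k, 6) for k in keys]

# The same key always maps to the same partition, which is what
# preserves per-key ordering within that partition.
print(partitions[0] == partitions[2])
```

The point of the design is that ordering is only guaranteed inside one partition, so records that must stay ordered relative to each other need to share a key.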
Consumer Groups
A consumer group is a named set of consumers that collectively consume a topic. Each partition is consumed by exactly one consumer in the group at any time. Different consumer groups maintain independent offsets — a "real-time" consumer group can be at the tail of the log while a "batch reprocessing" consumer group reads from offset 0.
Topic: user-events (3 partitions)
Consumer Group A (real-time, 3 consumers):
Consumer A1 → Partition 0 (at offset 10000)
Consumer A2 → Partition 1 (at offset 9500)
Consumer A3 → Partition 2 (at offset 10200)
Consumer Group B (analytics, 1 consumer, underprovisioned):
Consumer B1 → Partition 0, Partition 1, Partition 2 (reading all)
(one consumer can read multiple partitions but it is a bottleneck)
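The assignment shown above can be sketched as a simple round-robin over the group's members. This is an illustrative simplification, not Kafka's actual assignor implementation; the function name assign_round_robin is hypothetical.

```python
def assign_round_robin(partitions, consumers):
    """Give every partition exactly one owner within the group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


group_a = assign_round_robin([0, 1, 2], ["A1", "A2", "A3"])
group_b = assign_round_robin([0, 1, 2], ["B1"])

print(group_a)  # {'A1': [0], 'A2': [1], 'A3': [2]}
print(group_b)  # {'B1': [0, 1, 2]}
```

Note that a fourth consumer in Group A would receive an empty list: adding consumers beyond the partition count adds no parallelism.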
Broker Internals
The Partition Log on Disk
A Kafka partition is an append-only log on disk. Internally, it is stored as a set of
segment files in a directory named <topic>-<partition>/:
/kafka-logs/user-events-0/
00000000000000000000.log (segment: offsets 0 - 999999)
00000000000000000000.index (sparse offset → file position index)
00000000000000000000.timeindex
00000000000001000000.log (segment: offsets 1M - 1999999)
00000000000001000000.index
...
00000000000009800000.log (active segment: tail appended here)
Each .log file is a sequence of message batches. Each batch has a header (base offset, count,
CRC, compression type, timestamp range) followed by compressed records.
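The .index file is a sparse offset-to-position index: rather than indexing every record, the broker
adds an entry roughly every log.index.interval.bytes of appended data (4 KB by default), mapping an
offset to its byte position in the .log file. Finding a specific offset requires: locating the
segment whose base offset covers it, a binary search in the index to find the nearest indexed offset
at or below the target, then a sequential scan forward in the .log file.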
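The lookup just described can be sketched as a binary search followed by a forward scan. The index entries and batch layout below are made-up illustration data (10 records and 1024 bytes per batch, one index entry every 4096 bytes), not the real on-disk format.

```python
import bisect

# Hypothetical sparse index for one segment: (offset, byte position), sorted.
SPARSE_INDEX = [(0, 0), (40, 4096), (80, 8192)]

# Stand-in for the .log file: (base offset, last offset, byte position) per batch.
BATCHES = [(b * 10, b * 10 + 9, b * 1024) for b in range(12)]


def find_batch_position(index, batches, target_offset):
    """Return the byte position of the batch containing target_offset, or None."""
    indexed_offsets = [offset for offset, _ in index]
    # Binary search: last index entry whose offset is <= target_offset.
    i = bisect.bisect_right(indexed_offsets, target_offset) - 1
    if i < 0:
        raise KeyError("target offset precedes this segment")
    _, start_position = index[i]
    # Sequential scan forward from the indexed position.
    for base_offset, last_offset, position in batches:
        if position >= start_position and base_offset <= target_offset <= last_offset:
            return position
    return None


print(find_batch_position(SPARSE_INDEX, BATCHES, 57))  # 5120
```

Keeping the index sparse keeps it small enough to stay memory-mapped, at the cost of a short bounded scan per lookup.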
OS Page Cache — No Double Buffering
Kafka intentionally avoids maintaining its own buffer cache. All log.write() and log.read()
go through the Linux VFS and OS page cache. This design has profound implications:
- Write performance: FileChannel.write() writes to the OS page cache, which the kernel's dirty-page writeback flushes to disk asynchronously. Kafka producers see memory-speed writes.
- Read performance (consumer lag = 0): Active consumers reading at the tail of the log read directly from the OS page cache, with no disk I/O at all. The log data written by the producer is still in the page cache.
- JVM heap pressure: Kafka brokers run with a small JVM heap (typically 6–8GB on a machine with 128GB RAM). The remaining 120GB is available to the OS page cache.
- No double buffering: Kafka avoids the double-buffering antipattern where data sits in the JVM heap AND in the OS page cache simultaneously.
Zero-Copy Consumer Reads
When a consumer reads from a partition, Kafka uses FileChannel.transferTo, which maps to the
sendfile(2) system call on Linux, to transfer data directly from the OS page cache to the network
socket without copying it through userspace:
Traditional path (4 copies):
Disk → OS page cache → (read syscall) → JVM heap → (send syscall) → socket buffer → NIC
Zero-copy path (2 copies):
Disk → OS page cache ────────────────────────────────────► NIC
(sendfile: kernel copies directly)
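This is a major throughput win: for consumer reads the broker never copies record data into the JVM heap, which saves CPU cycles, context switches, and garbage-collection pressure. The optimization does not apply on TLS-encrypted listeners, where data must pass through userspace to be encrypted.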
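The same kernel facility is reachable from Python, which makes for a compact illustration. The sketch below is not Kafka code: it sends a temporary file (standing in for a log segment) across a local socket pair using socket.sendfile(), which uses os.sendfile() where the platform supports it and falls back to ordinary send() otherwise. The helper names are hypothetical.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Send a whole file over a connected socket and return the byte count."""
    with open(path, "rb") as f:
        # The file contents are never read into a Python bytes object here.
        return sock.sendfile(f)


def demo() -> bytes:
    payload = b"record-batch;" * 64  # stand-in for a log segment
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    broker_side, consumer_side = socket.socketpair()
    try:
        sent = send_file_zero_copy(tmp.name, broker_side)
        broker_side.close()  # signal end of stream to the reader
        received = b""
        while chunk := consumer_side.recv(4096):
            received += chunk
        assert sent == len(received)
        return received
    finally:
        broker_side.close()
        consumer_side.close()
        os.remove(tmp.name)


print(len(demo()))
```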
Replication: ISR Model
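Kafka uses a leader-follower replication model. Each partition has one leader broker and zero or more follower brokers (replicas). Producers always write to the leader, and consumers read from it by default; since Kafka 2.4 (KIP-392), consumers can optionally fetch from a follower replica.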
In-Sync Replicas (ISR)
The ISR is the set of replicas that are "in sync" with the leader — defined as: have fetched
up to the leader's latest offset within replica.lag.time.max.ms (default: 30 seconds).
A message is considered committed when all replicas in the ISR have acknowledged it. Consumers can only read committed messages.
ISR Example:
Partition leader: Broker 1 (offset 1000)
ISR: {Broker 1, Broker 2, Broker 3}
Producer sends message at offset 1001:
Broker 1 appends → offset 1001 (not yet committed)
Broker 2 fetches → acknowledges offset 1001
Broker 3 fetches → acknowledges offset 1001
ISR all acked → message committed (visible to consumers)
If Broker 3 falls behind (network issue):
ISR shrinks to {Broker 1, Broker 2}
Message is committed when Broker 1 + 2 ack
Broker 3 is removed from ISR (marked as out-of-sync)
When Broker 3 catches up → rejoins ISR
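The commit rule in the example above reduces to taking a minimum over the ISR. Here is a minimal sketch under a simplifying assumption: each broker is tracked by the last offset it has replicated, whereas real brokers track log end offsets and a high watermark. The function name is hypothetical.

```python
def last_committed_offset(last_replicated, isr):
    """Highest offset that every in-sync replica has acknowledged."""
    return min(last_replicated[broker] for broker in isr)


# Broker 3 has not yet fetched offset 1001.
last_replicated = {1: 1001, 2: 1001, 3: 1000}

print(last_committed_offset(last_replicated, {1, 2, 3}))  # 1000
# After Broker 3 is dropped from the ISR, offset 1001 becomes committed.
print(last_committed_offset(last_replicated, {1, 2}))  # 1001
```

This is why a slow follower delays commits only until it is removed from the ISR: once removed, it no longer participates in the minimum.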
Leader Epoch
The leader epoch is a monotonically increasing integer that is incremented each time a new leader is elected for a partition. Records and fetch requests carry the epoch, which prevents log divergence: a broker recovering from a failure uses epochs to truncate any records it accepted from a stale leader that the new leader never had, and requests carrying an outdated epoch are rejected.
Preferred Replica Election
By default, the first broker in a partition's replica list is the "preferred leader." If a different broker becomes leader (due to failover), Kafka's automatic leader rebalancing (auto.leader.rebalance.enable, on by default) will eventually re-elect the preferred replica as leader, which keeps partition leadership evenly distributed across brokers.
Producer Semantics
Acknowledgment Levels
acks=0: Fire and forget. No response from broker. Fastest, lowest durability.
acks=1: Leader acknowledges after writing to local log. Follower lag → potential loss.
acks=all: All ISR members must acknowledge. Highest durability; adds replication latency.
acks=all with min.insync.replicas=2 is the standard durability configuration: requires
at least 2 ISR members (leader + 1 follower), rejects writes if ISR shrinks below 2.
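The interaction between acks and min.insync.replicas can be sketched as a small decision function. This is a simplified model of the broker-side check, with a hypothetical function name; a real broker rejects the write with a NotEnoughReplicas error.

```python
def broker_accepts_write(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Decide whether the leader accepts a produce request."""
    # min.insync.replicas is only enforced for acks=all; with acks=0 or
    # acks=1 the leader accepts the write regardless of ISR size.
    if acks == "all":
        return isr_size >= min_insync_replicas
    return True


print(broker_accepts_write("all", isr_size=3, min_insync_replicas=2))  # True
print(broker_accepts_write("all", isr_size=1, min_insync_replicas=2))  # False
print(broker_accepts_write("1", isr_size=1, min_insync_replicas=2))    # True
```

The trade-off is explicit: with replication factor 3 and min.insync.replicas=2, the partition tolerates one broker failure without losing availability for writes and without risking acknowledged data.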
Idempotent Producer
The idempotent producer (enabled by enable.idempotence=true) adds a producer ID (PID) and
sequence number to each message batch. If the broker receives a duplicate (retransmission after
a network failure), it deduplicates based on (PID, sequence number) and returns success without
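writing again. This upgrades the producer from at-least-once to exactly-once delivery, scoped to a single partition and a single producer session. Idempotence is enabled by default since Kafka 3.0.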
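The deduplication logic can be sketched as follows. This is a deliberately simplified model with hypothetical names: it tracks one sequence number per record, whereas a real broker tracks sequence ranges per batch and remembers only a bounded number of recent batches per producer.

```python
class PartitionLog:
    """Toy partition log that deduplicates on (producer ID, sequence number)."""

    def __init__(self):
        self.records = []
        self.last_sequence = {}  # producer ID -> last appended sequence number

    def append(self, producer_id: int, sequence: int, value: str) -> str:
        last = self.last_sequence.get(producer_id, -1)
        if sequence <= last:
            # Retransmission of something already written: ack, do not append.
            return "duplicate"
        if sequence != last + 1:
            raise ValueError("out-of-order sequence number")
        self.records.append(value)
        self.last_sequence[producer_id] = sequence
        return "appended"


log = PartitionLog()
print(log.append(7, 0, "order-created"))  # appended
print(log.append(7, 0, "order-created"))  # duplicate (producer retried)
print(log.append(7, 1, "order-paid"))     # appended
print(log.records)                        # ['order-created', 'order-paid']
```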
Transactions
Kafka transactions (Kafka 0.11+) enable atomic writes across multiple partitions: all messages
in a transaction either commit (and become visible to consumers with isolation.level=
read_committed) or abort. The transactional coordinator protocol uses a two-phase commit
within Kafka itself (the __transaction_state internal topic serves as the coordinator log).
Kafka Connect and Kafka Streams
Kafka Connect is a framework for building scalable, reliable connectors between Kafka and external systems (databases, S3, Elasticsearch, HDFS). Connectors are plugins that run in the Connect worker process. Source connectors pull from external systems into Kafka topics; sink connectors push from Kafka topics to external systems. Kafka Connect handles offset management, error handling, schema registry integration, and distributed worker coordination.
Kafka Streams is a client library (not a separate cluster) for building stream processing applications on top of Kafka topics. It provides a DSL for transformations, joins, and aggregations, with stateful operations backed by a local RocksDB store; each store is backed up to a Kafka changelog topic for fault tolerance. Unlike Apache Flink or Spark Streaming, Kafka Streams runs inside the application process, so no separate processing cluster is needed.
KRaft: Removing ZooKeeper
For most of Kafka's history, ZooKeeper was required for: broker metadata, controller election, topic configuration, consumer group coordination (older clients), and ISR management. ZooKeeper was a significant operational dependency and a scaling bottleneck for clusters with many partitions.
KRaft (Kafka Raft, KIP-500) replaces ZooKeeper with a Raft-based metadata quorum built into
Kafka itself:
- The KRaft controller quorum (typically 3 or 5 dedicated controller nodes) uses the Raft
consensus protocol to maintain cluster metadata.
- Metadata is stored in an internal __cluster_metadata Kafka topic (a log, naturally).
- Brokers fetch metadata from the controller quorum rather than ZooKeeper.
KRaft was declared production-ready in Kafka 3.3 (2022). ZooKeeper mode was deprecated in Kafka 3.5 (2023) and removed in Kafka 4.0 (2025). The operational benefit: one fewer distributed system to run. The scaling benefit: KRaft targets clusters on the order of 1 million partitions (vs roughly 200K with ZK).
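The choice of 3 or 5 controllers follows from majority-quorum arithmetic, which applies to Raft-style protocols in general. A small sketch, with hypothetical function names:

```python
def majority(quorum_size: int) -> int:
    """Votes needed to elect a leader or commit a metadata record."""
    return quorum_size // 2 + 1


def tolerated_failures(quorum_size: int) -> int:
    """Controllers that can fail while the quorum stays available."""
    return quorum_size - majority(quorum_size)


for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n))
# 3 controllers tolerate 1 failure, 4 still tolerate only 1, 5 tolerate 2.
```

An even-sized quorum adds a node without adding fault tolerance, which is why odd sizes are the usual choice.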
Kafka vs Competing Systems
┌─────────────────┬─────────────────────────────────┬─────────────────────────────────┐
│ Property │ Kafka │ RabbitMQ │
├─────────────────┼─────────────────────────────────┼─────────────────────────────────┤
│ Model │ Log (append-only, replay) │ Queue (messages deleted on ack) │
│ Ordering │ Per-partition strict ordering │ Per-queue (with caveats) │
│ Replay │ Yes (seek to any offset) │ No (messages consumed once) │
│ Throughput │ Very high (millions msg/s) │ Moderate (100K msg/s) │
│ Latency │ Low ms │ Sub-ms possible │
│ Consumer model │ Pull │ Push │
│ Use case │ Event streaming, CDC, pipelines │ Task queues, work distribution │
├─────────────────┼─────────────────────────────────┼─────────────────────────────────┤
│ Property │ Kafka │ Apache Pulsar │
├─────────────────┼─────────────────────────────────┼─────────────────────────────────┤
│ Storage │ Broker-local (+ Tiered Storage) │ Apache BookKeeper (separate) │
│ Multi-tenancy │ Limited │ Built-in namespaces │
│ Geo-replication │ MirrorMaker 2 (complex) │ Built-in │
│ Scaling │ Partition rebalancing required │ Segment-level auto-scale │
│ Maturity │ Very high (13+ years) │ High but younger │
└─────────────────┴─────────────────────────────────┴─────────────────────────────────┘
Debugging Notes
Consumer lag: kafka-consumer-groups.sh --bootstrap-server <host:port> --describe --group <group>
shows lag per partition. A small, stable lag is normal; lag that keeps growing means consumers are
slower than producers. Check consumer CPU (deserialization bottleneck), downstream system
throughput, and network bandwidth.
ISR shrinkage: Watch for UnderReplicatedPartitions metric. ISR shrinkage can be caused by:
slow follower disk I/O, GC pauses on follower brokers, or network congestion. Any ISR shrinkage
with acks=all reduces write durability.
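Log retention and disk usage: Kafka retains by time (log.retention.hours) or by size
(log.retention.bytes). Retention is independent of consumer progress: a record is never deleted
because it was consumed, and never kept because it was not. Unexpected disk growth usually comes
from a higher produce rate than planned, a per-topic retention override, or time-only retention on
a high-volume topic. Note that log.retention.bytes applies per partition, and only closed segments
are eligible for deletion. Use size-based retention or log compaction to bound such topics.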
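A rough sketch of how time-based and size-based retention select segments for deletion. It is a simplified model with hypothetical names and sample data: segments are examined oldest first, deletion stops at the first segment that must be kept, and the active segment is never deleted.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    newest_timestamp_ms: int


def segments_to_delete(segments, now_ms, retention_ms=None, retention_bytes=None):
    """segments is ordered oldest first; the last entry is the active segment."""
    total = sum(s.size_bytes for s in segments)
    doomed = []
    for s in segments[:-1]:  # never consider the active segment
        too_old = retention_ms is not None and now_ms - s.newest_timestamp_ms > retention_ms
        # Size rule: delete only if the log still meets the size limit afterwards.
        too_big = retention_bytes is not None and total - s.size_bytes >= retention_bytes
        if not (too_old or too_big):
            break
        doomed.append(s.base_offset)
        total -= s.size_bytes
    return doomed


log = [
    Segment(0, 100, 1_000),
    Segment(1000, 100, 4_000),
    Segment(2000, 100, 8_000),
    Segment(3000, 100, 9_500),  # active segment
]

print(segments_to_delete(log, now_ms=10_000, retention_ms=5_000))    # [0, 1000]
print(segments_to_delete(log, now_ms=10_000, retention_bytes=250))   # [0]
```

Nothing in this decision looks at consumer offsets, which is the practical reason a slow or absent consumer can silently miss data once retention passes.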
Producer batch size tuning: Small batches (< 16KB) dramatically reduce throughput due to
per-batch overhead. Tune batch.size and linger.ms for high-throughput producers.
Security Implications
- Kafka ACLs: Topic-level ACLs (stored in ZooKeeper or KRaft) control producer/consumer access. Without ACLs, any client in the network can read or write any topic.
- TLS in transit: Without TLS, Kafka traffic (potentially containing sensitive events) is cleartext on the network. Enable SSL listeners for all production deployments.
- Schema Registry + Avro: Without schema enforcement, a buggy producer can write garbage bytes that crash all downstream consumers. Confluent Schema Registry, used with its schema-aware serializers, validates Avro/JSON schemas on the client side before messages are written.
- Uncontrolled consumer groups: Without ACLs, any consumer can create a consumer group and consume from any topic, including internal compacted topics like __consumer_offsets.
Performance Implications
- Partition count: More partitions = more parallelism = higher throughput, but: more file handles (OS limit!), longer leader election time on broker failure, more memory per broker. Rule of thumb: start with partitions = 2 × max_consumers, multiply by throughput headroom.
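- Segment size: Larger segments mean fewer files and fewer memory-mapped index files, but retention and compaction only act on closed segments, so old data lingers longer. Smaller segments let retention and compaction act sooner but increase the number of open file handles and memory-mapped indexes.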
- Compression: LZ4 is the best latency/throughput tradeoff for most Kafka workloads. Zstandard (zstd) provides better compression ratios for storage-heavy topics.
Failure Modes
| Failure | Detection | Recovery |
|---|---|---|
| Broker crash | Controller detects missing heartbeat | Leader election for all partitions on that broker (~30s for large clusters) |
| ISR shrink to 1 | UnderReplicatedPartitions > 0 | acks=all writes blocked if min.insync.replicas not met |
| ZooKeeper loss (old mode) | Brokers lose their ZK sessions | Existing leaders keep serving, but no leader elections or metadata changes until ZK is restored |
| KRaft controller failure | Raft re-election | ~5s failover for 3-node quorum |
| Disk full on broker | I/O errors (no space left on device); log directory goes offline | Producers get errors; operator must expand or purge |
Future Directions
Tiered Storage: Kafka 3.6 introduced tiered storage (KIP-405) as an early-access feature: older log segments are offloaded to object storage (S3, GCS) while recent segments stay on local disk. Brokers can serve reads from object storage for historical consumption. This separates storage scaling from compute scaling.
KIP-932 (Queues for Kafka): A proposal to add a "share group" consumer model — unlike consumer groups (each partition to one consumer), share groups allow multiple consumers to process messages from the same partition (like a traditional queue). Useful for work distribution workloads.
Exercises
- Set up a local Kafka cluster (KRaft mode) with 3 brokers. Create a topic with 6 partitions and replication factor 3. Kill one broker and observe ISR changes via kafka-topics.sh.
- Write a producer that sends 1 million messages and measure throughput with acks=1 vs acks=all. Explain the difference.
- Trace the full lifecycle of a message: from producer.send() through network, broker storage, replication, consumer.poll(), to commit. Identify every I/O operation.
- Implement a simple consumer group with 3 consumers reading from a 3-partition topic. Then kill one consumer and observe partition rebalancing.
- Configure log compaction on a topic. Send 100 messages with the same key. Verify via kafka-dump-log.sh that older records are compacted away.
References
- Kreps, J., Narkhede, N., Rao, J. "Kafka: A Distributed Messaging System for Log Processing." NetDB Workshop, 2011.
- Kreps, J. "The Log: What every software engineer should know about real-time data's unifying abstraction." LinkedIn Engineering Blog, 2013.
- Kafka Documentation: https://kafka.apache.org/documentation/
- KIP-500 (KRaft): https://cwiki.apache.org/confluence/x/KoxiBw
- Confluent Kafka Internals Blog: https://www.confluent.io/blog/how-kafka-is-used-by-netflix/