Kafka Broker Internals: Log Storage and Replication
To build highly scalable and fault-tolerant streaming systems, you must understand how Apache Kafka manages data under the hood. At its core, a Kafka broker is a highly optimized commit log manager. It does not use a traditional relational database or a complex tree-based storage engine. Instead, it writes incoming messages sequentially to the disk and replicates them across multiple brokers to guarantee durability.
In this guide, we will dive deep into the internals of Kafka broker storage, dissect the structure of log segments, and explore the mechanics of partition replication, ISR (In-Sync Replicas), and data recovery.
1. Log Storage Anatomy: How Kafka Stores Data on Disk
In Kafka, a topic is a logical concept. Physically, a topic is divided into one or more partitions. Each partition maps directly to a directory on the broker's local file system. If you look inside a Kafka data directory, you will find folders named in the format of topic-name-partition-index.
Inside each partition directory, Kafka stores messages in append-only log files. Writing to a single giant file would make cleanup and searches incredibly slow. To solve this, Kafka splits each partition into smaller physical files called Log Segments.
The Structure of a Log Segment
A log segment consists of three primary file types on disk, all sharing the same base name which represents the starting offset of that segment:
- The Log File (
.log): This file contains the actual message payloads, headers, and metadata written sequentially. - The Offset Index File (
.index): A sparse index that maps Kafka logical offsets to physical byte positions within the.logfile. This allows Kafka to perform fast O(1) lookups when a consumer requests data from a specific offset. - The Time Index File (
.timeindex): This file maps message timestamps to logical offsets, enabling consumers to search and consume data starting from a specific point in time.
Partition Directory: my-topic-0/
โ
โโโ 00000000000000000000.log (Messages from offset 0 to 999)
โโโ 00000000000000000000.index (Sparse index for offset lookup)
โโโ 00000000000000000000.timeindex (Timestamp index for time-based lookup)
โ
โโโ 00000000000000001000.log (Active segment: messages from offset 1000+)
โโโ 00000000000000001000.index
โโโ 00000000000000001000.timeindex
Active Segment vs. Completed Segments
At any given time, only one segment is open for writing. This is called the Active Segment. When the active segment reaches its size limit (configured by log.segment.bytes, defaulting to 1 GB) or time limit (configured by log.roll.hours, defaulting to 7 days), Kafka rolls it. The active segment becomes read-only, and a new active segment is created to receive new writes.
2. Deep Dive into Replication Mechanics
Replication is the backbone of Kafka's high availability. Each partition can have multiple copies distributed across different brokers in the cluster. The number of copies is defined by the replication factor.
Leader and Follower Replicas
For every partition, one broker is designated as the Leader, and the remaining replicas are designated as Followers.
- The Leader: Handles all read and write requests from clients (producers and consumers). It is the single source of truth for the partition.
- The Followers: Do not handle client requests by default (unless specifically configured for nearest-replica reads). Their sole job is to replicate data from the leader sequentially, acting as passive consumers.
In-Sync Replicas (ISR)
An In-Sync Replica (ISR) is a follower replica that is actively keeping up with the leader. A follower is considered "in-sync" if it has caught up with the leader's latest messages within a configurable time window (defined by replica.lag.time.max.ms).
If a follower broker crashes or experiences heavy network latency, it falls behind. Once the lag exceeds the threshold, the leader removes this follower from the ISR pool. If the leader broker fails, only a member of the ISR pool can be elected as the new leader to prevent data loss.
3. High Watermark (HW) and Log End Offset (LEO)
To maintain consistency across replicas, Kafka tracks two critical offsets for every partition:
- Log End Offset (LEO): The offset of the next message to be written to a replica's log. Every replica (leader and follower) maintains its own LEO.
- High Watermark (HW): The highest offset that has been successfully copied to all In-Sync Replicas. Consumers can only read messages up to the High Watermark. This prevents consumers from reading uncommitted data that might be lost if the leader crashes before replication completes.
Leader Replica Log:
[ Offset 0 ] -> [ Offset 1 ] -> [ Offset 2 ] -> [ Offset 3 ] (LEO = 4)
^
|-- High Watermark (HW = 2)
Follower 1 (In ISR):
[ Offset 0 ] -> [ Offset 1 ] -> [ Offset 2 ] (LEO = 3)
Follower 2 (Lagging):
[ Offset 0 ] -> [ Offset 1 ] (LEO = 2)
As shown in the diagram, even though the leader has written up to offset 3, the High Watermark remains at offset 2 because Follower 1 has only replicated up to offset 2, and Follower 2 is lagging further behind. Consumers will not be allowed to read offset 3 until Follower 1 replicates it.
4. Important Broker Configuration Properties
To control how Kafka stores and replicates data, you must configure your brokers using properties in the server.properties file. Below are the most critical configurations:
log.dirs: A comma-separated list of local directories where Kafka stores its log segments. For production, distribute these across multiple physical disks to maximize I/O throughput.log.segment.bytes: The maximum size of a single log segment file. Default is 1,073,741,824 bytes (1 GB).log.cleanup.policy: Determines how old data is discarded. Options aredelete(discards old segments based on time or size) orcompact(keeps only the latest message for each key).min.insync.replicas: The minimum number of in-sync replicas that must acknowledge a write for the write to be considered successful when a producer usesacks=all.replica.lag.time.max.ms: The maximum time a follower can lag behind the leader before being removed from the ISR. Default is 30,000 ms (30 seconds).
5. Common Mistakes and How to Avoid Them
Misconfiguring min.insync.replicas
A common mistake is setting min.insync.replicas equal to the total replication factor of the topic. For example, if your replication factor is 3, and you set min.insync.replicas=3, the failure of a single broker will make the entire partition unavailable for writes (producers using acks=all will receive a NotEnoughReplicasException). A better practice for a replication factor of 3 is to set min.insync.replicas=2, allowing the system to tolerate one broker outage.
Ignoring Disk I/O Bottlenecks
Because Kafka relies heavily on the OS page cache and write-ahead logs, slow disk I/O can severely degrade cluster performance. Avoid using shared network drives (like NFS) or slow magnetic disks for high-throughput production workloads. Utilize high-speed SSDs or multiple dedicated NVMe drives mapped to log.dirs.
Setting Too Small Segment Sizes
Setting log.segment.bytes to a very small value (e.g., a few megabytes) causes Kafka to create thousands of small files. This exhausts file descriptors on the operating system and degrades partition performance due to frequent file rolling and index lookups.
6. Real-World Use Case: Financial Transaction Ledger
In financial systems, data loss is unacceptable. Let's look at how Kafka's log storage and replication internals are configured to guarantee zero data loss for a transaction processing system.
To ensure absolute durability, the system is configured with a replication factor of 3, spanning three different availability zones (AZs). The producer writes transactions with maximum durability guarantees:
// Producer Configuration for Zero Data Loss
Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// Force leader to wait for all ISRs to acknowledge the write
props.put("acks", "all");
// Enable idempotence to prevent duplicate writes during retries
props.put("enable.idempotence", "true");
On the broker side, the topic is configured with min.insync.replicas=2. When a transaction message arrives, the leader broker writes it to its active log segment. It does not send a success acknowledgment back to the producer until at least one other ISR follower replicates the message to its own log segment, advancing the High Watermark. If one AZ goes completely offline, the system continues running smoothly without losing a single transaction.
7. Interview Notes for Technical Candidates
- How does Kafka achieve high write performance despite writing to disk? Kafka uses sequential writes, which are as fast as sequential network access. It also heavily leverages the OS page cache and utilizes the Zero-Copy optimization (via the
sendfilesystem call) to transfer data directly from the page cache to the network socket without copying it into user space. - What is the difference between Log End Offset (LEO) and High Watermark (HW)? LEO is the absolute end of the log where the next message will be written, regardless of whether it is replicated. HW is the offset of the last message that has been successfully replicated to all members of the ISR. Consumers can never read past the HW.
- What happens if a follower replica fails and recovers? Upon recovery, the follower reads its checkpoint file to find its last High Watermark. It truncates its log to its previous HW to remove any uncommitted messages, then begins fetching data from the leader starting at that offset until it catches up and rejoins the ISR.
- What is a Leader Epoch? It is a monotonic sequence number used to prevent log divergence during broker failures. It replaces the older, error-prone method of relying solely on High Watermarks for log truncation when a new leader is elected.
8. Summary
Kafka's remarkable performance and reliability stem directly from its internal architecture. By organizing partitions as sequential log segments, Kafka turns random disk access into fast sequential access. By tracking In-Sync Replicas (ISR), Log End Offsets (LEO), and High Watermarks (HW), Kafka ensures that your data remains highly available and protected against broker failures without sacrificing throughput.
Understanding these broker internals is essential for configuring clusters, troubleshooting lag, and designing robust streaming applications. In the next topics, we will explore how to monitor these metrics to ensure your production clusters run at peak performance.