Published: 2026-06-01 • Updated: 2026-06-02

Kafka Controller, ZooKeeper, and KRaft Mode

In a distributed system like Apache Kafka, coordination is everything. With multiple brokers running across different servers, how does the cluster decide which broker is responsible for a partition? How do brokers know when a new topic is created, or when a peer broker has crashed? This critical coordination is managed by the Kafka Controller.

Historically, Kafka relied heavily on Apache ZooKeeper to manage this cluster state and elect the Controller. However, as modern data streaming demands grew, ZooKeeper became a major scalability bottleneck. This led to the creation of KRaft (Kafka Raft) mode, which removes ZooKeeper entirely and manages metadata directly within Kafka itself. This guide explores how the Controller works, the legacy ZooKeeper architecture, and the modern KRaft consensus protocol.

Understanding the Kafka Controller

In a ZooKeeper-based cluster, the Kafka Controller is an ordinary broker that takes on additional administrative responsibilities; in KRaft mode, the role is held by a node configured with the controller role. Either way, exactly one node acts as the active Controller at any given time.

Key Responsibilities of the Controller

  • Partition Leader Election: When a broker fails, all partition leaders hosted on that broker become unavailable. The Controller detects this and immediately elects new leaders from the remaining In-Sync Replicas (ISR) for those partitions.
  • Metadata Propagation: Whenever a metadata change occurs (such as a new leader election, topic creation, or broker join/leave event), the Controller is responsible for broadcasting these updates to all other brokers in the cluster. This ensures every broker has an identical, up-to-date view of the cluster topology.
  • Topic Management: The Controller orchestrates the creation, deletion, and partition expansion of topics, ensuring that partition replicas are correctly allocated across available brokers.
  • Replica State Machine: It tracks the status of all partition replicas (whether they are online, offline, or transitioning) to maintain high availability.
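
The leader-election responsibility above can be sketched in a few lines. This is a simplified illustration, not Kafka's controller code: the Partition class and handle_broker_failure function are hypothetical names, and the rule "pick the first assigned replica that is still in the ISR" is an assumption chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    """Minimal view of one partition's replication state (illustrative only)."""
    topic: str
    index: int
    replicas: List[int]            # assigned broker ids, in preference order
    isr: List[int]                 # broker ids currently in sync
    leader: Optional[int] = None   # current leader, None if the partition is offline


def handle_broker_failure(partitions: List[Partition], failed_broker: int) -> List[Partition]:
    """Drop the failed broker from every ISR and re-elect leaders it held.

    Returns the partitions whose state changed, i.e. the updates a controller
    would then propagate to the rest of the cluster.
    """
    changed = []
    for p in partitions:
        if failed_broker not in p.isr and p.leader != failed_broker:
            continue  # this partition is unaffected
        p.isr = [b for b in p.isr if b != failed_broker]
        if p.leader == failed_broker:
            # Assumed rule: first assigned replica that is still in sync.
            p.leader = next((b for b in p.replicas if b in p.isr), None)
        changed.append(p)
    return changed


partitions = [
    Partition("orders", 0, replicas=[1, 2, 3], isr=[1, 2, 3], leader=1),
    Partition("orders", 1, replicas=[2, 3, 1], isr=[2, 3], leader=2),
]
changed = handle_broker_failure(partitions, failed_broker=1)
print([(p.topic, p.index, p.leader, p.isr) for p in changed])
```

Only the first partition changes here, because broker 1 was neither the leader nor an in-sync replica of the second. If no in-sync replica remains, the sketch leaves the partition without a leader rather than promoting an out-of-sync replica.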

The Legacy Era: Kafka with ZooKeeper

For over a decade, Apache ZooKeeper served as the external source of truth and coordination engine for Kafka. ZooKeeper is a highly consistent, distributed hierarchical key-value store.

+-------------------------------------------------------------+
|                      ZOOKEEPER CLUSTER                      |
|             (Stores metadata, handles elections)            |
+-------------------------------------------------------------+
       ^                       ^                       ^
       | watches               | writes                | watches
       v                       v                       v
+---------------+       +---------------+       +---------------+
|   Broker 1    |       |   Broker 2    |       |   Broker 3    |
|               |       | (Controller)  |       |               |
+---------------+       +---------------+       +---------------+

In a ZooKeeper-based cluster, Kafka brokers do not communicate directly with each other to elect a Controller. Instead, they interact with ZooKeeper:

  • Broker Registration: When a broker starts up, it registers itself by creating an ephemeral node (znode) under /brokers/ids in ZooKeeper. If the broker crashes, its session times out, and ZooKeeper automatically deletes this ephemeral node.
  • Controller Election: On startup, every broker attempts to create an ephemeral znode at /controller. Only one create can succeed, and the broker that wins becomes the active Controller. The other brokers set a watch on this path. If the active Controller crashes, its ephemeral node is deleted, the watch fires, and the remaining brokers race to create the /controller node again. The winner of this race becomes the new Controller.
  • State Storage: All definitive cluster metadata—including topic configurations, partition leader locations, and the current ISR list—is stored inside ZooKeeper.
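
The registration and election steps above rely on two ZooKeeper behaviours: create-if-absent, and automatic deletion of ephemeral nodes when a session ends. The sketch below imitates both with an in-memory stand-in so it runs without a ZooKeeper server. FakeZooKeeper and elect_controller are hypothetical names for illustration, not part of any ZooKeeper or Kafka API, and watches are reduced to "call the election again".

```python
class FakeZooKeeper:
    """In-memory stand-in for ephemeral znodes (illustrative only)."""

    def __init__(self):
        self._nodes = {}  # path -> (data, owning session id)

    def create_ephemeral(self, path, data, session_id):
        """Create the node only if it does not exist; return True on success."""
        if path in self._nodes:
            return False
        self._nodes[path] = (data, session_id)
        return True

    def expire_session(self, session_id):
        """Delete every ephemeral node owned by the session; return the deleted paths."""
        deleted = [p for p, (_, owner) in self._nodes.items() if owner == session_id]
        for p in deleted:
            del self._nodes[p]
        return deleted

    def get(self, path):
        entry = self._nodes.get(path)
        return entry[0] if entry else None


def elect_controller(zk, broker_ids):
    """Every broker races to create /controller; exactly one create succeeds."""
    for broker_id in broker_ids:
        zk.create_ephemeral("/controller", broker_id, session_id=broker_id)
    return zk.get("/controller")


zk = FakeZooKeeper()
for b in (1, 2, 3):
    zk.create_ephemeral(f"/brokers/ids/{b}", b, session_id=b)

first = elect_controller(zk, [2, 1, 3])   # broker 2 happens to arrive first
zk.expire_session(2)                      # broker 2 crashes, its znodes vanish
second = elect_controller(zk, [3, 1])     # survivors race again
print(first, second)
```

The order of the broker list stands in for whichever request reaches ZooKeeper first; in a real cluster that order is not deterministic.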

Why Move Away from ZooKeeper? (The Bottlenecks)

While ZooKeeper served Kafka well for years, it introduced several severe architectural limitations as cluster scales increased:

  • Metadata Duplication and Desynchronization: Both ZooKeeper and the active Controller maintained copies of the cluster metadata. The Controller had to constantly synchronize its memory state with ZooKeeper and then propagate those changes to the other brokers. This double-handling of metadata introduced lag and potential desynchronization bugs.
  • The Partition Limit Bottleneck: ZooKeeper is not designed for heavy write throughput, and every partition adds metadata that must be written to and read from it. In practice this capped ZooKeeper-based clusters at roughly 200,000 partitions; beyond that point, metadata operations suffered severe latency spikes.
  • Slow Failover Times: If the active Controller broker crashed in a large cluster with hundreds of thousands of partitions, the newly elected Controller had to read the entire cluster state from ZooKeeper into its local memory before it could start processing requests. This recovery process could take several minutes, during which the cluster could not process metadata changes or handle new broker failures.
  • Operational Complexity: Administrators had to deploy, configure, secure, monitor, and scale two entirely different distributed systems (Kafka and ZooKeeper), each with its own configuration syntax, port allocations, and security models.

The Modern Era: KRaft Mode (Kafka Raft)

Introduced in KIP-500, KRaft (Kafka Raft metadata mode) replaces ZooKeeper entirely. Instead of relying on an external coordinator, Kafka now manages its own metadata using a built-in consensus protocol based on the Raft algorithm.

+-------------------------------------------------------------+
|                      METADATA QUORUM                        |
|             (Active & Standby KRaft Controllers)            |
+-------------------------------------------------------------+
       |                       |                       |
       | replicates log        | replicates log        | replicates log
       v                       v                       v
+---------------+       +---------------+       +---------------+
|   Broker 1    |       |   Broker 2    |       |   Broker 3    |
|               |       |               |       |               |
+---------------+       +---------------+       +---------------+

In KRaft mode, a small set of nodes is configured with the controller role. Together, these controllers form the Metadata Quorum. One of them is elected as the Active Controller (the Raft leader), while the others act as hot standbys.

How KRaft Works

  • The Metadata Log: Instead of storing cluster state in an external tree structure like ZooKeeper, KRaft stores metadata in a special, internal, single-partition Kafka topic named __cluster_metadata.
  • Active Controller writes, Standbys replicate: The Active Controller receives metadata change requests (e.g., "create topic"), writes them to the __cluster_metadata log, and replicates them to the standby controllers using the Raft consensus protocol.
  • Brokers consume the Log: All standard Kafka brokers consume this metadata log directly. They read the log sequentially and apply the changes to their own local memory. This means every broker in the cluster is continuously kept up-to-date with metadata changes in near real-time.
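
The "consume the log and apply it to local memory" step can be illustrated with a small replay loop. This is a sketch under stated assumptions: the record types (create_topic, set_leader, delete_topic) and the MetadataCache class are invented for the example and do not match the actual record schema of __cluster_metadata.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataCache:
    """A node's local view, rebuilt purely by replaying log records in order."""
    topics: dict = field(default_factory=dict)  # topic -> {partition: leader}
    last_offset: int = -1                       # highest offset applied so far

    def apply(self, offset, record):
        if offset != self.last_offset + 1:
            raise ValueError(f"gap in log: expected {self.last_offset + 1}, got {offset}")
        kind = record["type"]
        if kind == "create_topic":
            self.topics[record["topic"]] = {}
        elif kind == "set_leader":
            self.topics[record["topic"]][record["partition"]] = record["leader"]
        elif kind == "delete_topic":
            self.topics.pop(record["topic"], None)
        else:
            raise ValueError(f"unknown record type: {kind}")
        self.last_offset = offset

    def catch_up(self, log):
        """Apply every record this cache has not seen yet."""
        for offset in range(self.last_offset + 1, len(log)):
            self.apply(offset, log[offset])


# Hypothetical metadata log: a plain list whose index plays the role of the offset.
log = [
    {"type": "create_topic", "topic": "orders"},
    {"type": "set_leader", "topic": "orders", "partition": 0, "leader": 1},
    {"type": "set_leader", "topic": "orders", "partition": 0, "leader": 2},
]

broker_a, broker_b = MetadataCache(), MetadataCache()
broker_a.catch_up(log)                                      # broker A is up to date
log.append({"type": "create_topic", "topic": "payments"})   # controller appends a change
broker_a.catch_up(log)                                      # A applies only the new record
broker_b.catch_up(log)                                      # B, starting late, replays everything
print(broker_a.topics == broker_b.topics)
```

Because every node applies the same records in the same order, two caches that have reached the same offset hold the same state, regardless of when each one started reading.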

ZooKeeper vs. KRaft: A Detailed Comparison

Understanding the architectural differences between ZooKeeper and KRaft is essential for modern Kafka administration and design:

  • Consensus Engine: ZooKeeper uses Zab (ZooKeeper Atomic Broadcast) running on an external cluster. KRaft uses Raft consensus running natively inside the Kafka process.
  • Metadata Storage: ZooKeeper stores metadata in external znodes. KRaft stores metadata in an internal, replicated log file (__cluster_metadata) on disk.
  • Scalability: ZooKeeper-based clusters scale poorly beyond roughly 200,000 partitions. KRaft is designed to support millions of partitions per cluster because metadata changes are appended to a sequential log rather than written as individual znode updates.
  • Controller Failover Time: Under ZooKeeper, failover can take minutes in large clusters because the new controller must first load the full state from ZooKeeper. Under KRaft, failover is typically much faster, often well under a second, because standby controllers have already replicated the metadata log and hold the current state in memory.
  • Operational Footprint: ZooKeeper requires managing two separate systems (JVMs, configurations, security). KRaft requires only one system (Kafka), simplifying deployments, especially in Kubernetes environments.

Real-World Use Cases

The transition from ZooKeeper to KRaft has unlocked powerful new capabilities in production environments:

  • Massive Multi-Tenant Clusters: Enterprises running large-scale SaaS platforms can now consolidate multiple smaller Kafka clusters into a single, massive KRaft cluster containing millions of partitions, drastically reducing infrastructure costs.
  • Kubernetes and Cloud-Native Deployments: Deploying Kafka on Kubernetes using operators (like Strimzi) is significantly simpler with KRaft. Removing ZooKeeper eliminates complex pod-disruption budgets, storage claims, and security configurations associated with running a separate stateful coordination layer.
  • Edge Computing and IoT: For resource-constrained edge devices, running both Kafka and ZooKeeper was often resource-prohibitive. KRaft's single-process model allows lightweight Kafka clusters to run efficiently on edge gateways with minimal CPU and memory overhead.

Common Mistakes and How to Avoid Them

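  • Leaving ZooKeeper properties in a KRaft configuration: A common mistake when moving to KRaft is leaving legacy settings such as zookeeper.connect in the properties file next to KRaft settings like process.roles and controller.quorum.voters. How a node reacts depends on the Kafka version and on whether a ZooKeeper-to-KRaft migration is still in progress, so do not rely on a startup failure to catch the leftover. Once the migration is complete, remove every zookeeper.* property so the configuration reflects how the cluster actually runs.
  • Improper Quorum Sizing: Like ZooKeeper, the KRaft controller quorum needs a majority of its voters to be available. Deploy an odd number of controllers (typically 3 or 5): a quorum of n voters tolerates floor((n - 1) / 2) failures, so an even count adds a node without adding fault tolerance. A 2-controller setup needs both controllers for a majority and therefore cannot tolerate any controller failure.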
  • Co-locating Controller and Broker Roles in High-Traffic Production: While KRaft allows a single JVM process to act as both a broker and a controller (shared role), this is highly discouraged in production. High data traffic on the broker can starve the controller thread of CPU or memory, leading to metadata timeouts and cluster instability. Keep controller nodes dedicated in production.
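
The three pitfalls above lend themselves to a simple pre-deployment check. The sketch below is a hypothetical lint helper, not a Kafka tool: lint_kraft_properties and tolerated_failures are made-up names, the host names in the sample are placeholders, and the check assumes the cluster is not in the middle of a ZooKeeper-to-KRaft migration. Only the property names (process.roles, controller.quorum.voters, zookeeper.connect, node.id) come from Kafka itself.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def tolerated_failures(voters):
    """A majority quorum of n voters survives floor((n - 1) / 2) failures."""
    return (voters - 1) // 2


def lint_kraft_properties(text):
    """Return human-readable warnings for the three mistakes listed above."""
    props = parse_properties(text)
    warnings = []
    roles = {r.strip() for r in props.get("process.roles", "").split(",") if r.strip()}

    # 1. Legacy ZooKeeper settings next to KRaft settings.
    if roles:
        leftovers = sorted(k for k in props if k.startswith("zookeeper."))
        if leftovers:
            warnings.append("ZooKeeper settings present in KRaft mode: " + ", ".join(leftovers))

    # 2. Even-sized controller quorum.
    voters = [v for v in props.get("controller.quorum.voters", "").split(",") if v.strip()]
    if voters and len(voters) % 2 == 0:
        warnings.append(
            f"{len(voters)} voters tolerate only {tolerated_failures(len(voters))} "
            "failure(s); use an odd count"
        )

    # 3. Combined broker and controller role in one process.
    if {"broker", "controller"} <= roles:
        warnings.append("combined broker,controller role; prefer dedicated controllers in production")
    return warnings


sample = """
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093
zookeeper.connect=zk1:2181
"""
for warning in lint_kraft_properties(sample):
    print(warning)
print([tolerated_failures(n) for n in (1, 2, 3, 4, 5)])  # [0, 0, 1, 1, 2]
```

The last line shows why even counts are wasteful: going from 3 to 4 controllers, or from 1 to 2, does not increase the number of failures the quorum can survive.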

Interview Notes

  • What is the primary role of the Kafka Controller? The Controller is a broker responsible for managing partition states, electing partition leaders when brokers fail, and broadcasting metadata changes to all other brokers in the cluster.
  • Why did the Kafka community replace ZooKeeper with KRaft? To eliminate metadata duplication, solve the 200,000 partition scalability limit, reduce controller failover times from minutes to milliseconds, and simplify cluster administration.
  • How does a standby KRaft controller become the active controller during a failover? Standby controllers replicate the metadata log by continuously fetching from the active controller. If a standby cannot fetch successfully within the configured timeout, it becomes a candidate and starts a Raft election. The first candidate to secure votes from a majority of the quorum becomes the new active controller.
  • What is the __cluster_metadata topic? It is an internal, metadata-only log where the active KRaft controller writes all cluster state changes. All brokers and standby controllers read this log to maintain an up-to-date local cache of the cluster's metadata.
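
For the failover question above, it can help to see the majority-vote rule in code. The following is a deliberately simplified, Raft-style sketch rather than KRaft's implementation: Voter and run_election are hypothetical names, there is no networking or timing, and log freshness is compared by offset alone, whereas real Raft also compares the epoch of the last log entry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Voter:
    node_id: int
    log_end_offset: int               # how much of the metadata log this node holds
    epoch: int = 0                    # highest epoch this node has seen
    voted_for: Optional[int] = None   # who this node voted for in the current epoch
    alive: bool = True

    def handle_vote_request(self, candidate_id, candidate_epoch, candidate_log_end):
        """Grant at most one vote per epoch, and only to an up-to-date candidate."""
        if not self.alive or candidate_epoch < self.epoch:
            return False
        if candidate_epoch > self.epoch:
            # A newer epoch resets this node's vote.
            self.epoch, self.voted_for = candidate_epoch, None
        if self.voted_for not in (None, candidate_id):
            return False
        if candidate_log_end < self.log_end_offset:
            return False  # candidate is missing records this voter already has
        self.voted_for = candidate_id
        return True


def run_election(candidate, voters):
    """The candidate bumps its epoch, votes for itself, and needs a strict majority."""
    candidate.epoch += 1
    candidate.voted_for = candidate.node_id
    votes = 1
    for v in voters:
        if v.node_id != candidate.node_id and v.handle_vote_request(
            candidate.node_id, candidate.epoch, candidate.log_end_offset
        ):
            votes += 1
    return votes > len(voters) // 2


old_leader = Voter(1, log_end_offset=100, alive=False)  # the crashed active controller
node2 = Voter(2, log_end_offset=100)
node3 = Voter(3, log_end_offset=90)                     # lagging standby
quorum = [old_leader, node2, node3]

print(run_election(node3, quorum))  # False: node 2 refuses a candidate with a shorter log
print(run_election(node2, quorum))  # True: node 3 grants its vote, 2 of 3 is a majority
```

The lagging standby loses because the only other live voter holds records it does not, which is how the voting rule keeps committed metadata from being lost during failover.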

Summary

The Kafka Controller acts as the operational brain of a Kafka cluster, orchestrating partition leadership and maintaining cluster-wide consistency. The transition from the legacy, external ZooKeeper-based architecture to the modern, internal KRaft metadata mode represents one of the most significant evolutions in Kafka's history. By managing consensus natively through a replicated metadata log, KRaft enables Kafka to scale to millions of partitions, recover from failures in milliseconds, and operate with a drastically simplified infrastructure footprint.

About the Author

Naresh Kumar

Senior Java Backend Engineer experienced in Banking, Payments, ISO 20022, Spring Boot, Microservices, Kafka, Docker, Kubernetes, AWS and Cloud Native Systems.

Built enterprise payment solutions, transaction processing systems, API platforms and scalable microservices used in production.
