Monitoring and Alerting in Apache Kafka Clusters

Running Apache Kafka in production requires deep visibility into the cluster's health, performance, and resource utilization. Because Kafka is a distributed, high-throughput system, failures in one node can cascade across the cluster if not detected early. Monitoring and alerting ensure high availability, prevent data loss, and maintain low latency for both producers and consumers.

In this guide, we will explore the essential metrics you need to track, how to collect them using Java Management Extensions (JMX), how to build a modern monitoring stack, and how to set up actionable alerts for your production Kafka clusters.

Why Monitoring Kafka is Different

Unlike traditional databases or monolithic applications, Kafka is a distributed commit log. Monitoring Kafka is not just about checking whether the process is running. You must monitor the coordination between brokers, the replication of data, disk I/O limits, and the progress of independent consumer groups. A broker might be running perfectly, but if consumers are falling behind, your entire real-time pipeline is compromised.

Key Kafka Metrics to Monitor

Kafka exposes hundreds of metrics via JMX. To avoid alert fatigue, focus on the most critical ones, grouped here into four areas: infrastructure and host-level metrics, broker-level metrics, producer and consumer metrics, and JVM metrics.

1. Infrastructure and Host-Level Metrics

Disk Space Utilization: Kafka persists all messages to disk. If a broker runs out of disk space, writes to its log directory fail and the broker shuts down. Alert when disk usage exceeds 80%.
CPU Usage: High CPU usage can degrade network throughput and increase request latencies.
Network I/O: Kafka is highly dependent on network performance. Monitor network bandwidth to ensure replication and client traffic do not saturate the network interface cards.
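
The 80% disk threshold above can be checked locally with a few lines of Python. This is a minimal sketch using only the standard library. The function names are hypothetical, and "." merely stands in for the Kafka log directory, which is site-specific. In production the same signal normally comes from a host metrics agent rather than a script.

```python
import shutil


def disk_used_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alert(used_fraction: float, threshold: float = 0.80) -> bool:
    """True when usage is strictly above the alert threshold (80% by default)."""
    return used_fraction > threshold


# "." is a placeholder for the broker's log directory.
fraction = disk_used_fraction(".")
print(f"used={fraction:.1%} alert={disk_alert(fraction)}")
```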

2. Broker-Level Metrics (Cluster Health)

Under-Replicated Partitions (URPs): This is the single most important metric for Kafka cluster health. It indicates the number of partitions where one or more follower replicas are not in sync with the leader. In a healthy cluster this value is 0; a brief non-zero reading during a broker restart is expected, but a sustained one is not.
Offline Partitions Count: The number of partitions without an active leader. This means the partition is completely offline and read/write operations on it will fail. This value must always be 0.
Active Controller Count: In any Kafka cluster, exactly one node must act as the active controller. Each node reports 1 if it is the active controller and 0 otherwise, so the sum across the cluster must be exactly 1. A sum of 0, or greater than 1, indicates a cluster coordination issue.
Leader Election Rate: High rates of leader elections indicate broker instability or frequent network partition issues.
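
The broker-level invariants above can be expressed as a small check over one sample per broker. The sketch below assumes the gauge values have already been collected. The dictionary keys and broker names are hypothetical labels chosen for this example, not Kafka identifiers.

```python
from typing import Dict, List


def cluster_health_issues(brokers: Dict[str, Dict[str, int]]) -> List[str]:
    """Return a list of violated invariants (empty when the cluster looks healthy).

    `brokers` maps a broker name to its latest gauge values.
    """
    issues = []

    # Exactly one node in the cluster should report itself as active controller.
    controllers = sum(b["active_controller_count"] for b in brokers.values())
    if controllers != 1:
        issues.append(f"expected exactly 1 active controller, found {controllers}")

    # Any under-replicated partition reduces fault tolerance.
    urp = sum(b["under_replicated_partitions"] for b in brokers.values())
    if urp > 0:
        issues.append(f"{urp} under-replicated partitions")

    # Any offline partition means reads and writes on it fail.
    offline = sum(b["offline_partitions_count"] for b in brokers.values())
    if offline > 0:
        issues.append(f"{offline} offline partitions")

    return issues


healthy = {
    "broker-1": {"active_controller_count": 1, "under_replicated_partitions": 0, "offline_partitions_count": 0},
    "broker-2": {"active_controller_count": 0, "under_replicated_partitions": 0, "offline_partitions_count": 0},
    "broker-3": {"active_controller_count": 0, "under_replicated_partitions": 0, "offline_partitions_count": 0},
}
print(cluster_health_issues(healthy))
```

Summing across brokers mirrors how the controller count is meant to be read: no single broker's value is meaningful on its own.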

3. Producer and Consumer Metrics

Consumer Lag: The difference between the latest offset written to a partition (the log-end offset) and the offset the consumer group has committed for that partition. High or steadily growing lag indicates that consumers cannot keep up with the incoming data rate.
Request Latency: The time taken to process produce or fetch requests. High latency directly impacts application performance.
Failed Produce Requests: The rate of failed write operations from producers, which could indicate authentication, authorization, or network issues.
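
Consumer lag as described above reduces to a per-partition subtraction. The following sketch assumes the offsets have already been fetched. The function name and the sample offsets are made up for illustration, and treating a partition with no committed offset as fully lagging is a simplification.

```python
from typing import Dict


def partition_lag(log_end_offsets: Dict[int, int],
                  committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition: log-end offset minus the group's committed offset.

    Simplification: a partition with no committed offset is counted as
    lagging by its whole log-end offset. Negative values are clamped to 0.
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }


# Hypothetical offsets for a three-partition topic.
log_end = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1400, 1: 1200}

lag = partition_lag(log_end, committed)
total_lag = sum(lag.values())
print(lag, total_lag)
```

Tracking the per-partition values as well as the total helps distinguish one stuck partition from a consumer group that is slow overall.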

4. JVM Metrics

Garbage Collection (GC) Pause Time: Long JVM GC pauses (stop-the-world pauses) can cause a broker to miss its session timeout with ZooKeeper or the KRaft controllers, triggering unnecessary leader elections.
JVM Heap Memory: Ensure Kafka has sufficient heap space (typically 4GB to 6GB is recommended, leaving the rest of the system RAM for the OS page cache).

Kafka Monitoring Architecture

A standard production-grade monitoring pipeline collects JMX metrics from Kafka brokers, stores them in a time-series database, and visualizes them on dashboards. Below is the typical flow of monitoring data:

+-------------------------------------------------------------+
|                     Kafka Cluster                           |
|  [Broker 1 (JMX)]     [Broker 2 (JMX)]     [Broker 3 (JMX)] |
+-------------------------------------------------------------+
         |                      |                     |
         v                      v                     v
+-------------------------------------------------------------+
|                  Prometheus JMX Exporter                    |
|             (Exposes JMX metrics as HTTP endpoints)         |
+-------------------------------------------------------------+
                               |
                               v
+-------------------------------------------------------------+
|                     Prometheus Server                       |
|                 (Scrapes and stores metrics)                |
+-------------------------------------------------------------+
            |                                     |
            v                                     v
+-----------------------+               +---------------------+
|   Grafana Dashboard   |               |    Alertmanager     |
| (Visualizes Metrics)  |               | (Slack, PagerDuty)  |
+-----------------------+               +---------------------+

Setting Up JMX Exporter for Kafka

To expose Kafka JMX metrics to Prometheus, you use the Prometheus JMX Exporter. It runs as a Java agent inside the Kafka broker's JVM and serves the metrics over HTTP.

First, create a configuration file named kafka-jmx-config.yml to filter and format the raw JMX metrics. Below is a basic configuration example:

lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  - pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
    name: kafka_server_replicamanager_underreplicatedpartitions
  - pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
    name: kafka_controller_kafkacontroller_activecontrollercount
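  # OfflinePartitionsCount is reported by the KafkaController MBean, not ReplicaManager.
  # The exported name is arbitrary; it must match the name used in the alert rules.
  - pattern: 'kafka.controller<type=KafkaController, name=OfflinePartitionsCount><>Value'
    name: kafka_server_replicamanager_offlinepartitionscount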

Next, start your Kafka broker by passing the JMX exporter agent in the KAFKA_OPTS environment variable:

export KAFKA_OPTS="-javaagent:/path/to/jmx_prometheus_javaagent-0.18.0.jar=7071:/path/to/kafka-jmx-config.yml"
bin/kafka-server-start.sh config/server.properties

With this setup, Prometheus can scrape metrics from port 7071 on each broker.
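
Before pointing Prometheus at a broker, it can help to confirm that the endpoint returns the expected series. The sketch below parses the Prometheus text format with the standard library only. The function names, the localhost URL with the /metrics path, and the sample text are assumptions for illustration, and the parser is deliberately simplified (it expects no trailing timestamps).

```python
from typing import Dict
from urllib.request import urlopen


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition lines into {series: value}.

    Simplification: assumes each sample line ends with its value (no
    trailing timestamp) and keeps any {label="..."} part in the key.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(url: str = "http://localhost:7071/metrics") -> Dict[str, float]:
    """Scrape one broker's exporter endpoint; the URL is an assumption."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Illustrative sample, not captured from a real broker.
SAMPLE = """\
# TYPE kafka_server_replicamanager_underreplicatedpartitions untyped
kafka_server_replicamanager_underreplicatedpartitions 0.0
# TYPE kafka_controller_kafkacontroller_activecontrollercount untyped
kafka_controller_kafkacontroller_activecontrollercount 1.0
"""

metrics = parse_metrics(SAMPLE)
print(metrics["kafka_server_replicamanager_underreplicatedpartitions"])
```

Against a running broker, calling fetch_metrics() in place of parse_metrics(SAMPLE) would return the live values, provided the agent is listening on that port.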

Alerting Rules and Thresholds

Setting up alerts is critical to catching problems before they affect end users. Here are standard alerting rules to define in Prometheus (which evaluates them and hands firing alerts to Alertmanager for routing) or in your alerting tool of choice:

Critical Alert: Under-Replicated Partitions
Condition: kafka_server_replicamanager_underreplicatedpartitions > 0
Duration: 5 minutes
Action: Page the on-call engineer immediately. This indicates a broker might be down or experiencing severe network/disk issues.

Critical Alert: Offline Partitions
Condition: kafka_server_replicamanager_offlinepartitionscount > 0
Duration: Immediate (0 minutes)
Action: Page the on-call engineer. Data in offline partitions is currently inaccessible to clients.

Warning Alert: Consumer Lag
Condition: kafka_consumergroup_lag > 50000
Duration: 10 minutes
Action: Notify the application team. A downstream consumer service is struggling to process messages fast enough. Note that this metric is not a broker JMX metric; it is typically produced by a separate lag exporter such as kafka_exporter, and the threshold should be tuned to the topic's throughput.

Critical Alert: Disk Space Low
Condition: node_filesystem_free_bytes{mountpoint="/var/lib/kafka"} / node_filesystem_size_bytes{mountpoint="/var/lib/kafka"} < 0.20
Duration: 15 minutes
Action: Reduce log retention or expand disk storage before the broker shuts down. A free-space ratio below 0.20 corresponds to the 80% usage threshold given earlier.
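
The Duration field in these rules means the condition must hold continuously before the alert fires. The sketch below is a simplified model of that behavior, not Prometheus's actual implementation. The function name and the sample series are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def first_firing_time(samples: Iterable[Tuple[float, float]],
                      threshold: float,
                      for_seconds: float) -> Optional[float]:
    """Return the timestamp at which the alert would fire, or None.

    `samples` is a time-ordered sequence of (timestamp_seconds, value).
    The alert fires once `value > threshold` has held without interruption
    for at least `for_seconds`; a sample at or below the threshold resets
    the pending timer.
    """
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            if timestamp - pending_since >= for_seconds:
                return timestamp
        else:
            pending_since = None  # condition cleared, start over
    return None


# Hypothetical URP readings taken every 60 seconds.
urp_series = [(0, 0), (60, 2), (120, 0), (180, 1), (240, 1),
              (300, 1), (360, 1), (420, 1), (480, 1)]
print(first_firing_time(urp_series, threshold=0, for_seconds=300))
```

In this series the blip at 60 seconds never fires because the condition clears at 120 seconds; only the sustained run starting at 180 seconds reaches the five-minute window.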

Real-World Use Cases

Use Case 1: E-Commerce Flash Sale Scale-Out

During a flash sale, message production rates spike by 10x. By monitoring the Consumer Lag metric, the platform's orchestration system (like Kubernetes) can automatically spin up additional instances of consumer microservices to handle the load, preventing delays in processing order confirmations and payments. Scaling out only helps up to the topic's partition count, because each partition is consumed by at most one member of a consumer group.
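
A lag-driven scaling decision of this kind can be sketched as a small function. The function name, the parameter names, and the lag-per-consumer target are hypothetical; a real autoscaler would also smooth the input and limit how quickly it scales. The result is capped at the partition count because consumers beyond that number in a group would sit idle.

```python
import math


def desired_consumers(total_lag: int,
                      lag_per_consumer: int,
                      partitions: int,
                      min_consumers: int = 1) -> int:
    """Suggest a consumer count from the current total lag.

    One consumer is requested per `lag_per_consumer` messages of backlog,
    bounded below by `min_consumers` and above by `partitions`.
    """
    if lag_per_consumer <= 0:
        raise ValueError("lag_per_consumer must be positive")
    wanted = math.ceil(total_lag / lag_per_consumer)
    return max(min_consumers, min(wanted, partitions))


# Hypothetical numbers: a 24-partition topic, 10,000 messages of lag per consumer.
print(desired_consumers(total_lag=120_000, lag_per_consumer=10_000, partitions=24))
```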

Use Case 2: Proactive Disk Failure Detection

In a large financial transaction pipeline, a single broker's disk controller begins to fail, causing write latencies to spike. Monitoring the Request Latency and Under-Replicated Partitions metrics allows operations teams to gracefully decommission the failing broker, swap the hardware, and rejoin it to the cluster without dropping a single transaction.

Common Monitoring Mistakes to Avoid

Ignoring Consumer Lag: Only monitoring broker metrics is a common pitfall. If your brokers are healthy but your consumer lag is growing indefinitely, your real-time system is failing to meet its business objectives.
Alert Fatigue: Setting up alerts for normal spikes (like a brief network hiccup causing a temporary under-replicated partition for 10 seconds) leads to alert fatigue. Always use duration windows (e.g., for: 5m) before triggering alerts.
Not Testing Alerts: Teams often find out their alerting system is misconfigured during an actual outage. Regularly simulate broker failures in non-production environments to verify that alerts reach your communication channels (Slack, email, or PagerDuty).

Interview Notes for Developers and DevOps Engineers

Question: What is an Under-Replicated Partition (URP), and why is it dangerous?
Answer: A URP is a partition where one or more follower replicas have fallen out of sync with the leader. It is dangerous because it reduces the fault tolerance of that partition. If the leader broker fails while the partition is under-replicated, you risk data loss or partition unavailability.

Question: How do you monitor consumer lag if you do not have access to the consumer application code?
Answer: Consumer lag can be monitored externally using tools like LinkedIn's Burrow, or by querying the group coordinator for each consumer group's committed offsets (as the kafka-consumer-groups.sh tool does) and comparing them against the log-end offsets.

Question: What JVM setting is most critical for Kafka broker stability?
Answer: Garbage Collection settings. Long GC pauses can cause a broker to lose its heartbeat connection to ZooKeeper or the KRaft controller, causing the cluster to assume the broker is dead and trigger unnecessary leader elections.

Summary

Monitoring and alerting are foundational pillars of running reliable Apache Kafka clusters. By focusing on critical metrics like Under-Replicated Partitions, Offline Partitions, Disk Space, and Consumer Lag, you can catch infrastructure issues before they turn into severe outages. Standardizing on Prometheus, JMX Exporter, and Grafana provides a robust, scalable observability stack that keeps your event-driven systems running smoothly around the clock.

About the Author

Naresh Kumar