Published: 2026-06-01 โ€ข Updated: 2026-06-02

Real-Time Stream Processing with Kafka Streams API

In modern data architectures, processing data as it arrives is a critical requirement. Traditional batch processing systems introduce latency because they process data in chunks. The Apache Kafka Streams API solves this problem by allowing developers to build real-time, highly scalable, and fault-tolerant stream processing applications directly on top of Kafka. This guide will take you from the core concepts of stream processing to building your first stateful Kafka Streams application in Java.

What is the Kafka Streams API?

The Kafka Streams API is a lightweight, client-side library for building applications and microservices where the input and output data are stored in Kafka topics. Unlike other stream processing frameworks like Apache Spark Streaming or Apache Flink, Kafka Streams does not require a separate, dedicated execution cluster. It is a standard Java library that runs inside your JVM, making it incredibly easy to integrate with existing deployment pipelines, containerization tools, and cloud environments.

Key Features of Kafka Streams

  • No Extra Infrastructure: It runs as a library within your Java application. You do not need to set up or manage a separate processing cluster.
  • Elasticity and Scalability: It dynamically scales processing based on the partitions in your Kafka topics. You can run multiple instances of your application, and Kafka will automatically distribute the workload.
  • Fault Tolerance: It uses Kafka's native replication and consumer group protocols to handle failures gracefully. If an instance dies, another instance takes over its tasks.
  • Exactly-Once Processing: It supports exactly-once processing guarantees, ensuring that each record is processed precisely once, even in the event of system failures.

The Core Architecture: Stream Processing Topology

A stream processing application built with Kafka Streams defines its computational logic using a processor topology. A topology is a directed acyclic graph (DAG) of stream processors (nodes) connected by streams (edges).

+-------------------------------------------------------------+
|                  Stream Processing Topology                 |
|                                                             |
|   [ Kafka Input Topic ]                                     |
|             |                                               |
|             v                                               |
|     ( Source Processor )                                    |
|             |                                               |
|             v                                               |
|    ( Stream Processor ) <---> [ Local State Store (RocksDB) ]|
|             |                                               |
|             v                                               |
|      ( Sink Processor )                                     |
|             |                                               |
|             v                                               |
|   [ Kafka Output Topic ]                                    |
+-------------------------------------------------------------+
    

The topology consists of three primary components:

  • Source Processor: A node that consumes data from one or more Kafka topics and forwards them to downstream processors.
  • Stream Processor: A node that receives input records, applies transformation logic (such as filtering, mapping, or aggregating), and optionally produces new records.
  • Sink Processor: A node that takes processed records from upstream nodes and writes them to a target Kafka topic.

Stream-Table Duality: KStream vs. KTable

One of the most powerful concepts in Kafka Streams is the duality between streams and tables. Understanding when to use a KStream versus a KTable is fundamental to designing correct streaming applications.

KStream (Record Stream)

A KStream represents an insert-only stream of data. Every incoming record is treated as an independent event. For example, in a stream of credit card transactions, each transaction is a unique historical event. If two transactions have the same card ID, they do not overwrite each other; both are preserved in the stream.

KTable (Changelog Stream)

A KTable represents an update-only stream of data, functioning like a database table. Each record is treated as an upsert (insert or update). If a record with an existing key arrives, the new value overwrites the old value. For example, in a stream of user profile updates, only the latest location or email address for a user ID matters.

GlobalKTable

Similar to a KTable, a GlobalKTable populates a local cache with data from a topic. However, while a standard KTable partitions data across application instances, a GlobalKTable populates the entire dataset on every single running instance. This is highly useful for joining a high-volume stream with a low-volume lookup table (such as joining transactions with user details).

Step-by-Step Java Implementation

Let us build a real-time stream processing application in Java. This application will read transaction events from an input topic, filter out transactions below $100 (stateless transformation), and write the high-value transactions to an output topic.

Step 1: Define Maven Dependencies

To use Kafka Streams, you must include the Kafka Streams client library in your Java project.

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>3.6.0</version>
</dependency>
    

Step 2: Writing the Application Code

Below is the complete, self-contained Java class that configures and starts the stream processing topology.

package com.example.streams;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class TransactionFilterApp {
    public static void main(String[] args) {
        // 1. Define configuration properties
        Properties config = new Properties();
        config.put(StreamsConfig.APPLICATION_ID_CONFIG, "transaction-filter-service");
        config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Double().getClass().getName());

        // 2. Build the stream processing topology
        StreamsBuilder builder = new StreamsBuilder();
        
        // Consume from input topic "raw-transactions"
        KStream<String, Double> rawTransactions = builder.stream("raw-transactions");

        // Filter transactions greater than $100.00
        KStream<String, Double> highValueTransactions = rawTransactions.filter(
            (key, value) -> value != null && value > 100.00
        );

        // Write the filtered stream to output topic "high-value-transactions"
        highValueTransactions.to("high-value-transactions");

        // 3. Create the Topology object
        Topology topology = builder.build();

        // 4. Initialize and start the Kafka Streams client
        KafkaStreams streams = new KafkaStreams(topology, config);
        
        // Add a shutdown hook to close the stream gracefully
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));

        System.out.println("Starting Kafka Streams Application...");
        streams.start();
    }
}
    

Stateful Processing and Windowing

While filtering and mapping are stateless (each record is processed independently), many real-world use cases require stateful operations. Stateful processing allows you to aggregate, join, or window events over time.

When you perform stateful operations, Kafka Streams uses an embedded database called RocksDB to store state locally on the disk of your application instance. This ensures lightning-fast read and write access. To guarantee fault tolerance, local state changes are continuously backed up to a highly durable, internal Kafka topic called a changelog topic.

Windowing Concepts

Windowing allows you to group stateful operations by time. Kafka Streams supports several types of windows:

  • Tumbling Windows: Fixed-size, non-overlapping, contiguous time intervals. For example, a 5-minute tumbling window will group events from 12:00 to 12:05, then 12:05 to 12:10.
  • Hopping Windows: Fixed-size, overlapping time intervals. For example, a 5-minute window that advances (hops) every 1 minute.
  • Session Windows: Dynamically sized windows based on periods of inactivity. A session window closes when no new events arrive within a specified inactivity gap.

Real-World Use Cases

  • Real-Time Fraud Detection: Credit card companies use Kafka Streams to analyze incoming transaction flows. By joining transaction streams with user profile tables, they can flag transactions occurring in different countries within minutes of each other.
  • IoT Sensor Monitoring: Industrial plants stream millions of temperature readings from machinery. Kafka Streams aggregates these readings using tumbling windows to detect overheating patterns and trigger emergency shutdowns.
  • E-commerce Order Tracking: E-commerce platforms use stateful joins to combine "Order Placed" events with "Payment Successful" events to update order fulfillment dashboards in real time.

Common Mistakes to Avoid

  • Forgetting to Configure Serdes: Kafka Streams requires Serializers and Deserializers (Serdes) for every data type. If your keys or values do not match the configured default Serdes, your application will throw a serialization exception at runtime. Always define explicit Serdes when performing operations like map or join.
  • Ignoring Local State Directory Cleanup: By default, Kafka Streams writes local RocksDB state files to /var/tmp/kafka-streams. In production environments, ensure this path points to a persistent, high-performance SSD volume with proper read/write permissions.
  • Modifying Keys in Joins Without Re-keying: If you modify the key of a stream using selectKey() or map(), Kafka Streams must re-partition the data before performing a join or aggregation. Forgetting this can lead to data being routed to the wrong partition, resulting in silent data loss or incorrect aggregations.

Interview Notes: Key Technical Questions

  • What is the difference between KStream and KTable? A KStream is an insert-only stream representing a sequence of independent facts. A KTable is an update-only stream representing the latest state of a dataset, where new records overwrite old records with the same key.
  • How does Kafka Streams handle application failures? Kafka Streams relies on Kafka's consumer group management. If an application instance fails, Kafka triggers a rebalance, and the remaining instances take over the processing partitions. Stateful instances rebuild their local RocksDB state by reading from the corresponding changelog topics.
  • Why would you choose Kafka Streams over Apache Spark? Kafka Streams is a lightweight library, not a framework. It does not require a cluster manager (like YARN or Mesos) or dedicated infrastructure. If your data source and destination are both in Kafka, Kafka Streams is significantly easier to develop, deploy, and maintain.
  • What is the role of RocksDB in Kafka Streams? RocksDB is an embedded, high-performance key-value store used by Kafka Streams to maintain local state for operations like joins, windowing, and aggregations without making expensive network calls to external databases.

Summary

The Kafka Streams API is an exceptional tool for developers building real-time, event-driven applications on top of Apache Kafka. By combining the simplicity of a Java library with powerful stateful capabilities like KTables, windowing, and RocksDB state stores, it enables low-latency processing at scale. When designing your topologies, pay close attention to the stream-table duality, configure your Serdes correctly, and leverage windowing to extract meaningful insights from continuous streams of data.

About the Author

Naresh Kumar

Naresh Kumar

Senior Java Backend Engineer experienced in Banking, Payments, ISO 20022, Spring Boot, Microservices, Kafka, Docker, Kubernetes, AWS and Cloud Native Systems.

Built enterprise payment solutions, transaction processing systems, API platforms and scalable microservices used in production.

LinkedIn Profile