Introduction to Kafka Connect: Source and Sink Connectors

In a modern data architecture, Apache Kafka acts as the central nervous system. However, for Kafka to be useful, it must seamlessly integrate with external systems like relational databases, NoSQL datastores, search indexes, key-value stores, and cloud storage. Writing custom producer and consumer code for every single external system is repetitive, error-prone, and difficult to scale.

This is where Kafka Connect comes in. Kafka Connect is a free, open-source component of Apache Kafka that provides a framework for scalably and reliably streaming data between Apache Kafka and other data systems. It simplifies the integration process by offering ready-to-use connectors, eliminating the need to write custom integration code.

What is Kafka Connect?

Kafka Connect is a robust integration framework designed to run continuously as a dedicated service. It standardizes the integration of data sources and destinations with Kafka. Instead of writing boilerplate code to read from a database and write to Kafka, you simply configure a pre-built connector.

The architecture of Kafka Connect revolves around a simple flow: data is pulled from a source system and pushed into Kafka topics, or pulled from Kafka topics and pushed into a target sink system. The following diagram illustrates this pipeline:

┌───────────────────────────┐
│  Source System (e.g. SQL) │
└─────────────┬─────────────┘
              │ (Ingest)
              ▼
┌───────────────────────────┐
│  Kafka Source Connector   │
└─────────────┬─────────────┘
              │ (Publish)
              ▼
┌───────────────────────────┐
│     Kafka Topic (Log)     │
└─────────────┬─────────────┘
              │ (Subscribe)
              ▼
┌───────────────────────────┐
│    Kafka Sink Connector   │
└─────────────┬─────────────┘
              │ (Export)
              ▼
┌───────────────────────────┐
│  Target System (e.g. S3)  │
└───────────────────────────┘

Core Concepts: Workers, Tasks, and Converters

To understand how Kafka Connect operates under the hood, we must look at its core building blocks: Connectors, Tasks, Workers, and Converters.

Connectors: The high-level logical units that define where data comes from or where it goes. They do not execute the data copying themselves; instead, they break the work down into smaller, parallel configurations called Tasks.
Tasks: The actual execution units that move data. Source Tasks read from external systems and return a list of source records to be written to Kafka. Sink Tasks receive records from Kafka and write them to the target system. Tasks do not store state; they are completely stateless.
Workers: The running JVM processes that execute the Connectors and Tasks. Workers handle HTTP requests for configuration, balance tasks across available resources, and manage fault tolerance. Workers can run in two modes: Standalone (single process, ideal for testing) or Distributed (multiple clustered processes, ideal for production).
Converters: Components that translate data between the internal Kafka Connect format and the byte format stored in Kafka topics. Common converters include JSON Converter, Avro Converter, Protobuf Converter, and String Converter.

Source Connectors vs. Sink Connectors

Kafka Connect categorizes data movement into two primary directions: Source and Sink.

1. Source Connectors

Source Connectors pull data from an external system and write it to one or more Kafka topics. Examples include reading transactional logs from a database (Change Data Capture or CDC), tailing application log files, or polling a REST API. The source connector is responsible for tracking offsets in the external system so it can resume from where it left off in case of a failure.

2. Sink Connectors

Sink Connectors read data from Kafka topics and deliver it to an external system. Examples include indexing Kafka records into Elasticsearch for search, writing records to Amazon S3 for long-term archiving, or updating a relational database table. Sink connectors leverage Kafka's consumer group protocol to coordinate partition assignments and track offsets.

Step-by-Step Configuration Examples

Kafka Connect uses simple JSON configurations to define integrations. Let us look at two practical configuration examples using the built-in FileStream connectors.

Example 1: FileStreamSourceConnector Configuration

This configuration reads data from a local text file named application.log and publishes each line as a new message to a Kafka topic named log-analysis-topic.

{
  "name": "local-file-source",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/var/log/application.log",
    "topic": "log-analysis-topic",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter"
  }
}

Example 2: FileStreamSinkConnector Configuration

This configuration reads data from the Kafka topic named log-analysis-topic and appends the messages to a local file named output-archive.txt.

{
  "name": "local-file-sink",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": "1",
    "file": "/var/log/output-archive.txt",
    "topics": "log-analysis-topic",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter"
  }
}

Real-World Use Cases

Database Replication (Change Data Capture): Using a source connector like Debezium to capture row-level changes (INSERT, UPDATE, DELETE) from databases like MySQL or PostgreSQL and streaming those changes to Kafka topics in real-time.
Search Indexing: Using an Elasticsearch Sink Connector to automatically index data written to Kafka topics, making it instantly searchable by downstream applications.
Data Lake Ingestion: Using an Amazon S3 or Google Cloud Storage Sink Connector to batch stream data from Kafka topics into cloud object stores for cold storage, analytics, or machine learning pipelines.
Legacy System Integration: Pulling mainframe flat files or legacy MQ messages into Kafka to modernize the backend infrastructure without modifying legacy application code.

Common Mistakes and How to Avoid Them

Running Standalone Mode in Production: Standalone mode is simple to set up but lacks fault tolerance and scalability. If the single worker process dies, your integration stops. Always use Distributed Mode in production environments to ensure automatic task rebalancing and high availability.
Mismatched Converters: A very common error is configuring a sink connector with a different converter than the one used by the producer or source connector. For example, if data is written to Kafka as Avro, but the sink connector is configured to read it using the JSON Converter, the connector will fail with serialization errors. Always match your serialization and deserialization formats.
Ignoring Schema Registry for Structured Data: When using structured formats like Avro or Protobuf, always integrate Kafka Connect with a Schema Registry. Hardcoding schemas inside JSON payloads increases network overhead and makes schema evolution difficult to manage.
Over-allocating Tasks: Setting tasks.max to a number higher than your physical partitions (for sink connectors) or available source partitions will result in idle tasks. Match your task count strategically to your partition count and available resources.

Interview Notes and Technical Questions

Question: How does Kafka Connect achieve fault tolerance in Distributed Mode?
Answer: In distributed mode, workers form a cluster using Kafka's group coordination protocol (similar to consumer groups). If a worker node crashes, the remaining workers detect its absence, trigger a rebalance, and automatically redistribute the tasks that were running on the failed worker to the healthy workers.
Question: What is the difference between a Connector and a Task?
Answer: A Connector is a logical configuration template that defines the integration logic, parameters, and how to split the work. A Task is the actual runtime thread executing the data copy. A single Connector can spawn multiple Tasks to parallelize the workload.
Question: Why do we need Converters in Kafka Connect?
Answer: Kafka Connect represents data internally using a generic, strongly-typed data structure (Connect Records). Converters are responsible for serializing this internal representation into bytes when writing to Kafka, and deserializing bytes back into Connect Records when reading from Kafka.

Summary

Kafka Connect is a powerful, declarative framework that bridges the gap between Apache Kafka and external data systems. By leveraging Source Connectors to ingest data and Sink Connectors to export data, developers can build scalable, fault-tolerant, and real-time data integration pipelines with zero custom code. Understanding the relationship between Workers, Tasks, and Converters is key to successfully deploying and maintaining Kafka Connect in production environments.

Introduction to Kafka Connect: Source and Sink Connectors

What is Kafka Connect?

Core Concepts: Workers, Tasks, and Converters

Source Connectors vs. Sink Connectors

1. Source Connectors

2. Sink Connectors

Step-by-Step Configuration Examples

Example 1: FileStreamSourceConnector Configuration

Example 2: FileStreamSinkConnector Configuration

Real-World Use Cases

Common Mistakes and How to Avoid Them

Interview Notes and Technical Questions

Summary

🔥 Popular Topics

About the Author

Naresh Kumar

Introduction to Kafka Connect: Source and Sink Connectors

What is Kafka Connect?

Core Concepts: Workers, Tasks, and Converters

Source Connectors vs. Sink Connectors

1. Source Connectors

2. Sink Connectors

Step-by-Step Configuration Examples

Example 1: FileStreamSourceConnector Configuration

Example 2: FileStreamSinkConnector Configuration

Real-World Use Cases

Common Mistakes and How to Avoid Them

Interview Notes and Technical Questions

Summary

Related Topics

🔥 Popular Topics

About the Author

Naresh Kumar