Dissecting Apache Kafka

Introduction to Kafka: The Need for a Distributed Messaging System

Kafka is a distributed messaging system that plays a crucial role in modern data pipelines by addressing the need for high-throughput, low-latency communication between different services in a scalable and fault-tolerant manner. Traditional messaging systems, such as RabbitMQ and JMS, while capable, often struggle with scaling to handle large volumes of data or ensuring data consistency across distributed systems.

Kafka overcomes these challenges by enabling real-time streaming and providing a durable, distributed event log that acts as a central data hub for all services. It is particularly well-suited for use cases such as log aggregation, stream processing, and event sourcing, where data must be collected, processed, and consumed in real time. Kafka’s ecosystem is designed to provide a seamless experience for managing large-scale data flows.

This includes Kafka itself, which handles message brokering, ZooKeeper (or KRaft in newer versions) for managing the Kafka cluster, Kafka Connect for integrating external systems, and Kafka Streams for stream processing and real-time analytics. Together, these components create a powerful platform for building scalable, resilient, and fault-tolerant data-driven applications.

Kafka Cluster / Brokers, Topics, and Partitions — The Backbone

Broker / Cluster: A broker is a single Kafka server, and a cluster is a group of brokers working together. Each broker handles part of the data load.
Topic: Logical channel to which producers send messages and consumers read from.
Partition: Unit of parallelism within a topic. Each message goes to one partition.
Replication: For each partition, Kafka can create multiple replicas (one leader + followers) for high availability.

Kafka Data Flow — From Producer to Consumer

Producer sends a record to a topic.
Kafka determines the partition:
- If a key is provided → hash(key) % number of partitions.
- If no key → Round-robin, sticky batching (the default in newer clients), or a custom partitioner.
The record is stored sequentially in the target partition's log.
The leader broker of that partition writes the record to its log, and the follower brokers then replicate it from the leader.
Consumers in a consumer group fetch records from their assigned partitions.
Kafka tracks the offset each consumer has read. (Like a bookmark in a book.)
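
The partition-selection step above can be sketched in a few lines of Python. This is only an illustration: the function name choose_partition and the counter argument are hypothetical, and zlib.crc32 is a stand-in hash chosen because it ships with Python (the real Kafka clients use a different hash function).

```python
import zlib
from typing import Optional


def choose_partition(key: Optional[bytes], num_partitions: int, counter: int = 0) -> int:
    """Pick a partition index for a record (illustrative only)."""
    if key is not None:
        # Stand-in hash: CRC32 is used here only because it is in the stdlib.
        return zlib.crc32(key) % num_partitions
    # No key: plain round-robin driven by a caller-supplied counter.
    return counter % num_partitions


if __name__ == "__main__":
    records = [b"user-1", b"user-2", b"user-1", None, None]
    for i, key in enumerate(records):
        print(key, "->", choose_partition(key, 3, counter=i))
```

The point to notice is that the same key always lands on the same partition, which is what makes per-key ordering possible.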

Kafka Partitions — Scaling, Increasing, and Decreasing

A partition in Kafka is a fundamental unit of parallelism and storage. Each topic can be divided into multiple partitions, which helps distribute records across them. Partitions allow Kafka to scale and handle large volumes of data efficiently, enabling high throughput and parallel processing.

Kafka supports increasing the number of partitions of an existing topic after it has been created (decreasing the count is not supported). Increasing it can offer several benefits:

Improved parallelism: More partitions mean more consumers can read from the topic concurrently.
Higher throughput: Producers and consumers can scale independently of each other.

The partition count directly impacts consumer behavior within a consumer group. Here’s how different scenarios play out:

With 3 partitions and 2 consumers, one consumer will handle 2 partitions, and the other will handle 1.
With 3 partitions and 3 consumers, the partitions will be evenly distributed (one-to-one mapping).
With 3 partitions and 4 consumers, one consumer will remain idle since there are more consumers than partitions.
With 3 partitions and 1 consumer, that single consumer will handle all 3 partitions.
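
The four scenarios above can be reproduced with a small simulation. This is a simplified sketch of a round-robin style assignment, not Kafka's actual assignor code; the function name and the consumer labels are made up for illustration.

```python
def assign_round_robin(partitions, consumers):
    """Deal partitions to consumers one at a time, like dealing cards."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    partitions = [0, 1, 2]
    for n in (1, 2, 3, 4):
        consumers = [f"c{i + 1}" for i in range(n)]
        print(n, "consumer(s):", assign_round_robin(partitions, consumers))
```

With four consumers, the last one ends up with an empty list, which is the idle consumer described above.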

Kafka Consumer Group, Offset, Polling, and Auto-Commit Explained

Consumer Group Concept

A consumer group in Kafka is a logical grouping of consumers that work together to consume messages from a topic. Each consumer in a group is assigned a subset of partitions, so each message is delivered to only one consumer in the group.

Each partition is consumed by exactly one consumer in a group.
Multiple groups can independently consume the same topic without interfering with each other.

Offset in Kafka

Kafka tracks the offset, which is the position of a consumer in a partition — essentially a pointer to the message being read.

Offsets are maintained per partition per consumer group.
By default, offsets are stored in an internal Kafka topic: __consumer_offsets.

You can configure how offsets are committed using auto-commit or manual commit modes.

Auto-Commit

Kafka consumers can be configured to commit offsets automatically at regular intervals by setting enable.auto.commit=true; the interval is controlled by auto.commit.interval.ms.

With these settings, the consumer commits the latest polled offsets every 5 seconds (the default interval). While convenient, this may lead to message loss if an offset is committed and the consumer then fails before it has finished processing the corresponding messages.

Manual Offset Commit

A more robust approach is to commit the offset manually after a message has been processed. This gives at-least-once delivery: if the consumer crashes before the commit, the message is read again rather than lost.
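
To see why committing after processing gives at-least-once behaviour, here is a minimal in-memory simulation. It does not use a Kafka client; the log is a plain list, and the crash_after parameter is a hypothetical knob that simulates a crash between processing a record and committing its offset.

```python
def consume(log, committed, crash_after=None):
    """Process records starting at `committed`, committing after each one.

    Returns (processed_records, committed_offset). If `crash_after` is set,
    the consumer "crashes" right after processing that offset, before the
    commit for it happens.
    """
    processed = []
    for offset in range(committed, len(log)):
        processed.append(log[offset])   # "process" the record
        if offset == crash_after:
            return processed, committed  # crash: this commit never happened
        committed = offset + 1           # commit = next offset to read
    return processed, committed


if __name__ == "__main__":
    log = ["a", "b", "c"]
    first_run, committed = consume(log, 0, crash_after=1)
    second_run, committed = consume(log, committed)
    print(first_run, second_run, committed)
```

In this run the record at offset 1 is processed twice (once before the crash and once after the restart), but nothing is skipped. That duplicate is the price of at-least-once delivery, so processing should be safe to repeat.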

Poll Interval

Kafka consumers use the poll() method to request messages from the broker. You must poll regularly; otherwise, Kafka considers the consumer dead and triggers a rebalance.

The limit is set by max.poll.interval.ms (5 minutes by default). If the consumer doesn't call poll() within this time, it's removed from the group, and its partitions are reassigned.
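
A quick back-of-the-envelope check can tell you whether a consumer is at risk of missing its poll deadline. The sketch below assumes records are processed one by one between polls; the function name is hypothetical, the 300,000 ms default mirrors max.poll.interval.ms, and the batch size of 500 in the usage example is an assumed value.

```python
def batch_fits_poll_interval(max_poll_records, ms_per_record,
                             max_poll_interval_ms=300_000):
    """Return True if a full batch can be processed before the next poll is due."""
    worst_case_ms = max_poll_records * ms_per_record
    return worst_case_ms < max_poll_interval_ms


if __name__ == "__main__":
    # 500 records at 100 ms each take 50 s: comfortably within 5 minutes.
    print(batch_fits_poll_interval(500, 100))
    # 500 records at 1 s each take 500 s: the consumer would be kicked out.
    print(batch_fits_poll_interval(500, 1000))
```

If the check fails, the usual options are to fetch fewer records per poll, speed up processing, or raise the interval.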

Rebalancing in Kafka: Why It Happens and How It Affects Consumers

Rebalancing in Kafka is the process of redistributing partitions among consumers within a consumer group. It is triggered when there are changes in the group, such as a consumer joining or leaving, a subscribed topic being added or its partition count changing, or a consumer failing to poll within the configured interval.

This process temporarily pauses consumption as Kafka stops message delivery, reassigns partition ownership, and resumes once the new assignments are in place. While necessary for load balancing, rebalancing can cause latency or downtime, especially with stateful or slow-to-rejoin consumers.

Internally, Kafka handles rebalancing through:

Coordinator election to manage the group,
Partition assignment using strategies like range, round-robin, or sticky,
Offset fetching to resume processing,
Consumer resumption from newly assigned partitions.

Example: In a group with 4 partitions and 2 consumers, each consumer handles 2 partitions. If a third joins, Kafka rebalances to spread partitions across all three.
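
The example above can be simulated to show what "sticky" means in practice. This is a toy greedy algorithm written for this article, not Kafka's actual sticky assignor; both function names are hypothetical, and it assumes at least one consumer is present.

```python
def sticky_rebalance(current, consumers):
    """Rebalance so sizes differ by at most one, moving as little as possible."""
    new = {c: list(current.get(c, [])) for c in consumers}
    # Partitions whose owner has left the group need a new home.
    unassigned = [p for c, ps in current.items() if c not in new for p in ps]
    while True:
        while unassigned:
            target = min(consumers, key=lambda c: len(new[c]))
            new[target].append(unassigned.pop())
        most = max(consumers, key=lambda c: len(new[c]))
        least = min(consumers, key=lambda c: len(new[c]))
        if len(new[most]) - len(new[least]) <= 1:
            return new
        # Take one partition from the busiest consumer and hand it out again.
        unassigned.append(new[most].pop())


def moved_partitions(before, after):
    """List the partitions whose owner changed between two assignments."""
    owner_before = {p: c for c, ps in before.items() for p in ps}
    owner_after = {p: c for c, ps in after.items() for p in ps}
    return sorted(p for p in owner_after if owner_before.get(p) != owner_after[p])


if __name__ == "__main__":
    before = {"c1": [0, 1], "c2": [2, 3]}
    after = sticky_rebalance(before, ["c1", "c2", "c3"])
    print(after, "moved:", moved_partitions(before, after))
```

When the third consumer joins, only one partition changes hands here. Reassigning everything from scratch would typically move more partitions, and every moved partition means a consumer has to rebuild any state tied to it.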

To minimize disruption from rebalancing:

Use sticky assignment to reduce reshuffling.
Adjust session.timeout.ms and heartbeat.interval.ms for better tolerance.
Avoid frequent consumer churn.
Use cooperative rebalancing for smoother transitions (available in newer Kafka versions).

Leader and Replica in Kafka: High Availability Through Replication

In Kafka, each partition of a topic can be replicated across multiple brokers (as set by the topic's replication factor) to ensure fault tolerance and high availability.

Leader Replica: Handles all read and write requests for the partition.
Follower Replicas: Passive replicas that copy data from the leader.

Only one broker at a time is the leader for a given partition. The remaining replicas are known as followers.

Let’s say you have a topic SendEmailQueue with 3 partitions and a replication factor of 3:

Partition	Leader Broker	Follower Brokers
P0	Broker 1	Broker 2, 3
P1	Broker 2	Broker 1, 3
P2	Broker 3	Broker 1, 2

Each broker is leading one partition and following two others.

What Happens If Leader Fails?

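If a leader replica fails, Kafka elects a new leader from the in-sync replicas (ISR), the set of replicas that are fully caught up with the leader. If no replica is in sync, the partition stays offline until an in-sync replica comes back, unless unclean.leader.election.enable is set to true (not recommended in production, because it can lose data).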
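
The failover rule can be expressed as a small function. This is a simplified model for illustration, not the controller's real logic: the function name is hypothetical, broker ids are plain integers, and the `unclean` flag stands in for the unclean leader election setting.

```python
def elect_leader(replicas, isr, failed, unclean=False):
    """Return the new leader's broker id, or None if the partition is offline."""
    live = [b for b in replicas if b not in failed]
    # Prefer the first live replica that is still in sync.
    for broker in live:
        if broker in isr:
            return broker
    # No in-sync replica is alive: only an unclean election can proceed,
    # and it may lose records the old leader had not replicated yet.
    if unclean and live:
        return live[0]
    return None


if __name__ == "__main__":
    # P0 from the table above: leader on broker 1, followers on 2 and 3.
    print(elect_leader([1, 2, 3], isr={1, 2, 3}, failed={1}))
    print(elect_leader([1, 2, 3], isr={1}, failed={1}))
    print(elect_leader([1, 2, 3], isr={1}, failed={1}, unclean=True))
```

The second call models the risky case: the only in-sync replica was the failed leader, so the partition stays offline unless unclean election is allowed.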

Frequently Asked Questions (FAQs) About Kafka

Kafka can be a complex system to understand, especially when you are first diving into its various components and concepts. Here are some of the most frequently asked questions that can help clarify common doubts about Kafka.

What is the difference between Kafka and a traditional messaging queue like RabbitMQ?

Kafka and RabbitMQ are both message brokers, but they have different use cases and design principles:

Kafka is designed for high throughput and distributed data streaming. It stores messages in topics and partitions, and consumers can read messages at their own pace, replaying them if needed.
RabbitMQ is more focused on messaging between services with high reliability and flexible routing patterns. It uses queues for message delivery and is designed for scenarios requiring complex routing and guarantees like at-most-once or at-least-once delivery.

Kafka is generally more suited for log aggregation, stream processing, and big data use cases, while RabbitMQ is preferred for traditional messaging with complex patterns like RPC or pub/sub.

What happens if a Kafka broker goes down?

Kafka has built-in fault tolerance. When a broker goes down, the partitions it hosted remain available through their replicas on other brokers. Kafka uses replication and a leader-follower architecture to protect against data loss:

The leader replica for each partition will handle read and write operations.
The follower replicas replicate the leader’s data.

If a leader replica is lost due to a broker failure, Kafka will automatically elect a new leader from the in-sync followers. However, if no replica is available, the partition becomes unavailable until a broker hosting it recovers.

What is a Kafka Consumer Group?

A Consumer Group is a group of consumers that work together to consume messages from one or more topics. Kafka ensures that each partition in a topic is consumed by only one consumer within a group. Consumer groups provide scalability and fault tolerance by distributing partition consumption across multiple consumers.

If a consumer fails, other consumers in the group can pick up the partitions the failed consumer was consuming.
Consumer groups allow parallel processing of messages, and each message is delivered to only one consumer within the group.

What are Kafka Topics and Partitions?

Topics are logical channels to which producers publish messages and from which consumers consume messages. Topics can be thought of as message categories.
Partitions are the physical storage units within a topic. A topic can have multiple partitions, and messages within a partition are ordered. Partitions enable Kafka to scale horizontally by allowing parallel reads and writes.

Each partition can only be consumed by one consumer at a time in a consumer group, and each message in a partition is identified by an offset that consumers can track.

How does Kafka guarantee message order?

Kafka guarantees message order at the partition level, not across the entire topic. Within a single partition, messages are ordered based on the order in which they were produced. The partition key determines how messages are distributed across partitions:

If you want to preserve message order for a specific key, ensure that all messages with the same key are sent to the same partition.

However, Kafka does not guarantee order across different partitions within a topic.

How does Kafka handle message retention?

Kafka has a retention policy that controls how long messages are stored in a topic. There are two main retention mechanisms:

Time-based retention: Messages are retained for a specified period, after which they are deleted.
Size-based retention: Kafka deletes the oldest messages when a partition's log exceeds a specified size limit (the limit applies per partition, not per topic).

Messages can be replayed as long as they are within the retention window. Once deleted, they are no longer available for consumption.
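
Both retention rules can be modelled on a list of records. This is a deliberately simplified sketch: real Kafka removes whole log segments rather than individual records, and the function and parameter names here are hypothetical.

```python
def apply_retention(records, now_ms, retention_ms=None, retention_bytes=None):
    """Return the records that survive retention.

    `records` is a list of (timestamp_ms, size_bytes) tuples, oldest first.
    """
    kept = list(records)
    if retention_ms is not None:
        # Time-based: drop anything older than the retention window.
        kept = [r for r in kept if now_ms - r[0] <= retention_ms]
    if retention_bytes is not None:
        # Size-based: drop the oldest records until the log fits the limit.
        while kept and sum(size for _, size in kept) > retention_bytes:
            kept.pop(0)
    return kept


if __name__ == "__main__":
    log = [(0, 100), (1000, 100), (2000, 100)]
    print(apply_retention(log, now_ms=2500, retention_ms=2000))
    print(apply_retention(log, now_ms=2500, retention_bytes=100))
```

In both cases it is always the oldest data that goes first, which is why a slow consumer can lose the chance to read messages it has not reached yet.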

What is Kafka Consumer Lag?

Consumer lag refers to the difference between the latest offset (the last message produced) and the current offset (the last message consumed) for a consumer group in a partition. Lag occurs when consumers are behind in processing messages.

High lag indicates that consumers are not keeping up with the rate of incoming messages.
Kafka provides monitoring tools to track lag, and it’s important to ensure that lag remains low for timely processing.
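
Lag is simple arithmetic once you have the two offsets per partition. The sketch below uses made-up numbers and a hypothetical function name; in practice the offsets would come from a monitoring tool or the admin client.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: latest produced offset minus the group's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


if __name__ == "__main__":
    log_end = {0: 100, 1: 50, 2: 75}
    committed = {0: 90, 1: 50}   # nothing committed yet for partition 2
    lag = consumer_lag(log_end, committed)
    print(lag, "total:", sum(lag.values()))
```

Treating a partition with no committed offset as starting from 0 is an assumption made here for simplicity.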

How do Kafka Producers ensure data durability?

Kafka producers ensure durability through the acknowledgment mechanism:

acks=0: The producer does not wait for any acknowledgment from the broker. This is faster but less reliable.
acks=1: The producer waits for acknowledgment from the leader broker. This ensures that the message is written to at least one broker.
acks=all: The producer waits for acknowledgment from all in-sync replicas. This provides the highest durability but may impact performance.
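
One way to compare the three settings is to ask how many replicas must hold a record before the producer hears back. The function below is an illustrative model with hypothetical names, not client code. The min_insync_replicas parameter models a broker-side setting that this article does not cover; it is included as an assumption to show that acks=all can reject writes when too few replicas are in sync.

```python
def replicas_required(acks, isr_size, min_insync_replicas=1):
    """How many replicas must have the record before the producer is acknowledged."""
    if acks == "0":
        return 0                      # fire and forget
    if acks == "1":
        return 1                      # the leader only
    if acks == "all":
        if isr_size < min_insync_replicas:
            raise RuntimeError("not enough in-sync replicas to accept the write")
        return isr_size               # every replica currently in sync
    raise ValueError(f"unknown acks setting: {acks!r}")


if __name__ == "__main__":
    for setting in ("0", "1", "all"):
        print(setting, "->", replicas_required(setting, isr_size=3))
```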

What is Kafka's Exactly-Once Semantics (EOS)?

Kafka provides exactly-once semantics (EOS) to ensure that a message is neither lost nor duplicated during processing. EOS is achieved by:

Idempotent Producers: With idempotence enabled, the broker detects and discards retried duplicates, so even if a producer sends the same message multiple times, it is written only once to the partition.
Transactional Producers and Consumers: Kafka supports transactions that allow producers to write messages to multiple partitions as a single atomic operation. Consumers configured with isolation.level=read_committed only see messages from committed transactions, so messages from aborted transactions are never processed.
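
The idea behind idempotent producers can be shown with sequence numbers. The sketch below is a simplified in-memory model: the class and method names are hypothetical, and the real protocol tracks more state than a single last-seen sequence per producer.

```python
class PartitionLog:
    """Toy partition log that drops retried duplicates from a producer."""

    def __init__(self):
        self.records = []
        self.last_seq = {}   # producer_id -> highest sequence number written

    def append(self, producer_id, seq, value):
        """Append the record unless this sequence number was already written."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False     # duplicate caused by a retry: ignore it
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return True


if __name__ == "__main__":
    log = PartitionLog()
    log.append(producer_id=1, seq=0, value="order-created")
    log.append(producer_id=1, seq=0, value="order-created")  # retry after a timeout
    log.append(producer_id=1, seq=1, value="order-paid")
    print(log.records)
```

The retry is acknowledged but not written again, so the log contains each message once even though the producer sent one of them twice.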

Can I change the number of partitions in Kafka?

Yes, you can increase the number of partitions in Kafka, but it is not possible to decrease the number of partitions. Increasing partitions allows Kafka to scale horizontally, distributing the load across more consumers.

However, adding partitions triggers a rebalance, and it changes which partition a given key maps to, so per-key ordering is not guaranteed across the change. Offsets for the existing partitions are unaffected, but the change should still be planned carefully.
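
The effect on keyed messages is easy to demonstrate: with a hash-modulo scheme, changing the partition count changes where many keys land. This sketch uses zlib.crc32 as a stand-in hash and hypothetical function names and keys; the exact keys that move depend on the hash function.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in hash: CRC32 is used only because it is in the stdlib.
    return zlib.crc32(key) % num_partitions


def moved_keys(keys, old_count, new_count):
    """Keys whose partition changes when the partition count changes."""
    return [k for k in keys
            if partition_for(k, old_count) != partition_for(k, new_count)]


if __name__ == "__main__":
    keys = [f"user-{i}".encode() for i in range(1000)]
    moved = moved_keys(keys, 3, 4)
    print(f"{len(moved)} of {len(keys)} keys map to a different partition")
```

Any key that moves will have its older messages in one partition and its newer ones in another, which is why ordering for that key can no longer be relied on across the change.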

What is the difference between `kafka-console-consumer` and `kafka-console-producer`?

kafka-console-consumer is a command-line tool that allows you to consume messages from a Kafka topic.
kafka-console-producer is a command-line tool that allows you to produce messages to a Kafka topic.
