Introduction to Kafka

Naveen Singh
5 min read · Apr 3, 2023

Whenever there is an IPL or ICC tournament, the most appreciated people are not the cricketers but the Disney+ Hotstar folks who make the streaming possible. But have you ever wondered how they do it? And how they serve free-tier users with a five-minute delay? I did, and that's when I heard about Kafka.

Kafka, developed at LinkedIn and later donated to the Apache Software Foundation, is open-source software that provides a framework for storing, reading, and analyzing streaming data.

Simplified version: Kafka is a stream-processing platform that enables applications to publish, consume, and process high volumes of record streams in a fast and durable way. Kafka can be called the central nervous system of an architecture.

Why Stream Processing?

Stream processing is a data management technique that involves ingesting a continuous data stream to quickly analyze, filter, transform or enhance the data in real time. Once processed, the data is passed off to an application, data store or another stream processing engine.

What Does Real Time Mean?

Real time could mean five minutes for a weather analytics app, millionths of a second for an algorithmic trading app or a billionth of a second for a physics researcher.

How Does It Work?

  • In layman's terms, Kafka takes data from multiple sources, stores it in a queue-like log, processes it, and passes it on to applications. Kafka stores and processes the data whether or not any application wants it yet, while also maintaining the order of the data.

Terminologies

  • Cluster → Kafka is a distributed system designed to run on one or more brokers.
  • Broker → A computer, instance, or container running Kafka.
  • Event → An event records the fact that “something happened” in the world or in your business (events are just data). Example → eating ice cream, watching the IPL, publishing a new version of an app, etc.
  • Source → Where events come from. Example → an IPL match, daily soaps, etc.
  • Here’s an example event:
Event key: "India Vs England"
Event value: "India won by 6 wickets"
Event timestamp: "March 31, 2023 at 2:06 p.m."
  • Producer → A client application that publishes (writes) events to Kafka (see the producer sketch after this list).
  • Topic → Where events get stored (similar to tables in a MySQL database). A broker can host multiple topics. A topic can also be thought of as a log.
  • Partition → Breaks a topic into multiple parts (discussed later).
  • Offset → A unique number assigned to every event within a partition.
  • Consumer → A client application that subscribes to (reads and processes) these events.
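
To make these terms concrete, here is a minimal producer sketch in Python using the kafka-python client. The broker address ("localhost:9092") and the topic name ("matches") are assumptions for illustration, not anything the article prescribes.

from kafka import KafkaProducer

# Connect to a broker (the address is an assumption for this sketch)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: v.encode("utf-8"),
)

# Publish the example event to a hypothetical "matches" topic
producer.send(
    "matches",
    key="India Vs England",
    value="India won by 6 wickets",
)
producer.flush()  # block until the broker has acknowledged the event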

Architecture

Let’s understand the image below:

Image 1
  • Producers are applications that send events to Kafka. Multiple producers can send events at the same time.
  • Kafka runs on a broker (server) that has a persistent volume attached. Example → Kafka running on an EC2 instance with an EBS volume attached. Incoming events are stored on disk in the form of topics. Topics are roughly similar to tables in SQL.
  • Just as a SQL database can hold as many tables as its disk size allows, the same goes for topics in Kafka.
  • Example → Say a Kafka cluster has 2 brokers with disk sizes of 10 TB and 20 TB respectively. An unpartitioned topic must fit on a single broker, so the maximum size of a topic is 20 TB, which places a limit on Kafka's scalability.
  • To resolve this, partitions were introduced. Partitioning breaks a topic into smaller parts, each of which can live on a different broker in the cluster. This makes reading and writing events much more scalable.
  • Partitioning is done on the basis of a partition key (don't confuse it with a SQL primary key), which decides the partition for each incoming event. Taking the event above as an example, the partition key could be the match name, "India Vs England", or a category like "win". Kafka ensures that the events for this match stay in order even though the topic is partitioned. Similarly, events keyed "Real Madrid Vs Barcelona" would go to their own partition, and so on.
  • But what happens if an event does not contain a partition key? In that case Kafka uses a round-robin mechanism to spread events across the partitions.
  • Kafka ensures that events with the same key land in the same partition, and in order (see the partitioner sketch after this list).
  • Each event in a partition gets a unique number called an offset.
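
Here is a simplified sketch of that partitioning logic. It is only an illustration: Kafka's real default partitioner hashes the key bytes with murmur2, and newer clients use a "sticky" strategy rather than strict round-robin for keyless events.

import itertools

NUM_PARTITIONS = 3
round_robin = itertools.cycle(range(NUM_PARTITIONS))

def choose_partition(key=None):
    # Keyless events are spread across partitions in round-robin order
    if key is None:
        return next(round_robin)
    # Keyed events always hash to the same partition, preserving per-key order
    # (real Kafka uses murmur2 on the key bytes, not Python's built-in hash)
    return hash(key) % NUM_PARTITIONS

# Both events for this match land in the same partition, so their order is kept
assert choose_partition("India Vs England") == choose_partition("India Vs England")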
Image 2
  • In Image 2, multiple producers can write to different partitions at the same time. The colors denote the partition key, and the events of each color get stored in a particular partition.
Image 3
  • The Kafka consumer works by issuing “fetch” requests to the brokers leading the partitions it wants to consume (a pull mechanism). The consumer specifies its offset in the log with each request and receives back a chunk of the log beginning at that position. The consumer thus has significant control over this position and can rewind it to re-consume data if need be, as in the sketch below.
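
A minimal consumer sketch, again with kafka-python; the topic name, consumer group, and broker address are assumptions. It shows the pull model and how the offset can be rewound to re-consume events.

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="scoreboard",            # hypothetical consumer group
    auto_offset_reset="earliest",     # start from the oldest event if no offset is stored
)

# Manually take partition 0 of the hypothetical "matches" topic
partition = TopicPartition("matches", 0)
consumer.assign([partition])

# Rewind to offset 0 to re-consume the partition from the beginning
consumer.seek(partition, 0)

for message in consumer:
    print(message.offset, message.key, message.value)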

Features of Kafka

  • Fault tolerant → Kafka replicates partitions across the brokers in a cluster, which ensures high availability: the replicas can take over if a broker dies. Each replicated partition has a leader, which handles write operations, and N-1 follower replicas that stay in sync and can be promoted if the leader fails.
  • Durable → Kafka also stores events for N days (the default is one week), as configured by the retention policy (see the sketch after this list).
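
Both can be configured per topic at creation time. Here is a sketch using kafka-python's admin client; the topic name, partition count, and replication factor are assumptions for illustration.

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# 3 partitions, each replicated to 2 brokers; events retained for 7 days (in ms)
topic = NewTopic(
    name="matches",
    num_partitions=3,
    replication_factor=2,
    topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
)
admin.create_topics([topic])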

Kafka vs. RabbitMQ

RabbitMQ is a very popular open source message broker, a type of middleware that enables applications, systems, and services to communicate with each other by translating messaging protocols between them.

Because Kafka began as a kind of message broker (and can, in theory, still be used as one), and because RabbitMQ supports a publish/subscribe messaging model (among others), Kafka and RabbitMQ are often compared as alternatives. But the comparisons aren't really practical, and they often dive into technical details that are beside the point when choosing between the two. For example, that Kafka topics can have multiple subscribers, whereas each RabbitMQ message can have only one; or that Kafka topics are durable, whereas RabbitMQ messages are deleted once consumed.

The bottom line is:

  • Kafka is a stream processing platform that enables applications to publish, consume, and process high volumes of record streams in a fast and durable way; and
  • RabbitMQ is a message broker that enables applications that use different messaging protocols to send messages to, and receive messages from, one another.

Here’s the practical implementation

Thanks for reading!!!
