Kafka – Systems continue to evolve, especially in how they communicate: from a simple message queue as one source of communication, to complex (micro)services receiving and forwarding millions of messages per second.
The History of Messaging
I won’t start all the way back from carrier pigeons, don’t worry. 🐦 Systems started out as simple because the use cases weren’t demanding, and the Single Message Queue was the first messaging approach used at the beginning of the system design era.
Then, the Publish/Subscribe Pattern was born.
The Pub/Sub Pattern
The publish-subscribe (or pub/sub) messaging pattern is a design pattern for exchanging messages between senders (publishers) and receivers (subscribers) through topics. Publishers send messages to a topic, subscribers receive messages from the topics they subscribe to, and the two sides stay loosely coupled and can scale independently.
But doesn’t that sound like the Observer Pattern? You are right if you asked yourself that question. The main difference is that the Pub/Sub Pattern is loosely coupled: publishers and subscribers have no knowledge of each other and communicate only through a broker. In the Observer Pattern, the Subject keeps a list of its Observers (the Sub equivalent) and notifies them directly, so both are typically implemented within the same application. Also, pub and sub communicate using messages, which enables asynchronous communication, whereas in the Observer Pattern the Subject calls its Observers synchronously at the moment an event is fired.
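To make the loose coupling concrete, here is a minimal sketch of a pub/sub broker in Python. The `Broker` class, its methods, and the topic names are hypothetical and only for illustration. Delivery here is synchronous to keep the sketch short, whereas a real broker would queue messages so that publishers and subscribers can run at different times.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Hypothetical in-memory broker: routes messages by topic name only."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[bytes], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[bytes], None]) -> None:
        """Register a handler that will receive every message sent to the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: bytes) -> int:
        """Deliver the message to every handler on the topic; return how many got it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


broker = Broker()
received: List[bytes] = []
broker.subscribe("user-activity", received.append)

# The publisher only names a topic; it holds no reference to any subscriber.
delivered = broker.publish("user-activity", b"page_view")
ignored = broker.publish("metrics", b"cpu=0.42")  # nobody subscribed to this topic
```

Note that the publisher and the subscriber never reference each other: both only know the broker and a topic name, which is the key difference from the Observer Pattern.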
System architecture evolved and systems became more complex to manage — as necessity is the mother of invention, LinkedIn created Kafka to solve the problem it faced. The problem was that the metrics collection for user tracking activities and monitoring applications wasn’t optimized for handling real-time data collection (most of what they were doing was polling data). It also wasn’t automated and required human intervention.
So what is Kafka?
Apache Kafka is an event streaming system used for publishing and consuming messages among services. Specifically:
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Why choose Kafka?
Performance: Kafka is known for high throughput and can handle millions of messages per second.
Scalability and Durability: Kafka scales well because each topic is split into multiple partitions, so reads and writes are spread across partitions instead of going through a single log per topic. With some custom configuration, messages from a specific producer can also be directed to a specific partition. In addition, all messages are persisted and kept for one week by default; the retention period is configurable, and Kafka can be configured to keep messages indefinitely.
Distributed by Design: Kafka is distributed by design. A cluster consists of multiple brokers (servers) that can be spread across different zones, and multiple clusters can run in different regions, making it highly available.
Concepts and Terminologies
Messages are the basic unit of data in Kafka; to Kafka, a message is just a byte array. A message can have a key, and messages with the same key are forwarded to the same partition. For efficiency, messages are transferred in batches, where a batch contains messages for the same topic and partition.
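As a rough sketch of how keys and batches relate, the Python snippet below maps a key to a partition and groups records into batches per topic and partition. The function names are hypothetical, and CRC32 is used as a stand-in hash (Kafka's Java client hashes keys with murmur2), so the partition numbers will not match a real cluster.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, bytes, bytes]  # (topic, key, value)


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition; the same key always yields the same partition."""
    # Stand-in hash for illustration only; not the hash Kafka itself uses.
    return zlib.crc32(key) % num_partitions


def build_batches(records: List[Record], num_partitions: int) -> Dict[Tuple[str, int], List[bytes]]:
    """Group record values into batches keyed by (topic, partition)."""
    batches: Dict[Tuple[str, int], List[bytes]] = defaultdict(list)
    for topic, key, value in records:
        batches[(topic, partition_for(key, num_partitions))].append(value)
    return dict(batches)


records = [
    ("orders", b"user-1", b"created"),
    ("orders", b"user-2", b"created"),
    ("orders", b"user-1", b"paid"),
]
batches = build_batches(records, num_partitions=3)
```

Both `user-1` messages end up in the same batch, in the order they were produced, which is what makes per-key ordering possible.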
There are three delivery guarantees in Kafka:
- At Least Once: The producer can send the message more than once (for example, when it retries after a failure), so the consumer is responsible for handling duplicate data.
- At Most Once: The producer sends the message once, and if delivery fails, the message is not sent again.
- Exactly Once: The producer may send the message more than once, but the consumer processes it only once.
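The difference between these guarantees can be illustrated with a small simulation in Python. Everything here is hypothetical (the function names, the `message_id` field, the plain list standing in for a topic), and it only mimics the observable behaviour; it does not show how Kafka implements these guarantees internally.

```python
from typing import List, Set, Tuple

Message = Tuple[int, str]  # (message_id, payload); message_id is a hypothetical field


def send_at_most_once(log: List[Message], msg: Message, delivered: bool) -> None:
    """Send once and never retry: a failed send means the message is lost."""
    if delivered:
        log.append(msg)


def send_at_least_once(log: List[Message], msg: Message, lost_acks: int) -> None:
    """Retry until acknowledged: every lost acknowledgement leaves a duplicate behind."""
    for _ in range(lost_acks + 1):
        log.append(msg)  # the write succeeded even when its acknowledgement was lost


def consume_deduplicated(log: List[Message]) -> List[str]:
    """Process each message id only once, skipping duplicates."""
    seen: Set[int] = set()
    processed: List[str] = []
    for message_id, payload in log:
        if message_id not in seen:
            seen.add(message_id)
            processed.append(payload)
    return processed


lossy_log: List[Message] = []
send_at_most_once(lossy_log, (1, "order-created"), delivered=False)  # message is lost

retry_log: List[Message] = []
send_at_least_once(retry_log, (1, "order-created"), lost_acks=2)  # written three times
processed = consume_deduplicated(retry_log)  # processed a single time
```

Combining retries on the producer side with de-duplication on the consumer side is one simple way to get the "processed only once" effect described above.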
Kafka doesn’t care about the structure of the message, which provides us with great flexibility and performance, but the business layer does: that’s why we need Schemas to define the structure of those messages.
Schemas are a way to define a structure for the messages transferred in Kafka. Below are the schema formats supported by Confluent Schema Registry, which is commonly used alongside Kafka:
- Apache Avro is a data serialization framework developed within Apache’s Hadoop project and is the registry’s default format.
- Protocol Buffers (Protobuf) is a method of serializing structured data, developed by Google.
- JSON Schema is a method to describe and validate JSON documents.
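To illustrate why a schema matters when Kafka itself only sees bytes, here is a minimal Python sketch. It is not Avro, Protobuf, or JSON Schema: the hand-rolled `PAGE_VIEW_SCHEMA` dictionary and the `validate`, `serialize`, and `deserialize` helpers are assumptions made up for this example, standing in for what a real schema library would do.

```python
import json
from typing import Any, Dict

# Hypothetical schema for illustration: field name -> required Python type.
PAGE_VIEW_SCHEMA: Dict[str, type] = {"user_id": int, "url": str}


def validate(record: Dict[str, Any], schema: Dict[str, type]) -> None:
    """Raise ValueError if a field is missing or has the wrong type."""
    for field, expected_type in schema.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected_type):
            raise ValueError(f"field {field!r} must be {expected_type.__name__}")


def serialize(record: Dict[str, Any], schema: Dict[str, type]) -> bytes:
    """Check the record against the schema, then encode it as the bytes Kafka would carry."""
    validate(record, schema)
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize(payload: bytes, schema: Dict[str, type]) -> Dict[str, Any]:
    """Decode bytes back into a record and check it against the schema."""
    record = json.loads(payload.decode("utf-8"))
    validate(record, schema)
    return record


payload = serialize({"user_id": 7, "url": "/home"}, PAGE_VIEW_SCHEMA)
record = deserialize(payload, PAGE_VIEW_SCHEMA)
```

The producer and the consumer agree on the schema, while the broker in between only ever handles `payload` as an opaque byte array.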
Topics and Partitions
A topic is a logical name for grouping messages: think of it as a file containing all the messages published to that topic. A topic can be split into one or more partitions to improve read/write performance.
In the diagram above, you can see four partitions. Producers can be mobile clients, robot services, VR games, smart cars, you name it: a producer can be anything that needs to forward messages. Once a message is issued by a producer, it is assigned to a partition. If a key is specified, the key determines the partition; otherwise the producer picks the partition itself (round-robin or sticky batching, depending on the client version), so you cannot rely on which one it will be. Once the message is published, it will be kept for a week by default. The message is immutable (it cannot be changed), and it receives an ID that keeps incrementing within its partition and never resets.
For instance, if a mobile client publishes a message that is assigned to Partition 0, the message is appended to that partition and given the next incremental ID, ‘5’ in this example. This ID is also called the offset.
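The offset behaviour can be sketched with an append-only list in Python. The `PartitionLog` class is hypothetical and models a single partition only; retention and replication are left out.

```python
from typing import List


class PartitionLog:
    """Hypothetical append-only log standing in for one partition."""

    def __init__(self) -> None:
        self._records: List[bytes] = []

    def append(self, value: bytes) -> int:
        """Store the message and return its offset; offsets only ever grow."""
        self._records.append(value)
        return len(self._records) - 1

    def read_from(self, offset: int) -> List[bytes]:
        """Return every message at or after the given offset."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        return list(self._records[offset:])


partition_0 = PartitionLog()
for i in range(5):
    partition_0.append(f"event-{i}".encode())  # these take offsets 0 through 4

offset = partition_0.append(b"mobile-client-event")  # next free offset is 5
```

Because existing entries are never modified, a consumer can re-read the partition from any offset, including from offset 0.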
Producers and Consumers
In other event streaming tools like NATS, these roles are referred to as Publishers and Subscribers. The Producer is the service that publishes events to Kafka ‘brokers’, and Consumers are the services that read and process these events.
Brokers and Clusters
A single Kafka server or node is called a broker. A broker hosts partitions of one or more topics. Multiple brokers form a Kafka ‘cluster’. Inside a cluster, one broker is selected as the controller, which is responsible for assigning partitions to brokers and for checking for failures among the brokers.
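As a simplified illustration of spreading partitions over brokers, the Python sketch below assigns each partition's replicas round-robin. The `assign_replicas` function and the broker IDs are hypothetical, and Kafka's actual assignment logic is more involved, so treat this only as a mental model.

```python
from typing import Dict, List


def assign_replicas(num_partitions: int, replication_factor: int, broker_ids: List[int]) -> Dict[int, List[int]]:
    """Place each partition's replicas on consecutive brokers, wrapping around."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    n = len(broker_ids)
    return {
        partition: [broker_ids[(partition + i) % n] for i in range(replication_factor)]
        for partition in range(num_partitions)
    }


# Three partitions, two copies each, on three brokers with assumed IDs 1, 2 and 3.
assignment = assign_replicas(num_partitions=3, replication_factor=2, broker_ids=[1, 2, 3])
# assignment == {0: [1, 2], 1: [2, 3], 2: [3, 1]}
```

With this layout, losing any single broker still leaves one copy of every partition available on another broker.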
Topics and Partitions can be distributed among one or more brokers, as demonstrated in the below diagram:
Playing with Kafka
Time to try this out! We’ll keep it simple and use docker-compose 🐋 to spin up a Kafka cluster with 3 brokers. You can find the file at https://github.com/rfashwall/go-kafka/blob/main/docker-compose.yaml.
$ docker-compose up #to run the Kafka cluster
I also used Kafdrop for this, which is a UI web application to view the Kafka cluster — after all, an image is worth a thousand words.
You can download the Kafka binaries here. First up, extract the Kafka tar file; inside, you will find the bin folder that we need.
Create your first topic using kafka-topics.sh — in this command, we specify the topic name, number of partitions, zookeeper server, and replication factor:
$ bin/kafka-topics.sh --create --topic my-first-topic --zookeeper localhost:2181 --partitions 3 --replication-factor 2
Created topic my-first-topic.
Now, start a producer:
$ ./kafka-console-producer.sh --broker-list localhost:9092,localhost:9093,localhost:9094 --topic my-first-topic
> Hello Kafka
The --broker-list flag points the producer to the addresses of the brokers that we spun up, and the --topic option specifies the topic to which the messages will be published. Once the > prompt appears, type as many messages as you want.
In a new terminal start a consumer:
$ ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-first-topic --from-beginning
Hello Kafka
The --bootstrap-server flag can point to any Kafka broker in the cluster, and the --from-beginning option allows the consumer to read all the messages of the specified topic that are stored in the Kafka cluster.
And voila 🎉 — you have set up your first Kafka cluster, published a message, and consumed it!
In the next guide in this series, we’ll take a deep dive into building a real-world microservice application using Golang that will include a Kafka client, a producer, and a consumer. Stay tuned!