What is Apache Kafka?
Apache Kafka is a distributed streaming platform. This essentially means that Kafka has the following capabilities:
- Publish and subscribe to streams of records, much like a message queue (e.g. RabbitMQ)
- Store streams of records for a specified time. Kafka is not a replacement for a database or a logging platform. However, Kafka might be one component of a logging platform
- Process streams of records. For example, we may join two sources of related data to produce an output stream of the desired records
When do we use Kafka?
- When we want to build a real-time data pipeline that shares records reliably between applications. For example, an e-commerce company might stream booking data to different teams, such as accounting and data science, so they can extract insights or maintain records
- When we want to build a real-time streaming application that transforms data or triggers actions. For example, if an e-commerce company wants to relate consumer search behaviour to buying decisions, it could join browsing records with purchase records
Terminology
Producer
As the name suggests, a producer is the application that publishes records to Kafka.
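As a rough sketch, this is what a producer can look like with the official Java client (kafka-clients). The broker address localhost:9092 and the topic name "bookings" are assumptions made for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of one broker in the cluster; "localhost:9092" assumes a local setup.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "bookings" is a hypothetical topic; key "order-42" and value "booked" are sample data.
            producer.send(new ProducerRecord<>("bookings", "order-42", "booked"));
            producer.flush();
        }
    }
}
```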
Consumer
A consumer is the application that reads (consumes) the records.
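A minimal consumer sketch with the same Java client; the group id "accounting-team" and the topic name are made up for the example:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed local broker
        props.put("group.id", "accounting-team");          // hypothetical consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");         // read from the start if no offset is stored

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("bookings"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```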
Broker
Brokers are the systems (in simpler words, servers) responsible for maintaining the published data. A single broker may hold zero or more topics.
Kafka Cluster
A Kafka system with more than one broker is known as a cluster. The number of brokers in a cluster can be increased without any downtime. The cluster is used to persist and replicate data.
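As a small illustration, the Java AdminClient can list the brokers that currently form the cluster (the broker address below is an assumption):

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Print every broker that is currently part of the cluster.
            for (Node node : admin.describeCluster().nodes().get()) {
                System.out.printf("broker id=%d host=%s:%d%n", node.id(), node.host(), node.port());
            }
        }
    }
}
```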
Topic
A topic is like a category to which a producer publishes records. A consumer subscribes to a particular topic to read its messages. Messages published to a topic are retained for a configurable duration, and any topic can be subscribed to by any number of consumers.
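A hedged sketch of creating a topic with the Java AdminClient, choosing the partition count, replication factor, and retention period. The topic name "bookings" and the seven-day retention are assumptions for the example:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // "bookings": 3 partitions, replication factor 2, records retained for 7 days.
            NewTopic topic = new NewTopic("bookings", 3, (short) 2)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG,
                            String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```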
Partition
Kafka topics are divided into a number of partitions. Any record written to a topic goes to a particular partition, and each record is assigned a unique offset that identifies it within that partition. The logic that decides which partition a message goes to is configurable. Partitions allow data to be read and written in parallel, because they can be spread over multiple brokers. Replication is implemented at the partition level; each redundant copy of a topic partition is called a replica. For each partition, one server acts as the leader and the others as followers. The leader handles reads and writes while the followers replicate the data. If the leader fails, one of the followers is elected as the new leader.
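The sketch below, assuming the default partitioner and the hypothetical "bookings" topic from earlier, sends keyed records and prints the partition and offset each one was assigned; records that share a key land in the same partition:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (String user : new String[]{"alice", "bob", "alice"}) {
                // With the default partitioner, both "alice" records go to the same partition.
                RecordMetadata meta =
                        producer.send(new ProducerRecord<>("bookings", user, "clicked")).get();
                System.out.printf("key=%s -> partition=%d offset=%d%n",
                        user, meta.partition(), meta.offset());
            }
        }
    }
}
```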
Consumer Groups
Any consumer can read a topic from a particular offset, from the latest record, or from the very beginning. Consumers can also join a group called a consumer group: a set of consumers that share a subscription to a particular topic. Kafka ensures that each record in the topic is read by only one consumer in a given consumer group.
Consumers pull records from topic partitions, and each consumer in a group is assigned a set of partitions to consume. Kafka can support many consumers with little overhead; by spreading partitions across the consumers in a group, it parallelises reads and supports very high read throughput. The number of consumers that can usefully read in parallel is limited by the number of partitions in the topic. Because Kafka works on a pull model, data is only delivered when the consumer asks for it.
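To illustrate reading "from the start", a consumer can skip group management entirely, assign itself a partition, and seek to the earliest offset. A sketch, again assuming a local broker and the hypothetical "bookings" topic:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromStart {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manually take partition 0 of "bookings" (no consumer group involved)
            // and rewind to the earliest retained offset.
            TopicPartition partition = new TopicPartition("bookings", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seekToBeginning(Collections.singletonList(partition));

            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(2))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```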
Follow our tutorial post for a step-by-step, hands-on example of the Kafka producer and consumer.
Further Reading:
If you liked this article and would like one such blog to land in your inbox every week, consider subscribing to our newsletter: https://skillcaptain.substack.com