What is Apache Kafka?
Apache Kafka is a distributed streaming platform. In practice, this means Kafka has the following capabilities:
- Publish and subscribe to streams of records, much like a message queue (e.g. RabbitMQ)
- Store streams of records for a specified time. Kafka is not a replacement for a database or a logging platform, although it can be one component of a logging platform
- Process streams of records. For example, we may join two sources of related data to produce an output stream of the desired records
When do we use Kafka?
- When we want to build a real-time data pipeline that shares records reliably between applications. For example, an e-commerce company streams booking data so that different teams, such as accounting and data science, can extract meaningful insights or maintain records
- When we want to build a real-time streaming application that transforms records or triggers an action in response to them. For example, to relate customer search behaviour to buying decisions, an e-commerce company would join browsing records with sales records
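The join in the second example can be sketched in a few lines of Python. This is only an illustration over in-memory lists, not Kafka's stream processing API: a real streaming join would run continuously and usually over a time window. The field names `user_id`, `query` and `item` are assumptions made for this sketch.

```python
from collections import defaultdict


def join_streams(browsing, purchases):
    """Attach each user's search queries to that user's purchases.

    Both inputs are lists of dicts with hypothetical field names.
    """
    # Index the browsing records by user so each purchase can look them up.
    searches_by_user = defaultdict(list)
    for record in browsing:
        searches_by_user[record["user_id"]].append(record["query"])

    joined = []
    for purchase in purchases:
        joined.append({
            "user_id": purchase["user_id"],
            "item": purchase["item"],
            # Users with no browsing history get an empty list.
            "searches": searches_by_user.get(purchase["user_id"], []),
        })
    return joined
```

Each output record combines one purchase with the searches made by the same user, which is the kind of record the joined output stream would carry.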
As the name suggests, a producer is an application that emits records.
A consumer is an application that receives records.
Brokers are the systems (or, in simpler words, servers) responsible for storing the published data. A broker may host partitions of zero or more topics.
A set of Kafka brokers working together is known as a cluster. Brokers can be added to a cluster without any downtime. The cluster persists and replicates data.
A topic is like a category to which producers publish records and in which those records are stored. A consumer subscribes to a particular topic to read messages. Messages are published to a particular topic, where they are retained for a configurable duration whether or not they have been consumed. Any number of consumers can subscribe to a topic.
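To make time-based retention concrete, here is a toy model of a topic in Python. It is a sketch under simplifying assumptions, not how Kafka stores data: the class `ToyTopic` and its methods are hypothetical names, timestamps are plain numbers passed in by the caller, and expiry is triggered by hand.

```python
class ToyTopic:
    """In-memory stand-in for a topic with time-based retention."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = []  # list of (timestamp, value) pairs

    def publish(self, value, now):
        self.records.append((now, value))

    def read_all(self):
        # Reading does not remove anything, so any number of
        # consumers can read the same records.
        return [value for _, value in self.records]

    def expire(self, now):
        # Drop records older than the retention period,
        # regardless of whether anyone has read them.
        cutoff = now - self.retention_seconds
        self.records = [(ts, v) for ts, v in self.records if ts >= cutoff]
```

The point of the sketch is that reading and deleting are decoupled: records disappear because they age out, not because a consumer has read them.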
Kafka topics are divided into a number of partitions. Any record written to a topic goes to exactly one partition. Within a partition, each record is assigned and identified by a unique offset. The logic that decides the partition for a message is configurable. Partitioning allows data to be read and written in parallel, because the partitions of a topic are spread over multiple brokers. Replication is implemented at the partition level, and each redundant copy of a topic partition is called a replica. For each partition, one replica acts as the leader and the others act as followers. The leader handles reads and writes while the followers replicate its data. If the leader fails, one of the followers is elected as the new leader.
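The relationship between keys, partitions and offsets can be sketched as follows. This is a toy model, not Kafka's implementation: `ToyPartitionedTopic` is a hypothetical class, and CRC32 is used here only as a convenient deterministic hash (an assumption of this sketch, not the hash Kafka's client uses).

```python
import zlib


class ToyPartitionedTopic:
    """In-memory topic split into a fixed number of partitions."""

    def __init__(self, num_partitions):
        # One append-only list per partition.
        self.partitions = [[] for _ in range(num_partitions)]

    def choose_partition(self, key):
        # Deterministic hash of the key, so the same key
        # always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        partition = self.choose_partition(key)
        log = self.partitions[partition]
        offset = len(log)  # offsets count up from 0 within each partition
        log.append((key, value))
        return partition, offset
```

Because the partition is derived from the key, all records with the same key land in the same partition and receive increasing offsets there, while offsets in different partitions are independent of each other.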
A consumer can read a topic starting from a specific offset, from the beginning, or from the latest record onward. Consumers can join a group called a consumer group. A consumer group is a set of consumers subscribed to a particular topic. Kafka ensures that each record in the topic is delivered to only one of the consumers in a given consumer group.
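The three starting positions can be illustrated with a small helper over a single partition, modelled as a Python list whose indexes play the role of offsets. The function `read_from` is a hypothetical name for this sketch; the labels "earliest" and "latest" are borrowed from the values of Kafka's `auto.offset.reset` consumer setting.

```python
def read_from(log, start):
    """Return (offset, record) pairs from one partition's log.

    start is "earliest", "latest", or an integer offset.
    """
    if start == "earliest":
        position = 0
    elif start == "latest":
        # Only records appended after this point would be seen.
        position = len(log)
    else:
        position = start
    return [(offset, log[offset]) for offset in range(position, len(log))]
```

Starting from "latest" on an existing log returns nothing, since the consumer only sees records that arrive afterwards.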
Consumers pull records from topic partitions. Each consumer in a group is assigned a set of partitions to consume. Kafka can support multiple consumers with little overhead. By using consumers in a group, Kafka parallelises reads and thus supports very high read throughput. The number of consumers in a group that can usefully read a topic is limited by the number of partitions in that topic; any additional consumers sit idle. Because Kafka works on a pull model, it sends data to a consumer only when the consumer requests it.
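Why the partition count caps useful parallelism can be seen in a minimal sketch of partition assignment. This uses a simple round-robin scheme chosen for illustration; it is not a reproduction of any of Kafka's built-in assignors, and `assign_partitions` is a hypothetical function.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.

    Each partition goes to exactly one consumer in the group.
    """
    if not consumers:
        return {}
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment
```

With three partitions and two consumers, one consumer handles two partitions and the other handles one. With two partitions and three consumers, the third consumer receives an empty list, which is why adding consumers beyond the number of partitions does not increase read throughput.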