Apache Kafka

In this tutorial, we will learn about Apache Kafka, a highly scalable distributed messaging system. Kafka is useful when large amounts of data must be moved from one system to another in a way that is scalable, fast, and reliable, and it can carry data between many different sources and destinations. Kafka was developed at LinkedIn and open-sourced in early 2011.



Kafka Terminologies

Producer

A producer is an application that sends messages (data) to Kafka. To Kafka, every message is simply an array of bytes, whatever it means to the application.
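
Because Kafka only sees bytes, the producing application has to serialize its data before sending, and the consuming application has to reverse that step. The sketch below assumes JSON encoded as UTF-8, which is only one possible choice; the helper names `to_kafka_bytes` and `from_kafka_bytes` are hypothetical and not part of any Kafka API.

```python
import json


def to_kafka_bytes(event: dict) -> bytes:
    """Serialize a dict to UTF-8 JSON bytes (hypothetical helper, not a Kafka API)."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def from_kafka_bytes(payload: bytes) -> dict:
    """Reverse of to_kafka_bytes: decode UTF-8 JSON bytes back into a dict."""
    return json.loads(payload.decode("utf-8"))


event = {"user": "alice", "action": "login"}
payload = to_kafka_bytes(event)          # what a producer would hand to Kafka
restored = from_kafka_bytes(payload)     # what a consumer would do after reading

print(type(payload).__name__, payload)
print(restored == event)
```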

Consumer

A consumer is an application that reads data from Kafka.

Broker

A broker is a Kafka server, that is, a software service running on a virtual or physical machine. Producers and consumers never communicate directly with each other; they always communicate through a broker.

Topic

A topic is a logical entity; in simple words, it is a unique name for a Kafka stream. Producers write messages to a topic, and consumers read messages from it.

Partitions

Topics are divided into a number of partitions. Partitions allow you to parallelize a topic by splitting the data in a particular topic across multiple brokers — each partition can be placed on a separate machine to allow for multiple consumers to read from a topic in parallel.
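
To make the idea of splitting a topic concrete, here is a minimal sketch of key-based partition assignment. It assumes three partitions and uses CRC32 as a stand-in hash; Kafka's own partitioner uses a different hash function, and `choose_partition` is a hypothetical helper, not a Kafka API. The point it illustrates is that messages with the same key always land in the same partition.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index (hypothetical helper, CRC32 as stand-in hash)."""
    return zlib.crc32(key) % num_partitions


NUM_PARTITIONS = 3  # assumption for this sketch
# One list per partition; in a real cluster these could live on different brokers.
partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-3", "login"),
    ("user-1", "logout"),
]
for key, value in events:
    p = choose_partition(key.encode("utf-8"), NUM_PARTITIONS)
    partitions[p].append((key, value))

for p, messages in partitions.items():
    print(p, messages)
```

Since each partition is an independent list, separate consumers could read partition 0, 1, and 2 at the same time without coordinating on every message.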

Offsets

An offset is a sequential ID assigned to each message as it arrives in a partition. Once assigned, an offset never changes. Offsets are local to a partition, so a message is identified by its topic, partition, and offset together.
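
A partition can be pictured as an append-only list in which the offset is just the position of a message. The class below is a simplified in-memory model under that assumption; `PartitionLog` is a hypothetical name and not how Kafka stores data on disk.

```python
class PartitionLog:
    """Toy model of one partition: an append-only list of messages."""

    def __init__(self) -> None:
        self._messages = []

    def append(self, message: bytes) -> int:
        """Store a message and return the offset assigned to it."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset: int) -> bytes:
        """Return the message stored at the given offset."""
        return self._messages[offset]


# Two partitions of the same topic number their messages independently.
p0, p1 = PartitionLog(), PartitionLog()
offsets_p0 = [p0.append(m) for m in (b"a", b"b", b"c")]
offsets_p1 = [p1.append(m) for m in (b"x", b"y")]

print(offsets_p0, offsets_p1)
print(p0.read(1), p1.read(1))
```

Offset 1 exists in both partitions but refers to different messages, which is why an offset only makes sense together with its partition.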

ZooKeeper

Apache ZooKeeper is a distributed, open-source configuration and synchronization service that also provides a naming registry for distributed applications. Kafka uses ZooKeeper to store a lot of shared information about Kafka brokers and consumers.

All Kafka messages are organized into topics. Messages are sent to a specific topic and read from a specific topic. Producers push messages into a Kafka topic, while consumers pull messages from it. Kafka runs as a cluster, and each node in the cluster is called a Kafka broker.
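
The push and pull model can be sketched with a small in-memory stand-in for a broker. Everything here is an assumption made for illustration: `InMemoryBroker` and `Consumer` are hypothetical classes, each topic has a single partition, and no networking is involved. The sketch shows that each consumer tracks its own position and asks the broker for messages, rather than the broker pushing them out.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy broker: one append-only list per topic (single partition assumed)."""

    def __init__(self) -> None:
        self._topics = defaultdict(list)

    def push(self, topic: str, message: str) -> None:
        """Called by producers to append a message to a topic."""
        self._topics[topic].append(message)

    def pull(self, topic: str, offset: int, max_messages: int = 10) -> list:
        """Called by consumers to fetch messages starting at an offset."""
        return self._topics[topic][offset:offset + max_messages]


class Consumer:
    """Toy consumer that remembers how far it has read."""

    def __init__(self, broker: InMemoryBroker, topic: str) -> None:
        self.broker = broker
        self.topic = topic
        self.position = 0  # next offset to read

    def poll(self) -> list:
        batch = self.broker.pull(self.topic, self.position)
        self.position += len(batch)
        return batch


broker = InMemoryBroker()
for text in ("event1", "event2", "event3"):
    broker.push("quickstart-events", text)  # the producer side

first = Consumer(broker, "quickstart-events")
second = Consumer(broker, "quickstart-events")
first_batch = first.poll()
second_batch = second.poll()   # reading does not remove messages from the broker
print(first_batch, second_batch, first.poll())
```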

Broker-Partition


Kafka retains all published messages, whether or not they have been consumed, until a configurable retention period expires. The default retention time is 168 hours (7 days), and it can be overridden on a per-topic basis.
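
Time-based retention can be sketched as a filter over timestamped messages. This is a simplification for illustration: `apply_retention` is a hypothetical helper, the timestamps are made up, and a real broker removes whole log segments rather than individual messages.

```python
from datetime import datetime, timedelta

DEFAULT_RETENTION = timedelta(hours=168)  # the 7-day default mentioned above


def apply_retention(log, now, retention=DEFAULT_RETENTION):
    """Keep only (timestamp, message) pairs that are not older than the retention period."""
    cutoff = now - retention
    return [(ts, msg) for ts, msg in log if ts >= cutoff]


now = datetime(2021, 1, 10, 12, 0, 0)  # fixed time so the example is repeatable
log = [
    (now - timedelta(days=8), "old event"),      # past the 7-day window
    (now - timedelta(days=6), "recent event"),   # still inside the window
    (now, "new event"),
]

kept = apply_retention(log, now)
print([msg for _, msg in kept])

# A shorter, per-topic style override: keep only the last 24 hours.
kept_short = apply_retention(log, now, retention=timedelta(hours=24))
print([msg for _, msg in kept_short])
```

Whether a message has been consumed plays no part in the filter; only its age matters.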

Let's play with Kafka

NOTE: Your local environment must have Java 8+ installed.

STEP 1: GET KAFKA

Download a Kafka release (this tutorial uses Kafka 2.6.0 built for Scala 2.13) and extract it:
$ tar -xzf kafka_2.13-2.6.0.tgz
$ cd kafka_2.13-2.6.0

STEP 2: START THE KAFKA ENVIRONMENT

Terminal-1: Start the ZooKeeper service
$ bin/zookeeper-server-start.sh config/zookeeper.properties
Terminal-2: Start the Kafka broker service
$ bin/kafka-server-start.sh config/server.properties

STEP 3: CREATE A TOPIC TO STORE YOUR EVENTS

Terminal-3: Create the topic
$ bin/kafka-topics.sh --create --topic quickstart-events --bootstrap-server localhost:9092

STEP 4: DESCRIBE TOPIC

Terminal-3: Describe the topic to see its partition count, replication factor, and partition leader
$ bin/kafka-topics.sh --describe --topic quickstart-events --bootstrap-server localhost:9092
Topic:quickstart-events  PartitionCount:1    ReplicationFactor:1 Configs:
    Topic: quickstart-events Partition: 0    Leader: 0   Replicas: 0 Isr: 0

STEP 5: WRITE SOME EVENTS INTO THE TOPIC

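Terminal-3: Run the console producer. Each line you type is written to the topic as a separate event. Stop the producer with Ctrl-C at any time.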
$ bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092
This is test event1.
test event2.
My test event3.

STEP 6: READ THE EVENTS

Terminal-4: Run the console consumer to read the events from the beginning of the topic
$ bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --bootstrap-server localhost:9092
This is test event1.
test event2.
My test event3.

TERMINATE KAFKA

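Stop the producer and consumer clients, then the Kafka broker, and finally the ZooKeeper service, each with Ctrl-C. To also delete the data of your local Kafka environment, including any events you created along the way, run:
$ rm -rf /tmp/kafka-logs /tmp/zookeeper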

Kafka Use Cases

  • Kafka Messaging

  • Website Activity Tracking

  • Kafka Log Aggregation

  • Stream Processing

  • Kafka Event Sourcing

Companies using Kafka

LinkedIn, Netflix, Twitter, Yahoo, Uber, Coursera, Microsoft, Mozilla, Bing, Oracle, Goldman Sachs, and many more.

LinkedIn - Kafka

  • High Volume
    - 1.4 trillion msg/day
    - 175 TB/day
    - 433 million users

  • Velocity
    - peak 13 million msg/second

  • Variety
    - Multiple RDBMS
    - Multiple NoSQL DB
    - Hadoop, Spark, etc.

Advantages of Kafka

  • High-throughput

  • Low Latency

  • Fault-Tolerant

  • Durability

  • Scalability

  • High Concurrency

  • Real-time processing
