Kafka

Comparison with other solutions

  • vs RabbitMQ: Kafka achieves far higher throughput (it appends to a unidirectional log and avoids per-message acknowledgment handshakes)

  • vs Redis: Redis Pub/Sub does not persist messages, so consumers cannot replay a history of past messages

  • vs Amazon SQS: throughput is still worse, and it is not cheap at scale

  • vs Amazon SNS: very similar, but more expensive

  • vs Azure Stream Analytics: a stream-processing service, not really meant for pub/sub in general

  • vs Google Pub/Sub: same as Amazon SNS, but more expensive

How it is distributed

  • Confluent Cloud

  • Cloudera has its own version

  • Strimzi offers Helm charts to deploy Kafka to K8s

Architecture

Zookeeper

  • Zookeeper manages brokers (keeps a list of them)

  • Zookeeper helps in performing leader election for partitions

  • Zookeeper sends notifications to Kafka in case of changes (new topic, broker dies or comes up)

  • Kafka 2.x can't work without Zookeeper

  • Kafka 3.x can work without Zookeeper (KIP-500), using Kafka Raft (KRaft) instead

  • Kafka 4.x will not have Zookeeper

  • Zookeeper by design operates with an odd number of servers (1, 3, 5, 7)

Topic vs Queue

A topic has a retention configuration:

  • Days: keep messages for up to N days, then delete

  • Bytes: keep messages until a partition exceeds X bytes, then delete (retention.bytes applies per partition)

  • None: store all messages from the beginning of time
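
These retention modes map to the topic configs retention.ms and retention.bytes. A minimal sketch using the Java AdminClient, assuming a hypothetical topic named "events" and made-up limits:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class CreateTopicWithRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // hypothetical topic: 3 partitions, replication factor 3
                NewTopic topic = new NewTopic("events", 3, (short) 3).configs(Map.of(
                        "retention.ms", "604800000",       // delete messages older than 7 days
                        "retention.bytes", "1073741824")); // ...or once a partition exceeds 1 GiB
                // retention.ms = -1 would keep all messages forever ("None" above)
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }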

Kafka Broker Discovery

Each Kafka broker is also called a "bootstrap server". Each broker knows about all brokers, topics, and partitions (metadata), so a client only needs to reach one of them to discover the whole cluster.
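
In practice a client therefore lists just a couple of brokers and learns about the rest from the metadata it fetches. A sketch, with made-up hostnames:

    import java.util.Properties;

    Properties props = new Properties();
    // Listing 2 of the brokers is enough: the client asks whichever answers
    // first for cluster metadata and discovers all other brokers from it.
    props.put("bootstrap.servers", "broker1:9092,broker2:9092");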

Partitioning

Messages are ordered only within a partition, not across partitions.

Each message in a partition has an id (offset) which is always incremental.

Once data is written to a partition, it cannot be changed (immutability).

The partitioner uses the key to distribute the messages to the correct partition.

Consumers read from partitions and have an offset per partition.

More partitions enable more consumers => scalability
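
Because the key determines the partition, messages that share a key keep their relative order. A minimal producer sketch, assuming a hypothetical "orders" topic keyed by user id:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import java.util.Properties;

    public class KeyedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // same key => hashed to the same partition => ordered for that key
                producer.send(new ProducerRecord<>("orders", "user-42", "order created"));
                producer.send(new ProducerRecord<>("orders", "user-42", "order paid"));
            }
        }
    }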

Example data distribution

E.g. Topic-A has 3 partitions, Topic-B has 2 partitions.

Replication

Topics in PROD should have a replication factor > 1 (usually 3). E.g. Topic-A has 2 partitions and replication factor 2.

Leader for a Partition

  • At any time only ONE broker can be a leader for a given partition

  • Producers can only send data to the broker that is a leader of a partition

  • Consumers read only from the leader (since Kafka 2.4 they can also fetch from the closest replica, KIP-392)

  • the other brokers will replicate the data

  • => each partition has one leader and multiple ISRs (in-sync replicas)
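
To inspect which broker leads each partition and which replicas are in sync, one option is the AdminClient's describeTopics call. A sketch, assuming kafka-clients 3.1+ and a hypothetical "orders" topic:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;
    import java.util.List;
    import java.util.Properties;

    public class ShowLeaders {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc = admin.describeTopics(List.of("orders"))
                        .allTopicNames().get().get("orders");
                for (TopicPartitionInfo p : desc.partitions()) {
                    // exactly one leader per partition, plus its in-sync replicas
                    System.out.printf("partition %d: leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr());
                }
            }
        }
    }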

Producer

Producer delivery semantics (acks)

Producers can choose to receive acknowledgment of data writes:

  • at most once, acks=0: Producer won't wait for the ack (possible data loss)

  • at least once, acks=1: Producer will wait for the leader ack (limited data loss)

    • If no acknowledgment is received for a message, the producer retries it based on the retry configuration. In older clients the retries property defaults to 0, so make sure it is set to the desired number or Integer.MAX_VALUE.

Idempotent producer

As the producer writes messages to the broker in batches, a failed write followed by a retry can cause messages within the batch to be written to Kafka more than once.

To avoid this duplication, Kafka introduced the idempotent producer.

Essentially, in order to get at-least-once delivery without duplicates in Kafka, we'll need to:

  • set the "acks" property to "all" on the producer side (idempotence requires it)

  • set the "enable.auto.commit" property to "false" on the consumer side

  • set the "enable.idempotence" property to "true"

  • the client then attaches a sequence number and producer id to each message automatically

The Kafka broker can then detect duplicates in a topic using the sequence number and producer id.

  • acks=all: Leader + replicas acks (no data loss)

    • an important related property is min.insync.replicas: if replication factor = 3, set it to 2 so that Kafka can tolerate the failure of 1 broker

    • combined with the idempotent producer, this always delivers all messages without duplication
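
Putting the producer-side settings together, a hedged sketch of a "safe" producer configuration (broker address made up; min.insync.replicas is set on the topic or broker, not on the client):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.common.serialization.StringSerializer;
    import java.util.Properties;

    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());
    props.put("acks", "all");                // wait for leader + in-sync replicas
    props.put("enable.idempotence", "true"); // broker dedups via producer id + sequence number
    props.put("retries", Integer.toString(Integer.MAX_VALUE)); // retry transient failures
    KafkaProducer<String, String> producer = new KafkaProducer<>(props);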

Consumer

  • Consumers act as a group via the consumer group id

  • Each consumer in the group gets assigned partitions, and a partition is never shared by two members of the same group

    • if the number of consumers > the number of partitions => some consumers will be idle

    • several applications can read from the same topic by each using its own consumer group id, so every group receives all the messages (see the sketch after this list)

  • A rebalance is triggered when a member leaves the group or has not sent a heartbeat for too long

  • A rebalance is a stop-the-world event that ensures all partitions are attended by some consumer in the group
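
A minimal consumer sketch; the group id "billing-app" and the "orders" topic are made up, and a second application would simply use a different group.id to get its own copy of every message:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class BillingConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("group.id", "billing-app"); // members of one group share the partitions
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }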

Kafka Connect

  • Source Connectors - transfer data from an external source into Kafka

  • Sink Connectors - transfer data from Kafka to an external sink

  • Connect workers run either standalone (a single worker) or distributed (a cluster of workers)

Consumer delivery semantics

  • At least once (usually preferred)

    • offsets are committed after the message is processed

    • if processing goes wrong, the message will be read again

      • there can be duplicate processing of a message => make processing idempotent
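
A hedged sketch of this pattern, reusing the consumer from the earlier sketch: disable auto-commit and commit offsets only after processing succeeds (process() stands for hypothetical business logic):

    // consumer setup as in the earlier sketch, plus:
    props.put("enable.auto.commit", "false"); // we commit offsets manually

    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            process(record); // hypothetical business logic; may throw
        }
        consumer.commitSync(); // committed only after processing => at-least-once
    }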

Idempotent consumers

In order to avoid processing the same event multiple times, we can use idempotent consumers.

Essentially an idempotent consumer can consume a message multiple times but only processes it once.

The combination of the following approaches enables idempotent consumers in at-least-once delivery:

  • The producer assigns a unique messageId to each message.

  • The consumer maintains a record of all processed messages in a database.

  • When a new message arrives, the consumer checks it against the existing messageIds in the persistent storage table.

  • In case there's a match, the consumer updates the offset without re-consuming, sends the acknowledgment back, and effectively marks the message as consumed.

  • When the event is not already present, a database transaction is started and the new messageId is inserted. Next, the message is processed according to whatever business logic is required. On completion of message processing, the transaction is finally committed.
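
A minimal sketch of the dedup check, assuming the messageId travels in the record key and using an in-memory set as a stand-in for the persistent table (a real implementation would wrap the insert and the business logic in one database transaction, as described above):

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import java.util.HashSet;
    import java.util.Set;

    public class DedupHandler {
        // stand-in for a DB table of already-processed messageIds
        private final Set<String> processedIds = new HashSet<>();

        public void handle(ConsumerRecord<String, String> record) {
            String messageId = record.key(); // assumption: producer put a unique id in the key
            if (processedIds.contains(messageId)) {
                return; // duplicate: skip processing; the offset still advances
            }
            // real system: BEGIN transaction, INSERT messageId, run business logic, COMMIT
            processedIds.add(messageId);
            // ... business logic here ...
        }
    }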

  • At most once

    • offsets are committed as soon as messages are received

    • if processing fails, those messages are lost (they won't be read again)

    • To enable at-most-once semantics in Kafka, set "enable.auto.commit" to "true" on the consumer.
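
A hedged snippet of the relevant consumer settings:

    // at-most-once style: offsets are committed in the background,
    // possibly before processing of the polled records has finished
    props.put("enable.auto.commit", "true");
    props.put("auto.commit.interval.ms", "5000");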

  • Exactly once

    • achieved with Kafka transactions: a transactional producer plus consumers reading with isolation.level=read_committed, mainly for Kafka-to-Kafka workflows (e.g. Kafka Streams)

Local deploy

A local test cluster: 3 Zookeeper nodes and 3 brokers (matching the odd-number Zookeeper rule above).
