Consumer lag

Kafka consumer lag is, for each partition, the difference between the last offset stored by the broker (the log end offset) and the last offset committed by the consumer group.

For resilience, each Kafka consumer keeps track of the last position it has read in a partition, so that after a crash it can resume from where it left off. This position is called the consumer offset, and it is stored in an internal Kafka topic.
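The arithmetic behind lag is simple enough to sketch in a few lines. The offsets below are illustrative values, not taken from a real cluster:

```python
# Consumer lag per partition: log end offset minus last committed offset.
def consumer_lag(log_end_offset: int, committed_offset: int) -> int:
    return log_end_offset - committed_offset

# partition -> (log end offset, committed offset); values are made up.
partitions = {0: (17, 15), 1: (15, 14)}
total_lag = sum(consumer_lag(end, committed) for end, committed in partitions.values())
print(total_lag)  # 3
```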

Reasons

Incoming traffic increase

Data not equally balanced in partitions (some partitions are heavier)

Slow processing jobs

Errors in code and pipeline components

How to check

$KAFKA_HOME/bin/kafka-consumer-groups.sh --bootstrap-server <broker_host:port> --describe --group <group_name>
GROUP          TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG    OWNER
ub-kf          test-topic      0          15              17              2      ub-kf-1/127.0.0.1
ub-kf          test-topic      1          14              15              1      ub-kf-2/127.0.0.1

It lists each partition’s ‘current offset’, ‘log end offset’, and lag. The ‘current offset’ of a partition is the offset of the last message committed by the consumer handling that partition. The ‘log end offset’ is the highest offset in that partition, i.e. the offset of the last message written to it. The difference between these two is the consumer lag.
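For quick scripting, the describe output can be summed directly. The sketch below assumes the whitespace-separated column layout shown above (LAG is the sixth column); real output may include extra columns depending on the Kafka version:

```python
# Sum the LAG column from `kafka-consumer-groups.sh --describe` output.
sample = """\
GROUP          TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG    OWNER
ub-kf          test-topic      0          15              17              2      ub-kf-1/127.0.0.1
ub-kf          test-topic      1          14              15              1      ub-kf-2/127.0.0.1
"""

def group_lag(describe_output: str) -> int:
    lag = 0
    for line in describe_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 6 and fields[5].isdigit():
            lag += int(fields[5])
    return lag

print(group_lag(sample))  # 3
```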

How to monitor

Burrow, an open-source tool originally developed at LinkedIn, is commonly used to monitor consumer lag. It continuously evaluates the lag of every consumer group and exposes each group’s status over an HTTP API.
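A monitoring script would typically fetch a group’s status from Burrow’s HTTP API and alert on it. The JSON below is an illustrative response shape (the field names are assumptions for this sketch, not verbatim Burrow output), parsed with only the standard library:

```python
import json

# Illustrative (assumed) shape of a Burrow lag-evaluation response.
sample_response = json.dumps({
    "status": {
        "status": "WARN",
        "totallag": 3,
        "partitions": [
            {"topic": "test-topic", "partition": 0, "current_lag": 2},
            {"topic": "test-topic", "partition": 1, "current_lag": 1},
        ],
    }
})

status = json.loads(sample_response)["status"]
# Find the partition contributing the most lag.
worst = max(status["partitions"], key=lambda p: p["current_lag"])
print(status["status"], status["totallag"], worst["partition"])  # WARN 3 0
```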

Strategies for addressing Kafka consumer lag

Optimizing consumer processing logic

The most straightforward way to reduce consumer lag is to identify the bottleneck in the processing logic and make it more efficient. Of course, this is easier said than done. A good starting point is to find any synchronous blocking operation in the code and make it asynchronous, if the use case allows it.
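As a minimal sketch of that idea, a blocking per-record call (simulated here with a sleep; `handle` is a hypothetical handler, not a Kafka API) can be overlapped with a thread pool instead of run sequentially:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-record handler: a blocking I/O call, simulated with sleep.
def handle(record: str) -> str:
    time.sleep(0.05)  # stand-in for a slow database/HTTP call
    return record.upper()

records = ["a", "b", "c", "d"]

# Sequentially this batch would take ~0.2 s; a pool overlaps the waits.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle, records))  # preserves input order
elapsed = time.monotonic() - start

print(results, elapsed < 0.2)
```

In a real consumer loop, the records would come from `poll()` and the offsets would be committed only after the batch of futures completes.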

Modifying partition count

Partition assignment is a crucial factor that affects consumer lag. A situation where only one partition’s lag is higher than others points to a problem in the data distribution. This can be solved using a custom partition key or adding more partitions. The configuration of partitions and consumers will depend on the use case.
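To illustrate how a partition key drives distribution, the sketch below hashes keys onto partitions. Kafka’s default partitioner uses murmur2 on the key bytes; MD5 here is only a stand-in so the example runs with the standard library:

```python
import hashlib

# Stand-in for Kafka's key-based partitioning (Kafka itself uses murmur2).
def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# With well-spread keys, messages land roughly evenly across partitions.
counts: dict[int, int] = {}
for i in range(1000):
    p = partition_for(f"user-{i}", 4)
    counts[p] = counts.get(p, 0) + 1

print(counts)  # roughly even spread across 4 partitions
```

A skewed key (e.g. one dominant customer ID) would concentrate the counts on one partition, which is exactly the lopsided-lag symptom described above.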

Using an application-specific queue to limit rates

Sometimes the processing logic is inherently time-consuming and cannot be optimized further. In these cases, consider introducing an application-specific queue: the consumer only adds messages to this queue, and waits until queued messages are processed before accepting new ones from Kafka. The second queue also makes it possible to divide the messages from a single partition among multiple processing instances.
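This pattern can be sketched with a bounded in-process queue between the consumer loop and worker threads; when the queue is full, `put()` blocks, so the consumer stops pulling until the workers catch up. The loop below stands in for a real `poll()` loop:

```python
import queue
import threading

# Bounded queue between the (simulated) consumer loop and worker threads.
work: queue.Queue = queue.Queue(maxsize=100)
processed = []
lock = threading.Lock()

def worker() -> None:
    while True:
        msg = work.get()
        if msg is None:  # sentinel: shut down this worker
            break
        with lock:
            processed.append(msg)

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

# Stand-in for the consumer loop: put() blocks once 100 items are queued,
# which is what throttles consumption from Kafka.
for i in range(500):
    work.put(f"msg-{i}")

for _ in workers:
    work.put(None)
for w in workers:
    w.join()

print(len(processed))  # 500
```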

Tuning consumer configuration parameters

Developers can tune various consumer parameters to adjust how often consumers fetch messages from partitions and how much data each fetch returns. The parameter ‘fetch.max.wait.ms’ controls how long the broker waits to accumulate ‘fetch.min.bytes’ of data before answering a fetch request. The ‘fetch.max.bytes’ and ‘fetch.min.bytes’ parameters set the maximum and minimum amount of data returned to the consumer in one fetch.
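As a sketch, these settings might look as follows; they are shown as a plain dict for illustration, and the values are example choices rather than recommendations:

```python
# Fetch-related consumer settings (example values, not recommendations).
fetch_config = {
    "fetch.min.bytes": 1024,      # broker waits until at least 1 KiB is ready
    "fetch.max.bytes": 52428800,  # cap on data returned per fetch (50 MiB)
    "fetch.max.wait.ms": 500,     # max wait to satisfy fetch.min.bytes
}
print(fetch_config)
```

Raising ‘fetch.min.bytes’ trades latency for throughput: each fetch returns more data, at the cost of waiting up to ‘fetch.max.wait.ms’ for it.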

Another critical element of reducing consumer lag is minimizing rebalance operations. Each rebalance blocks the group’s consumers for some time and can increase the consumer lag. The parameters ‘max.poll.interval.ms’, ‘session.timeout.ms’, and ‘heartbeat.interval.ms’ control when the broker decides that a consumer has been inactive for too long and triggers a rebalance; tuning these values can help reduce rebalance operations.
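A common rule of thumb is to keep ‘heartbeat.interval.ms’ at no more than one third of ‘session.timeout.ms’, so several heartbeats can be missed before the broker evicts the consumer. A sketch with example values (not recommendations):

```python
# Rebalance-related consumer settings (example values, not recommendations).
rebalance_config = {
    "session.timeout.ms": 45000,     # inactivity limit before eviction
    "heartbeat.interval.ms": 15000,  # how often heartbeats are sent
    "max.poll.interval.ms": 300000,  # max allowed gap between poll() calls
}

# Rule of thumb: heartbeat interval <= one third of the session timeout.
assert rebalance_config["heartbeat.interval.ms"] * 3 <= rebalance_config["session.timeout.ms"]
```

If processing a batch can take longer than ‘max.poll.interval.ms’, the broker will assume the consumer is dead and rebalance, so slow handlers may need this value raised.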
