Kafka Consumer Lag: Causes, Debugging, and Fixes
Consumer lag is the most common Kafka operational issue. Learn what causes it, how to measure it, and practical strategies to bring it under control.
Consumer lag is the single most common operational headache in Kafka. It shows up in monitoring dashboards, triggers pages at 3 AM, and sparks debates about whether to add partitions or rewrite the consumer. Before throwing hardware at the problem, it pays to understand what lag actually is, why it happens, and which fixes work for which root causes.
What Consumer Lag Actually Means
Every Kafka partition maintains a sequence of messages, each identified by an offset (a monotonically increasing integer). When a producer writes a message, the partition’s log-end offset advances. When a consumer reads and commits a message, its committed offset advances.
Consumer lag is the gap between those two numbers:
Lag = Log-End Offset - Consumer Committed Offset
Here is a simplified view of a single partition:
Partition 0:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]
                         ^                         ^
                Consumer Offset (6)      Log-End Offset (12)

Lag = 12 - 6 = 6 messages
In a topic with multiple partitions, total consumer group lag is the sum of lag across all assigned partitions. A consumer group with 8 partitions each lagging by 500 messages has a total lag of 4,000.
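This arithmetic is simple enough to sketch directly. The class and method names below are illustrative; in a real system the two offset maps would come from Kafka's AdminClient (listOffsets for log-end offsets, listConsumerGroupOffsets for committed offsets):

```java
import java.util.Map;

// Sketch: compute per-partition and total lag from two offset maps,
// keyed by partition number. Illustrative only; fetching the offsets
// themselves requires the AdminClient.
public class LagCalculator {
    public static long partitionLag(long logEndOffset, long committedOffset) {
        // Clamp at zero in case a stale committed offset was read.
        return Math.max(0, logEndOffset - committedOffset);
    }

    public static long totalLag(Map<Integer, Long> logEnd, Map<Integer, Long> committed) {
        long total = 0;
        for (Map.Entry<Integer, Long> e : logEnd.entrySet()) {
            long c = committed.getOrDefault(e.getKey(), 0L);
            total += partitionLag(e.getValue(), c);
        }
        return total;
    }
}
```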
Some lag is always present. Producers and consumers operate asynchronously, so there is always a small gap between when a message is written and when it gets processed. The question is whether that gap is stable or growing. Stable lag of a few hundred messages is normal in most systems. Lag that climbs continuously means consumers are falling behind, and the backlog will keep growing until something changes.
Common Causes of Consumer Lag
Slow Message Processing
This is the most frequent cause. If your consumer takes 50ms to process each message and you receive 1,000 messages per second on a single partition, one consumer needs 50 seconds to work through what arrived in one second. The math doesn’t work.
Typical slow-processing culprits include:
- Synchronous database writes per message instead of batching
- HTTP calls to external APIs inside the processing loop
- Heavy transformations (parsing large JSON, running regex, computing aggregations)
- Locking or contention in multi-threaded consumers
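The back-of-the-envelope check above is worth automating when you do capacity planning. A minimal sketch (the names are made up for illustration), using the article's figures of 50 ms per message at 1,000 msg/s:

```java
public class ThroughputCheck {
    // Messages one consumer can process per second given per-message latency.
    public static double maxThroughputPerConsumer(double perMessageMillis) {
        return 1000.0 / perMessageMillis;
    }

    // Minimum consumers needed to keep up with an arrival rate. Deliberately
    // rough: ignores batching, fetch overhead, and downstream variance.
    public static int minConsumers(double arrivalRatePerSec, double perMessageMillis) {
        return (int) Math.ceil(arrivalRatePerSec / maxThroughputPerConsumer(perMessageMillis));
    }
}
```

At 50 ms per message, one consumer tops out at 20 msg/s, so 1,000 msg/s would need 50 consumers' worth of throughput, which is only reachable if you have at least that many partitions.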
Partition Skew (Hot Partitions)
If your producer uses a message key for partitioning, and certain keys appear far more often than others, some partitions will receive disproportionate traffic. One partition might get 80% of the messages while the others sit nearly idle.
The consumer assigned to the hot partition falls behind while the other consumers have nothing to do. Total lag concentrates in one or two partitions rather than being spread evenly.
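To see why skew emerges, here is a toy model of key-based partitioning. Kafka's default partitioner actually hashes the key bytes with murmur2; String.hashCode is used below purely to demonstrate the effect:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SkewDemo {
    // Illustrative stand-in for hash-based partitioning (real Kafka uses
    // murmur2 over the serialized key, not hashCode).
    public static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Count how many messages land on each partition for a list of keys.
    // A heavily repeated key sends all of its messages to one partition.
    public static Map<Integer, Integer> distribution(List<String> keys, int numPartitions) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String key : keys) {
            counts.merge(partitionFor(key, numPartitions), 1, Integer::sum);
        }
        return counts;
    }
}
```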
Consumer Group Rebalancing
Every time a consumer joins or leaves a group, Kafka triggers a rebalance. During a rebalance, all consumers in the group stop processing while partition assignments are redistributed. In older protocol versions, this stop-the-world pause can last seconds to minutes depending on group size.
Frequent rebalancing is often caused by:
- Consumers missing heartbeats within session.timeout.ms (often caused by long GC pauses)
- Consumers exceeding max.poll.interval.ms between poll() calls due to slow processing
- Unstable deployments where consumers restart repeatedly
- Network blips that make the group coordinator think a consumer died
Each rebalance introduces a processing gap where lag grows.
Garbage Collection Pauses
JVM-based consumers (which covers most Kafka client libraries) are subject to GC pauses. A long GC pause can cause a consumer to miss its heartbeat deadline, which triggers a rebalance. Even without triggering a rebalance, a 2-second GC pause means 2 seconds of zero processing, during which lag accumulates.
This is especially common with consumers that hold large in-memory state, use default JVM heap settings, or process large messages that create heavy object churn.
Network and Infrastructure Bottlenecks
Consumers fetch messages over the network. If the network between your consumers and brokers is saturated, or if the brokers themselves are under disk I/O pressure, fetch requests will be slow. You will see high fetch latency and low throughput even though the consumer processing logic is fast.
Cross-region consumption (consumers in us-east-1 reading from brokers in eu-west-1, for example) adds latency to every fetch cycle.
Deserialization Overhead
If your messages use a schema (Avro, Protobuf, JSON Schema) and the consumer must deserialize each one, the cost of deserialization can become significant at high throughput. Schema registry lookups, especially uncached ones, add latency per message.
Too Few Consumers for the Partition Count
Kafka’s parallelism model is partition-based. Within a consumer group, each partition is assigned to exactly one consumer. If you have 20 partitions and only 2 consumers, each consumer handles 10 partitions serially. Adding more consumers (up to the partition count) distributes the load.
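The distribution can be sketched with a simple round-robin assignment. Real assignors (range, sticky, cooperative-sticky) differ in detail, but the one-consumer-per-partition constraint is the same:

```java
import java.util.ArrayList;
import java.util.List;

public class AssignmentSketch {
    // Round-robin sketch of partition-to-consumer assignment: each partition
    // goes to exactly one consumer; consumers beyond the partition count
    // end up with an empty assignment and sit idle.
    public static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        List<List<Integer>> assignments = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) {
            assignments.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            assignments.get(p % numConsumers).add(p);
        }
        return assignments;
    }
}
```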
Measuring and Monitoring Lag
The CLI: kafka-consumer-groups.sh
The fastest way to check lag is the built-in command-line tool:
kafka-consumer-groups.sh \
--bootstrap-server localhost:9092 \
--describe \
--group my-consumer-group
Output looks like this:
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
my-consumer-group orders 0 145230 145235 5 consumer-1-abc-xyz /10.0.1.12 consumer-1
my-consumer-group orders 1 298100 312400 14300 consumer-2-def-uvw /10.0.1.13 consumer-2
my-consumer-group orders 2 201800 201805 5 consumer-3-ghi-rst /10.0.1.14 consumer-3
my-consumer-group orders 3 176500 176504 4 consumer-1-abc-xyz /10.0.1.12 consumer-1
In this example, partition 1 has a lag of 14,300 while the others are nearly caught up. That points to either a hot partition or a slow consumer on that assignment.
To list all consumer groups:
kafka-consumer-groups.sh \
--bootstrap-server localhost:9092 \
--list
To see the overall state of a group:
kafka-consumer-groups.sh \
--bootstrap-server localhost:9092 \
--describe \
--group my-consumer-group \
--state
JMX Metrics
Kafka consumers expose several useful metrics via JMX. The most important ones for lag monitoring:
| Metric | MBean | What It Tells You |
|---|---|---|
| records-lag-max | kafka.consumer:type=consumer-fetch-manager-metrics,client-id={id} | Highest lag across all assigned partitions |
| records-lag | kafka.consumer:type=consumer-fetch-manager-metrics,client-id={id},topic={t},partition={p} | Lag for a specific partition |
| fetch-rate | kafka.consumer:type=consumer-fetch-manager-metrics,client-id={id} | Fetch requests per second |
| records-consumed-rate | kafka.consumer:type=consumer-fetch-manager-metrics,client-id={id} | Messages consumed per second |
| commit-rate | kafka.consumer:type=consumer-coordinator-metrics,client-id={id} | Offset commits per second |
Expose these via a JMX exporter to Prometheus, then build Grafana dashboards showing lag over time per partition.
Burrow for Automated Lag Evaluation
LinkedIn’s open-source tool Burrow goes beyond raw lag numbers. It evaluates whether a consumer is healthy by looking at the lag trend over a sliding window, not just the current value. Burrow classifies consumers as OK, WARNING, or ERR based on whether lag is decreasing, stable, or increasing.
Burrow exposes an HTTP API:
# Get consumer group status
curl http://localhost:8000/v3/kafka/local/consumer/my-consumer-group/status
# Get lag for a specific topic
curl http://localhost:8000/v3/kafka/local/consumer/my-consumer-group/topic/orders
This is especially useful for alerting. A raw lag threshold (alert if lag > 10,000) generates false positives during normal traffic spikes. Burrow’s trend-based evaluation avoids that.
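To make the contrast with threshold alerting concrete, here is a toy version of the trend idea. This is not Burrow's actual algorithm (which, among other things, also checks whether offsets are being committed at all):

```java
import java.util.List;

public class LagTrend {
    public enum Status { OK, WARNING, ERR }

    // Toy trend-based evaluation: classify based on the shape of the lag
    // series over a sliding window rather than a single threshold.
    public static Status evaluate(List<Long> lagWindow) {
        long first = lagWindow.get(0);
        long last = lagWindow.get(lagWindow.size() - 1);
        if (last <= first) {
            return Status.OK; // lag flat or shrinking: consumer is keeping up
        }
        // Lag grew over the window; escalate to ERR only if it never dipped.
        for (int i = 1; i < lagWindow.size(); i++) {
            if (lagWindow.get(i) < lagWindow.get(i - 1)) {
                return Status.WARNING;
            }
        }
        return Status.ERR;
    }
}
```

With this scheme, a traffic spike that pushes lag to 50,000 and back down evaluates as OK, while lag creeping steadily from 500 to 5,000 evaluates as ERR, which matches what an on-call engineer actually cares about.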
A Debugging Workflow
When you get paged for consumer lag, here is a systematic approach.
Step 1: Is lag growing or stable?
Check lag at two points a minute apart. If it is growing, the problem is active. If it is stable (even if high), the consumer might be catching up after a restart or deployment.
# Check lag now
kafka-consumer-groups.sh --bootstrap-server $BROKER --describe --group $GROUP
# Wait 60 seconds, check again
kafka-consumer-groups.sh --bootstrap-server $BROKER --describe --group $GROUP
Compare the LAG column. If CURRENT-OFFSET is advancing but LAG is still growing, the consumer is working but not fast enough.
Step 2: Which partitions are lagging?
If lag is concentrated in one or two partitions, you likely have a hot partition (key skew) or one slow consumer instance. If lag is uniform across all partitions, the problem is global (not enough throughput, a shared downstream dependency is slow, etc.).
Step 3: Check consumer group membership.
kafka-consumer-groups.sh --bootstrap-server $BROKER --describe --group $GROUP --members
Verify that the expected number of consumers are in the group. If a consumer dropped out, its partitions were reassigned to the remaining consumers, increasing their load.
Step 4: Look at processing times.
Add instrumentation to your consumer code that logs or emits metrics for processing time per message or per batch. If the p99 processing time is 200ms and you are polling 500 records at a time, a worst-case poll cycle takes around 100 seconds. That number needs to stay below max.poll.interval.ms or you will trigger rebalances.
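That budget check can be expressed as a small helper (names are illustrative):

```java
public class PollBudget {
    // Worst-case time to process one poll() batch, in milliseconds.
    public static long worstCasePollMillis(int maxPollRecords, long p99PerRecordMillis) {
        return (long) maxPollRecords * p99PerRecordMillis;
    }

    // True if a full batch at p99 latency would exceed max.poll.interval.ms,
    // i.e. the consumer risks being kicked from the group mid-batch.
    public static boolean risksRebalance(int maxPollRecords, long p99PerRecordMillis,
                                         long maxPollIntervalMillis) {
        return worstCasePollMillis(maxPollRecords, p99PerRecordMillis) > maxPollIntervalMillis;
    }
}
```

With the numbers above (500 records at 200 ms p99), the worst case is 100 seconds, comfortably inside the 300-second default, but a downstream slowdown to 700 ms per record would blow the budget.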
Step 5: Check for rebalancing.
Search consumer logs for rebalance or revoked. Frequent rebalancing is both a symptom and a cause of lag. Each rebalance creates a processing pause, which increases lag, which can cause the consumer to exceed max.poll.interval.ms, which triggers another rebalance.
Step 6: Inspect GC logs.
If you are running JVM consumers, check for long GC pauses. Enable GC logging with:
-Xlog:gc*:file=/var/log/kafka-consumer-gc.log:time,uptime,level,tags
Look for pauses longer than your session.timeout.ms setting.
Fixing Consumer Lag
Scale Consumers (Up to Partition Count)
The simplest fix: add more consumer instances. Each new consumer takes over some partitions from existing consumers, distributing the processing load.
Partitions: 12
Current consumers: 3 (each handles 4 partitions)
After scaling to 6: each handles 2 partitions
After scaling to 12: each handles 1 partition (maximum parallelism)
Beyond the partition count, additional consumers sit idle. If you need more parallelism than you have partitions, increase the partition count first.
Tune Fetch Parameters
Adjust how much data the consumer fetches per request:
# Minimum bytes to fetch (default: 1). Increasing this reduces fetch
# request frequency but adds latency on low-throughput topics.
fetch.min.bytes=1024
# Maximum time to wait for fetch.min.bytes to accumulate (default: 500ms).
fetch.max.wait.ms=500
# Maximum bytes per partition per fetch (default: 1MB).
max.partition.fetch.bytes=2097152
# Maximum records returned per poll() call (default: 500).
# Lower this if processing is slow; raise it if processing is fast.
max.poll.records=200
If your consumer is slow, lowering max.poll.records ensures each poll() call returns quickly enough to stay within max.poll.interval.ms. If your consumer is fast but starved for data, increasing fetch.min.bytes and max.partition.fetch.bytes improves throughput.
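One way to pick a value for max.poll.records is to work backwards from the poll interval. This is a rough sizing heuristic with a safety factor for jitter, not an official Kafka formula:

```java
public class MaxPollRecordsTuning {
    // Largest max.poll.records that keeps a worst-case batch inside
    // max.poll.interval.ms. The safety factor (e.g. 0.5) leaves headroom
    // for commit latency, GC, and latency spikes above p99.
    public static int safeMaxPollRecords(long maxPollIntervalMillis,
                                         long p99PerRecordMillis,
                                         double safetyFactor) {
        long budget = (long) (maxPollIntervalMillis * safetyFactor);
        return (int) Math.max(1, budget / p99PerRecordMillis);
    }
}
```

For example, a 300-second poll interval, 200 ms p99 per record, and a 0.5 safety factor yields 750 records per poll; a much slower 2-second per-record cost would push the safe batch size down to 75.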
Optimize Processing Logic
This is usually where the biggest gains come from. Common optimizations:
Batch database writes instead of writing one row per message:
// Slow: one INSERT per message
for (Record record : records) {
db.insert(record); // 5ms per call
}
// Fast: batch INSERT
List<Record> batch = new ArrayList<>(records);
db.batchInsert(batch); // 20ms for 500 records
Use async I/O for external calls:
// Slow: blocking HTTP call per message
for (Record record : records) {
httpClient.post(endpoint, record); // 100ms per call
}
// Fast: async with futures, then await all
List<CompletableFuture<Response>> futures = records.stream()
.map(r -> httpClient.postAsync(endpoint, r))
.collect(toList());
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
Move heavy processing out of the consumer. If your consumer does complex transformations, consider writing raw messages to a fast store first, then processing them separately. This decouples consumption speed from processing speed.
Fix Partition Key Skew
If lag concentrates on specific partitions, examine your key distribution:
# Latest offset per partition (high offsets indicate high cumulative volume)
kafka-run-class.sh kafka.tools.GetOffsetShell \
--broker-list localhost:9092 \
--topic orders \
--time -1
If one partition has far more messages, your key choice is creating skew. Options:
- Use a different key with better distribution
- Add a random suffix to the key (if ordering within the original key is not required)
- Use a custom partitioner that spreads hot keys across multiple partitions
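The last option can be sketched as a plain function. A real implementation would plug into the producer's Partitioner interface and hash with murmur2; the salting idea is the point here:

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

public class HotKeySpreader {
    // Sketch of "spread hot keys": known-hot keys get a random offset so
    // they fan out over `spread` partitions; all other keys keep their
    // deterministic mapping. Note that ordering within a hot key is lost,
    // which is exactly the trade-off noted above.
    public static int partitionFor(String key, int numPartitions,
                                   Set<String> hotKeys, int spread) {
        int base = Math.floorMod(key.hashCode(), numPartitions);
        if (!hotKeys.contains(key)) {
            return base;
        }
        int offset = ThreadLocalRandom.current().nextInt(spread);
        return (base + offset) % numPartitions;
    }
}
```

The hot-key set could come from configuration or from a rolling frequency count; consumers reassembling per-key state would need to read from all `spread` partitions for that key.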
Handle Rebalancing
Reduce rebalance frequency by tuning these settings:
# Increase poll interval to give slow consumers more time (default: 300000ms)
max.poll.interval.ms=600000
# Increase session timeout (default: 45000ms in newer versions)
session.timeout.ms=60000
# Reduce heartbeat interval (should be ~1/3 of session timeout)
heartbeat.interval.ms=20000
If you are on Kafka 2.4+, use the CooperativeStickyAssignor to avoid stop-the-world rebalances:
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
This allows incremental rebalancing where only the affected partitions are revoked, and the remaining consumers keep processing.
Use Pause/Resume for Backpressure
If certain partitions are overwhelming your consumer while others are fine, you can pause consumption on the busy partitions temporarily:
// Pause partitions that are backing up downstream
consumer.pause(Collections.singleton(new TopicPartition("orders", 1)));
// Continue processing other partitions
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
// Resume when downstream is ready
consumer.resume(Collections.singleton(new TopicPartition("orders", 1)));
This is especially useful when your consumer writes to a downstream system that has its own throughput limits (a rate-limited API, for example).
When Lag Is Actually Fine
Not all lag requires action.
Batch consumers. If your consumer runs on a schedule (every hour, every 15 minutes), it will always have lag between runs. That is by design. As long as it catches up within each cycle, the lag is expected.
Catch-up after maintenance. When you restart consumers for a deployment or configuration change, they need time to process the messages that accumulated during the downtime. Lag spikes during deployment and then falls back to normal. This is not a problem.
Non-time-sensitive workloads. If you are populating a data warehouse that refreshes nightly, a few minutes of consumer lag is irrelevant. The data just needs to be there before the next refresh.
The signal that matters is the trend. If lag spikes and then recovers, your system is healthy. If lag grows steadily over time with no recovery, you have a capacity or performance problem that needs attention.
Getting Lag Under Control
Consumer lag is not a single problem. It is a symptom of a mismatch between production rate and consumption rate. The fix depends on which side of that equation is off. Start by measuring where the lag is (which partitions, which consumers), then trace it to a root cause (slow processing, skewed partitions, rebalancing, infrastructure). Apply the targeted fix rather than the generic one. Adding consumers helps if you are partition-bound, but it does nothing if the bottleneck is a synchronous database call inside your processing loop.
Build lag monitoring into your system from day one. Expose JMX metrics, set up Burrow or equivalent trend-based alerting, and track lag per partition over time. When lag does spike, you will know immediately whether it is a one-time event or a growing problem, and you will have the data to pinpoint the cause.