Kafka Consumer Lag: Causes, Debugging, and Fixes
Consumer lag is the most common Kafka operational issue. Learn what causes it, how to measure it, and practical strategies to bring it under control.
Consumer lag is the single most common operational headache in Kafka. It shows up in monitoring dashboards, triggers pages at 3 AM, and sparks debates about whether to add partitions or rewrite the consumer. Before throwing hardware at the problem, it pays to understand what lag actually is, why it happens, and which fixes work for which root causes.
What Consumer Lag Actually Means
Every Kafka partition maintains a sequence of messages, each identified by an offset (a monotonically increasing integer). When a producer writes a message, the partition’s log-end offset advances. When a consumer reads and commits a message, its committed offset advances.
Consumer lag is the gap between those two numbers:
Lag = Log-End Offset - Consumer Committed Offset
Here is a simplified view of a single partition:
Partition 0:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]
                         ^                         ^
                Consumer Offset (6)      Log-End Offset (12)

Lag = 12 - 6 = 6 messages
In a topic with multiple partitions, total consumer group lag is the sum of lag across all assigned partitions. A consumer group with 8 partitions each lagging by 500 messages has a total lag of 4,000.
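This arithmetic is simple enough to sketch directly. The class and method names below are illustrative; in a real system the two offset maps would come from Kafka's AdminClient (listOffsets for log-end offsets, listConsumerGroupOffsets for committed offsets):

```java
import java.util.Map;

// Sketch: compute per-partition and total lag from two offset maps,
// keyed by partition number. Illustrative only; fetching the offsets
// themselves requires the AdminClient.
public class LagCalculator {
    public static long partitionLag(long logEndOffset, long committedOffset) {
        // Clamp at zero in case a stale committed offset was read.
        return Math.max(0, logEndOffset - committedOffset);
    }

    public static long totalLag(Map<Integer, Long> logEnd, Map<Integer, Long> committed) {
        long total = 0;
        for (Map.Entry<Integer, Long> e : logEnd.entrySet()) {
            long c = committed.getOrDefault(e.getKey(), 0L);
            total += partitionLag(e.getValue(), c);
        }
        return total;
    }
}
```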
Some lag is always present. Producers and consumers operate asynchronously, so there is always a small gap between when a message is written and when it gets processed. The question is whether that gap is stable or growing. Stable lag of a few hundred messages is normal in most systems. Lag that climbs continuously means consumers are falling behind, and the backlog will keep growing until something changes.
Common Causes of Consumer Lag
Slow Message Processing
This is the most frequent cause. If your consumer takes 50ms to process each message and you receive 1,000 messages per second on a single partition, one consumer needs 50 seconds to work through what arrived in one second. The math doesn’t work.
Typical slow-processing culprits include:
- Synchronous database writes per message instead of batching
- HTTP calls to external APIs inside the processing loop
- Heavy transformations (parsing large JSON, running regex, computing aggregations)
- Locking or contention in multi-threaded consumers
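The back-of-the-envelope check above is worth automating when you do capacity planning. A minimal sketch (the names are made up for illustration), using the article's figures of 50 ms per message at 1,000 msg/s:

```java
public class ThroughputCheck {
    // Messages one consumer can process per second given per-message latency.
    public static double maxThroughputPerConsumer(double perMessageMillis) {
        return 1000.0 / perMessageMillis;
    }

    // Minimum consumers needed to keep up with an arrival rate. Deliberately
    // rough: ignores batching, fetch overhead, and downstream variance.
    public static int minConsumers(double arrivalRatePerSec, double perMessageMillis) {
        return (int) Math.ceil(arrivalRatePerSec / maxThroughputPerConsumer(perMessageMillis));
    }
}
```

At 50 ms per message, one consumer tops out at 20 msg/s, so 1,000 msg/s would need 50 consumers' worth of throughput, which is only reachable if you have at least that many partitions.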
Partition Skew (Hot Partitions)
If your producer uses a message key for partitioning, and certain keys appear far more often than others, some partitions will receive disproportionate traffic. One partition might get 80% of the messages while the others sit nearly idle.
The consumer assigned to the hot partition falls behind while the other consumers have nothing to do. Total lag concentrates in one or two partitions rather than being spread evenly.
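To see why skew emerges, here is a toy model of key-based partitioning. Kafka's default partitioner actually hashes the key bytes with murmur2; String.hashCode is used below purely to demonstrate the effect:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SkewDemo {
    // Illustrative stand-in for hash-based partitioning (real Kafka uses
    // murmur2 over the serialized key, not hashCode).
    public static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Count how many messages land on each partition for a list of keys.
    // A heavily repeated key sends all of its messages to one partition.
    public static Map<Integer, Integer> distribution(List<String> keys, int numPartitions) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String key : keys) {
            counts.merge(partitionFor(key, numPartitions), 1, Integer::sum);
        }
        return counts;
    }
}
```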
Consumer Group Rebalancing
Every time a consumer joins or leaves a group, Kafka triggers a rebalance. During a rebalance, all consumers in the group stop processing while partition assignments are redistributed. In older protocol versions, this stop-the-world pause can last seconds to minutes depending on group size.
Frequent rebalancing is often caused by:
- Consumers missing heartbeats within session.timeout.ms (often caused by long GC pauses)
- Consumers exceeding max.poll.interval.ms between poll() calls due to slow processing
- Unstable deployments where consumers restart repeatedly
- Network blips that make the group coordinator think a consumer died
Each rebalance introduces a processing gap where lag grows.
Garbage Collection Pauses
JVM-based consumers (which covers most Kafka client libraries) are subject to GC pauses. A long GC pause can cause a consumer to miss its heartbeat deadline, which triggers a rebalance. Even without triggering a rebalance, a 2-second GC pause means 2 seconds of zero processing, during which lag accumulates.
This is especially common with consumers that hold large in-memory state, use default JVM heap settings, or process large messages that create heavy object churn.
Network and Infrastructure Bottlenecks
Consumers fetch messages over the network. If the network between your consumers and brokers is saturated, or if the brokers themselves are under disk I/O pressure, fetch requests will be slow. You will see high fetch latency and low throughput even though the consumer processing logic is fast.
Cross-region consumption (consumers in us-east-1 reading from brokers in eu-west-1, for example) adds latency to every fetch cycle.
Deserialization Overhead
If your messages use a schema (Avro, Protobuf, JSON Schema) and the consumer must deserialize each one, the cost of deserialization can become significant at high throughput. Schema registry lookups, especially uncached ones, add latency per message.
Too Few Consumers for the Partition Count
Kafka’s parallelism model is partition-based. Within a consumer group, each partition is assigned to exactly one consumer. If you have 20 partitions and only 2 consumers, each consumer handles 10 partitions serially. Adding more consumers (up to the partition count) distributes the load.
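The distribution can be sketched with a simple round-robin assignment. Real assignors (range, sticky, cooperative-sticky) differ in detail, but the one-consumer-per-partition constraint is the same:

```java
import java.util.ArrayList;
import java.util.List;

public class AssignmentSketch {
    // Round-robin sketch of partition-to-consumer assignment: each partition
    // goes to exactly one consumer; consumers beyond the partition count
    // end up with an empty assignment and sit idle.
    public static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        List<List<Integer>> assignments = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) {
            assignments.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            assignments.get(p % numConsumers).add(p);
        }
        return assignments;
    }
}
```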
Measuring and Monitoring Lag
The CLI: kafka-consumer-groups.sh
The fastest way to check lag is the built-in command-line tool:
kafka-consumer-groups.sh \
--bootstrap-server localhost:9092 \
--describe \
--group my-consumer-group
Output looks like this:
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
my-consumer-group orders 0 145230 145235 5 consumer-1-abc-xyz /10.0.1.12 consumer-1
my-consumer-group orders 1 298100 312400 14300 consumer-2-def-uvw /10.0.1.13 consumer-2
my-consumer-group orders 2 201800 201805 5 consumer-3-ghi-rst /10.0.1.14 consumer-3
my-consumer-group orders 3 176500 176504 4 consumer-1-abc-xyz /10.0.1.12 consumer-1
In this example, partition 1 has a lag of 14,300 while the others are nearly caught up. That points to either a hot partition or a slow consumer on that assignment.
To list all consumer groups:
kafka-consumer-groups.sh \
--bootstrap-server localhost:9092 \
--list
To see the overall state of a group:
kafka-consumer-groups.sh \
--bootstrap-server localhost:9092 \
--describe \
--group my-consumer-group \
--state
JMX Metrics
Kafka consumers expose several useful metrics via JMX. The most important ones for lag monitoring:
| Metric | MBean | What It Tells You |
|---|---|---|
| records-lag-max | kafka.consumer:type=consumer-fetch-manager-metrics,client-id={id} | Highest lag across all assigned partitions |
| records-lag | kafka.consumer:type=consumer-fetch-manager-metrics,client-id={id},topic={t},partition={p} | Lag for a specific partition |
| fetch-rate | kafka.consumer:type=consumer-fetch-manager-metrics,client-id={id} | Fetch requests per second |
| records-consumed-rate | kafka.consumer:type=consumer-fetch-manager-metrics,client-id={id} | Messages consumed per second |
| commit-rate | kafka.consumer:type=consumer-coordinator-metrics,client-id={id} | Offset commits per second |
Expose these via a JMX exporter to Prometheus, then build Grafana dashboards showing lag over time per partition.
Burrow for Automated Lag Evaluation
LinkedIn’s open-source tool Burrow goes beyond raw lag numbers. It evaluates whether a consumer is healthy by looking at the lag trend over a sliding window, not just the current value. Burrow classifies consumers as OK, WARNING, or ERR based on whether lag is decreasing, stable, or increasing.
Burrow exposes an HTTP API:
# Get consumer group status
curl http://localhost:8000/v3/kafka/local/consumer/my-consumer-group/status
# Get lag for a specific topic
curl http://localhost:8000/v3/kafka/local/consumer/my-consumer-group/topic/orders
This is especially useful for alerting. A raw lag threshold (alert if lag > 10,000) generates false positives during normal traffic spikes. Burrow’s trend-based evaluation avoids that.
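To make the contrast with threshold alerting concrete, here is a toy version of the trend idea. This is not Burrow's actual algorithm (which, among other things, also checks whether offsets are being committed at all):

```java
import java.util.List;

public class LagTrend {
    public enum Status { OK, WARNING, ERR }

    // Toy trend-based evaluation: classify based on the shape of the lag
    // series over a sliding window rather than a single threshold.
    public static Status evaluate(List<Long> lagWindow) {
        long first = lagWindow.get(0);
        long last = lagWindow.get(lagWindow.size() - 1);
        if (last <= first) {
            return Status.OK; // lag flat or shrinking: consumer is keeping up
        }
        // Lag grew over the window; escalate to ERR only if it never dipped.
        for (int i = 1; i < lagWindow.size(); i++) {
            if (lagWindow.get(i) < lagWindow.get(i - 1)) {
                return Status.WARNING;
            }
        }
        return Status.ERR;
    }
}
```

With this scheme, a traffic spike that pushes lag to 50,000 and back down evaluates as OK, while lag creeping steadily from 500 to 5,000 evaluates as ERR, which matches what an on-call engineer actually cares about.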
A Debugging Workflow
When you get paged for consumer lag, here is a systematic approach.
Step 1: Is lag growing or stable?
Check lag at two points a minute apart. If it is growing, the problem is active. If it is stable (even if high), the consumer might be catching up after a restart or deployment.
# Check lag now
kafka-consumer-groups.sh --bootstrap-server $BROKER --describe --group $GROUP
# Wait 60 seconds, check again
kafka-consumer-groups.sh --bootstrap-server $BROKER --describe --group $GROUP
Compare the LAG column. If CURRENT-OFFSET is advancing but LAG is still growing, the consumer is working but not fast enough.
Step 2: Which partitions are lagging?
If lag is concentrated in one or two partitions, you likely have a hot partition (key skew) or one slow consumer instance. If lag is uniform across all partitions, the problem is global (not enough throughput, a shared downstream dependency is slow, etc.).
Step 3: Check consumer group membership.
kafka-consumer-groups.sh --bootstrap-server $BROKER --describe --group $GROUP --members
Verify that the expected number of consumers are in the group. If a consumer dropped out, its partitions were reassigned to the remaining consumers, increasing their load.
Step 4: Look at processing times.
Add instrumentation to your consumer code that logs or emits metrics for processing time per message or per batch. If the p99 processing time is 200ms and you are polling 500 records at a time, a worst-case poll cycle takes around 100 seconds. That number needs to stay below max.poll.interval.ms or you will trigger rebalances.
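That budget check can be expressed as a small helper (names are illustrative):

```java
public class PollBudget {
    // Worst-case time to process one poll() batch, in milliseconds.
    public static long worstCasePollMillis(int maxPollRecords, long p99PerRecordMillis) {
        return (long) maxPollRecords * p99PerRecordMillis;
    }

    // True if a full batch at p99 latency would exceed max.poll.interval.ms,
    // i.e. the consumer risks being kicked from the group mid-batch.
    public static boolean risksRebalance(int maxPollRecords, long p99PerRecordMillis,
                                         long maxPollIntervalMillis) {
        return worstCasePollMillis(maxPollRecords, p99PerRecordMillis) > maxPollIntervalMillis;
    }
}
```

With the numbers above (500 records at 200 ms p99), the worst case is 100 seconds, comfortably inside the 300-second default, but a downstream slowdown to 700 ms per record would blow the budget.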
Step 5: Check for rebalancing.
Search consumer logs for rebalance or revoked. Frequent rebalancing is both a symptom and a cause of lag. Each rebalance creates a processing pause, which increases lag, which can cause the consumer to exceed max.poll.interval.ms, which triggers another rebalance.
Step 6: Inspect GC logs.
If you are running JVM consumers, check for long GC pauses. Enable GC logging with:
-Xlog:gc*:file=/var/log/kafka-consumer-gc.log:time,uptime,level,tags
Look for pauses longer than your session.timeout.ms setting.
Fixing Consumer Lag
Scale Consumers (Up to Partition Count)
The simplest fix: add more consumer instances. Each new consumer takes over some partitions from existing consumers, distributing the processing load.
Partitions: 12
Current consumers: 3 (each handles 4 partitions)
After scaling to 6: each handles 2 partitions
After scaling to 12: each handles 1 partition (maximum parallelism)
Beyond the partition count, additional consumers sit idle. If you need more parallelism than you have partitions, increase the partition count first.
Tune Fetch Parameters
Adjust how much data the consumer fetches per request:
# Minimum bytes to fetch (default: 1). Increasing this reduces fetch
# request frequency but adds latency on low-throughput topics.
fetch.min.bytes=1024
# Maximum time to wait for fetch.min.bytes to accumulate (default: 500ms).
fetch.max.wait.ms=500
# Maximum bytes per partition per fetch (default: 1MB).
max.partition.fetch.bytes=2097152
# Maximum records returned per poll() call (default: 500).
# Lower this if processing is slow; raise it if processing is fast.
max.poll.records=200
If your consumer is slow, lowering max.poll.records ensures each poll() call returns quickly enough to stay within max.poll.interval.ms. If your consumer is fast but starved for data, increasing fetch.min.bytes and max.partition.fetch.bytes improves throughput.
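One way to pick a value for max.poll.records is to work backwards from the poll interval. This is a rough sizing heuristic with a safety factor for jitter, not an official Kafka formula:

```java
public class MaxPollRecordsTuning {
    // Largest max.poll.records that keeps a worst-case batch inside
    // max.poll.interval.ms. The safety factor (e.g. 0.5) leaves headroom
    // for commit latency, GC, and latency spikes above p99.
    public static int safeMaxPollRecords(long maxPollIntervalMillis,
                                         long p99PerRecordMillis,
                                         double safetyFactor) {
        long budget = (long) (maxPollIntervalMillis * safetyFactor);
        return (int) Math.max(1, budget / p99PerRecordMillis);
    }
}
```

For example, a 300-second poll interval, 200 ms p99 per record, and a 0.5 safety factor yields 750 records per poll; a much slower 2-second per-record cost would push the safe batch size down to 75.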
Optimize Processing Logic
This is usually where the biggest gains come from. Common optimizations:
Batch database writes instead of writing one row per message:
// Slow: one INSERT per message
for (Record record : records) {
db.insert(record); // 5ms per call
}
// Fast: batch INSERT
List<Record> batch = new ArrayList<>(records);
db.batchInsert(batch); // 20ms for 500 records
Use async I/O for external calls:
// Slow: blocking HTTP call per message
for (Record record : records) {
httpClient.post(endpoint, record); // 100ms per call
}
// Fast: async with futures, then await all
List<CompletableFuture<Response>> futures = records.stream()
.map(r -> httpClient.postAsync(endpoint, r))
.collect(toList());
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
Move heavy processing out of the consumer. If your consumer does complex transformations, consider writing raw messages to a fast store first, then processing them separately. This decouples consumption speed from processing speed.
Fix Partition Key Skew
If lag concentrates on specific partitions, examine your key distribution:
# Latest offset per partition (high offsets indicate high cumulative volume)
kafka-run-class.sh kafka.tools.GetOffsetShell \
--broker-list localhost:9092 \
--topic orders \
--time -1
If one partition has far more messages, your key choice is creating skew. Options:
- Use a different key with better distribution
- Add a random suffix to the key (if ordering within the original key is not required)
- Use a custom partitioner that spreads hot keys across multiple partitions
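The last option can be sketched as a plain function. A real implementation would plug into the producer's Partitioner interface and hash with murmur2; the salting idea is the point here:

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

public class HotKeySpreader {
    // Sketch of "spread hot keys": known-hot keys get a random offset so
    // they fan out over `spread` partitions; all other keys keep their
    // deterministic mapping. Note that ordering within a hot key is lost,
    // which is exactly the trade-off noted above.
    public static int partitionFor(String key, int numPartitions,
                                   Set<String> hotKeys, int spread) {
        int base = Math.floorMod(key.hashCode(), numPartitions);
        if (!hotKeys.contains(key)) {
            return base;
        }
        int offset = ThreadLocalRandom.current().nextInt(spread);
        return (base + offset) % numPartitions;
    }
}
```

The hot-key set could come from configuration or from a rolling frequency count; consumers reassembling per-key state would need to read from all `spread` partitions for that key.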
Handle Rebalancing
Reduce rebalance frequency by tuning these settings:
# Increase poll interval to give slow consumers more time (default: 300000ms)
max.poll.interval.ms=600000
# Increase session timeout (default: 45000ms in newer versions)
session.timeout.ms=60000
# Reduce heartbeat interval (should be ~1/3 of session timeout)
heartbeat.interval.ms=20000
If you are on Kafka 2.4+, use the CooperativeStickyAssignor to avoid stop-the-world rebalances:
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
This allows incremental rebalancing where only the affected partitions are revoked, and the remaining consumers keep processing.
Use Pause/Resume for Backpressure
If certain partitions are overwhelming your consumer while others are fine, you can pause consumption on the busy partitions temporarily:
// Pause partitions that are backing up downstream
consumer.pause(Collections.singleton(new TopicPartition("orders", 1)));
// Continue processing other partitions
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
// Resume when downstream is ready
consumer.resume(Collections.singleton(new TopicPartition("orders", 1)));
This is especially useful when your consumer writes to a downstream system that has its own throughput limits (a rate-limited API, for example).
When Lag Is Actually Fine
Not all lag requires action.
Batch consumers. If your consumer runs on a schedule (every hour, every 15 minutes), it will always have lag between runs. That is by design. As long as it catches up within each cycle, the lag is expected.
Catch-up after maintenance. When you restart consumers for a deployment or configuration change, they need time to process the messages that accumulated during the downtime. Lag spikes during deployment and then falls back to normal. This is not a problem.
Non-time-sensitive workloads. If you are populating a data warehouse that refreshes nightly, a few minutes of consumer lag is irrelevant. The data just needs to be there before the next refresh.
The signal that matters is the trend. If lag spikes and then recovers, your system is healthy. If lag grows steadily over time with no recovery, you have a capacity or performance problem that needs attention.
Getting Lag Under Control
Consumer lag is not a single problem. It is a symptom of a mismatch between production rate and consumption rate. The fix depends on which side of that equation is off. Start by measuring where the lag is (which partitions, which consumers), then trace it to a root cause (slow processing, skewed partitions, rebalancing, infrastructure). Apply the targeted fix rather than the generic one. Adding consumers helps if you are partition-bound, but it does nothing if the bottleneck is a synchronous database call inside your processing loop.
Build lag monitoring into your system from day one. Expose JMX metrics, set up Burrow or equivalent trend-based alerting, and track lag per partition over time. When lag does spike, you will know immediately whether it is a one-time event or a growing problem, and you will have the data to pinpoint the cause.