
Tutorials & How-To

May 22, 2025

16 min read

Debug Kafka Consumer Lag: A Step-by-Step Runbook

A practical runbook for triaging Kafka consumer lag — from checking group state to identifying slow partitions, rebalances, and sink bottlenecks.

Consumer lag is spiking, the alert fired, and you need to figure out what is wrong. This runbook walks through a systematic triage process: check the group state, identify which partitions are lagging, determine if the bottleneck is the consumer or the sink, and decide whether to scale or fix.

Each step includes the exact commands to run and what the output tells you. Bookmark this for the next 3 AM page.

For background on what consumer lag is and general strategies, see Kafka consumer lag: causes, debugging, and fixes. This runbook assumes you already understand the basics and need to triage a live incident.

Step 1: Check Consumer Group State

The first question is whether consumers are even running and connected.

kafka-consumer-groups.sh \
  --bootstrap-server $KAFKA_BOOTSTRAP \
  --describe \
  --group my-cdc-sink

What to look for in the output:

GROUP          TOPIC            PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG    CONSUMER-ID                           HOST           CLIENT-ID
my-cdc-sink   cdc.orders       0          4582910         4583200         290    consumer-1-abc123                     /10.0.1.5      consumer-1
my-cdc-sink   cdc.orders       1          4581000         4583195         2195   consumer-2-def456                     /10.0.1.6      consumer-2
my-cdc-sink   cdc.orders       2          4400000         4583180         183180 consumer-3-ghi789                     /10.0.1.7      consumer-3
my-cdc-sink   cdc.orders       3          4583100         4583210         110    consumer-1-abc123                     /10.0.1.5      consumer-1

Interpreting the Output

Check the GROUP STATE first (plain --describe output omits it; add the --state flag to the command above to print it):

| State | Meaning | Action |
| --- | --- | --- |
| Stable | Consumers are connected and assigned | Proceed to step 2 |
| Rebalancing | Consumers are being reassigned | Wait for the rebalance to complete, then check if a consumer is crashing (step 5) |
| Empty | No consumers connected | Your consumer process is down; check container/process health |
| Dead | Group metadata has been cleaned up | Consumer group may have expired or been deleted |

If the state is Empty or Dead, skip the rest of this runbook and go straight to your consumer process: check if the pod/container is running, look at its logs, and restart it.

Step 2: Identify Which Partitions Are Lagging

Look at the LAG column in the output from step 1.

Scenario A: All partitions have roughly equal lag

This means the entire consumer group is behind. Common causes:

  • Production rate increased (check with step 3)
  • Consumer processing slowed down (check with step 4)
  • Recent rebalance caused a gap (check with step 5)

Scenario B: One or two partitions have much higher lag than others

This is a hot partition. In the example above, partition 2 has 183,180 lag while others are under 3,000. This means:

  • The partition key is skewed (one key produces far more events than others)
  • OR the consumer assigned to that partition is sick (slow, GC pausing, network issues)

Check which consumer owns the lagging partition (CONSUMER-ID column) and investigate that specific instance.
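With four partitions you can eyeball this, but with dozens it helps to pull the hottest partition out programmatically. A small awk filter does it; the sample data below is the describe output from step 1 inlined so the snippet is self-contained, and in a live incident you would pipe the kafka-consumer-groups.sh output straight into the awk instead:

```shell
# Sample rows from step 1 (columns: group topic partition current-offset
# log-end-offset lag consumer-id host client-id)
describe_output='my-cdc-sink cdc.orders 0 4582910 4583200 290    consumer-1-abc123 /10.0.1.5 consumer-1
my-cdc-sink cdc.orders 1 4581000 4583195 2195   consumer-2-def456 /10.0.1.6 consumer-2
my-cdc-sink cdc.orders 2 4400000 4583180 183180 consumer-3-ghi789 /10.0.1.7 consumer-3
my-cdc-sink cdc.orders 3 4583100 4583210 110    consumer-1-abc123 /10.0.1.5 consumer-1'

# Column 6 is LAG, column 3 is PARTITION, column 7 is CONSUMER-ID
hottest=$(echo "$describe_output" | awk '
  $6 ~ /^[0-9]+$/ && $6 + 0 > max { max = $6 + 0; line = $3 " lag=" $6 " owner=" $7 }
  END { print line }')
echo "Hottest partition: $hottest"
```

On the sample data this prints partition 2 with its lag and owning consumer, which is exactly the instance to go investigate.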

Quick Lag Summary

# Total lag across all partitions for a group (LAG is column 6 of the output)
kafka-consumer-groups.sh \
  --bootstrap-server $KAFKA_BOOTSTRAP \
  --describe \
  --group my-cdc-sink 2>/dev/null \
  | awk '$6 ~ /^[0-9]+$/ {sum += $6; count++} END {
    print "Total lag:", sum, "across", count, "partitions"
  }'

Step 3: Check the Production Rate

Is the lag caused by consumers slowing down, or by producers speeding up?

# Check the topic's message rate using kafka-run-class
kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list $KAFKA_BOOTSTRAP \
  --topic cdc.orders \
  --time -1 > /tmp/offsets_now.txt

sleep 60

kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list $KAFKA_BOOTSTRAP \
  --topic cdc.orders \
  --time -1 > /tmp/offsets_later.txt

# Compare to get events/second per partition
paste /tmp/offsets_now.txt /tmp/offsets_later.txt \
  | awk -F'[:\t]' '{
    rate = ($6 - $3) / 60;
    total += rate;
    print $1 ":" $2, rate, "msgs/sec"
  } END {print "Total:", total, "msgs/sec"}'

If the production rate is significantly higher than normal, the source database is generating more changes than usual. This could be a bulk data load, a batch job, a traffic spike, or catch-up after a CDC pipeline restart.

If the production rate is normal but lag is growing, the problem is on the consumer side.
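Once you have both rates, a quick back-of-envelope check tells you whether the group will ever catch up on its own. The figures below are hypothetical placeholders; plug in your own total lag from step 2 and the measured rates:

```shell
# Hypothetical figures: substitute your total lag (step 2), production rate
# (step 3), and actual consumer throughput (step 4 metrics)
total_lag=183180
produce_rate=1200   # msgs/sec arriving on the topic
consume_rate=1500   # msgs/sec the group actually processes

if [ "$consume_rate" -le "$produce_rate" ]; then
  echo "Lag will keep growing: consuming no faster than producing"
else
  drain_secs=$(( total_lag / (consume_rate - produce_rate) ))
  echo "Estimated catch-up time: ${drain_secs}s at current rates"
fi
```

If the estimate is hours or the consume rate is below the produce rate, you will not drain the lag without scaling or optimizing (steps 4 through 8).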

Step 4: Profile Consumer Processing Time

The consumer spends time on three things: fetching messages from Kafka, processing/transforming them, and writing to the destination. You need to know which step is slow.

Check Consumer Metrics via JMX

If your consumer exposes JMX metrics (standard for Java/JVM consumers):

# Key consumer metrics to check
# records-consumed-rate: messages consumed per second
# records-lag-max: maximum lag across assigned partitions
# fetch-latency-avg: average time to fetch from broker
# commit-latency-avg: average time to commit offsets

If you use Prometheus + JMX Exporter, query:

# Consumer throughput
rate(kafka_consumer_fetch_manager_records_consumed_total{group="my-cdc-sink"}[5m])

# Fetch latency (time spent waiting for Kafka)
kafka_consumer_fetch_manager_fetch_latency_avg{group="my-cdc-sink"}

# Processing time per record (if instrumented)
rate(app_record_processing_seconds_sum{group="my-cdc-sink"}[5m])
/ rate(app_record_processing_seconds_count{group="my-cdc-sink"}[5m])

Common Consumer-Side Bottlenecks

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Low CPU, high fetch latency | Consumer is idle, waiting on Kafka | Check network, broker health |
| Low CPU, high processing time | Blocking calls in processing (HTTP, DB lookups) | Make calls async, batch lookups, add caching |
| High CPU, low throughput | Expensive serialization/deserialization | Profile, optimize serde, use a faster format |
| CPU spikes followed by pauses | GC pressure | Tune JVM (step 6) |
| Steady processing, but lag still grows | Throughput < production rate | Need more consumers (step 8) |

Step 5: Check for Rebalances

Rebalances stop all consumption while partitions are reassigned. Frequent rebalances are a common cause of growing lag.

# Check consumer group membership changes
# Look for JoinGroup and LeaveGroup in broker logs
# Or check consumer logs for:
grep -i "rebalanc" /var/log/my-consumer/app.log | tail -20

Common Rebalance Triggers

1. Consumer exceeding max.poll.interval.ms

If processing a batch of records takes longer than max.poll.interval.ms (default: 5 minutes), the broker considers the consumer dead and triggers a rebalance.

# Increase if processing is legitimately slow
max.poll.interval.ms=600000

# OR reduce records per poll to process faster
max.poll.records=100
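One sanity check before touching these settings: the worst-case poll-loop time is roughly max.poll.records multiplied by your per-record processing time, and it must stay comfortably under max.poll.interval.ms. With hypothetical numbers:

```shell
# Hypothetical per-record cost; measure yours from the step 4 metrics
per_record_ms=50
max_poll_records=500
max_poll_interval_ms=300000   # the 5-minute default

worst_case_ms=$(( per_record_ms * max_poll_records ))
echo "worst-case poll loop: ${worst_case_ms} ms (limit: ${max_poll_interval_ms} ms)"
if [ "$worst_case_ms" -ge "$max_poll_interval_ms" ]; then
  echo "At risk of rebalance: raise max.poll.interval.ms or lower max.poll.records"
fi
```

Here 500 records at 50 ms each is 25 seconds per loop, safely under the default; at 1 second per record the same batch would blow past it and trigger exactly the rebalance described above.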

2. Session timeout

If the consumer doesn’t send a heartbeat within session.timeout.ms (default: 45 seconds in newer clients), the broker kicks it out.

# Increase if network is flaky
session.timeout.ms=60000
heartbeat.interval.ms=20000

3. Consumer crashes and restarts

Check your container orchestrator (Kubernetes, ECS) for OOMKilled or CrashLoopBackOff events:

# Kubernetes
kubectl get pods -l app=my-cdc-consumer --watch
kubectl describe pod my-cdc-consumer-xyz | grep -A5 "Last State"

# Check for OOM kills
kubectl logs my-cdc-consumer-xyz --previous | tail -50

4. Rolling deployments

Deploying new consumer versions causes a rebalance for each instance that restarts. Use Kafka’s group.instance.id (static group membership) to minimize rebalance impact:

# Assign a stable ID to each consumer instance
group.instance.id=consumer-host-1
session.timeout.ms=60000

With static membership, a consumer that reconnects within the session timeout gets its old partitions back without triggering a full rebalance.

Step 6: Check for GC Pauses

Long GC pauses cause the consumer to miss heartbeats, triggering rebalances. Even short pauses add latency.

Check GC Logs

# If using G1GC with unified logging (JDK 11+)
# Look for pauses > 200ms
grep "Pause" gc.log | awk '{
  for(i=1;i<=NF;i++) if($i ~ /ms$/) {
    gsub("ms","",$i);
    if($i+0 > 200) print $0
  }
}'

# Quick summary of GC pause durations
grep "Pause" gc.log | grep -oP '[0-9.]+ms' | sort -rn | head -20

Common GC Fixes

| Problem | Fix |
| --- | --- |
| Long full GC pauses (> 5 s) | Reduce heap size or switch to G1/ZGC |
| Frequent young GC | Increase young gen size or reduce allocation rate |
| Heap growing over time | Memory leak; profile with jmap/async-profiler |
| Large heap with rare but long GC | Switch to ZGC (sub-ms pauses regardless of heap size) |

# JVM flags for better GC behavior
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:G1HeapRegionSize=16m
-Xms4g -Xmx4g  # Fixed heap size avoids resize pauses

# For JDK 17+, ZGC is production-ready
-XX:+UseZGC
-Xms8g -Xmx8g

Step 7: Check the Sink/Destination

If the consumer fetches messages quickly but lag still grows, the bottleneck is likely the destination write.

Common Sink Bottlenecks

Database destinations (PostgreSQL, MySQL):

-- Check for lock contention
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type IS NOT NULL
  AND state = 'active';

-- Check connection pool usage
SELECT count(*) FROM pg_stat_activity
WHERE datname = 'destination_db';

Data warehouse destinations (Snowflake, BigQuery, Databricks):

  • Check for queued queries or warehouse throttling
  • Look at API rate limit responses (429 status codes)
  • Verify batch/micro-batch size — too-small batches mean too many API calls

Elasticsearch/OpenSearch:

  • Check bulk indexing queue depth
  • Look for rejected requests in cluster stats
  • Monitor merge activity (heavy merging blocks indexing)

General sink checks:

# Check destination write latency from consumer metrics
# If you instrument your sink connector:
rate(sink_write_duration_seconds_sum[5m])
/ rate(sink_write_duration_seconds_count[5m])

# Check for write errors
rate(sink_write_errors_total[5m])

If the sink is the bottleneck, you have several options:

  • Batch writes instead of single-record writes
  • Increase connection pool size to the destination
  • Scale the destination (bigger warehouse, more Elasticsearch nodes)
  • Reduce write frequency by buffering in the consumer
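Of these, batching is usually the biggest single win for database sinks. A rough throughput-ceiling model, using a hypothetical round-trip latency to the destination, shows why:

```shell
# Hypothetical 2 ms round trip to the destination per write call
rtt_ms=2
batch_size=500

# One connection doing single-row writes is capped by round trips alone
single_row_ceiling=$(( 1000 / rtt_ms ))
# Batching amortizes one round trip across the whole batch
batched_ceiling=$(( single_row_ceiling * batch_size ))

echo "single-row ceiling: ${single_row_ceiling} rows/sec"
echo "batched ceiling:    ${batched_ceiling} rows/sec"
```

This ignores server-side write cost, so the real batched number will be lower, but the gap between per-row and batched ceilings is large enough that it often decides whether the sink keeps up at all.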

For CDC-specific sink patterns, see CDC destination patterns.

Step 8: Decide — Scale or Fix?

At this point, you know the bottleneck. Here is the decision tree:

When to Scale (Add More Consumers)

  • Consumer CPU is consistently > 70% across all instances
  • All partitions have roughly equal lag (no hot partitions)
  • You have more partitions than consumers (prerequisite for scaling)
  • Processing logic is already optimized

# Check number of partitions
kafka-topics.sh \
  --bootstrap-server $KAFKA_BOOTSTRAP \
  --describe \
  --topic cdc.orders \
  | head -1

# If partitions = consumers, you need more partitions first
kafka-topics.sh \
  --bootstrap-server $KAFKA_BOOTSTRAP \
  --alter \
  --topic cdc.orders \
  --partitions 24

Important: Increasing partitions changes the key-to-partition mapping, so per-key ordering is not preserved across the change. If your pipeline depends on per-key ordering (common in CDC), verify that your consumers can tolerate the reshuffle before altering the topic.
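To see why ordering breaks, note that the default partitioner assigns partition = hash(key) % partition_count. Kafka actually uses a murmur2 hash; the cksum below is only a stand-in to illustrate the principle that the same key can land on a different partition once the count changes:

```shell
# Illustration only: Kafka's default partitioner hashes keys with murmur2,
# not cksum; the point is that hash(key) % N changes when N changes
key="order-12345"
hash=$(printf '%s' "$key" | cksum | cut -d' ' -f1)

echo "key=$key hash=$hash"
echo "with 12 partitions -> partition $(( hash % 12 ))"
echo "with 24 partitions -> partition $(( hash % 24 ))"
```

Events for a key written before the change stay on the old partition while new events may go to a different one, so a consumer can observe them out of order during the transition.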

When to Fix (Optimize Processing)

  • Consumer CPU is low (< 40%) but throughput is also low → blocking I/O
  • One consumer is much slower than others → investigate that instance
  • GC pauses dominate the timeline → tune JVM
  • Sink is the bottleneck → optimize writes or scale the destination
  • Hot partition → fix partition key distribution

When to Do Both

If you are in an active incident with growing lag:

  1. Immediate: Scale consumers (if partitions allow) to stop the bleeding
  2. Short-term: Optimize the identified bottleneck
  3. Long-term: Set up monitoring and alerts so you catch it earlier next time

Setting Up Lag Monitoring

After resolving the incident, set up monitoring to catch lag before it becomes an emergency:

# Prometheus alerting rules
groups:
  - name: kafka-consumer-lag
    rules:
      - alert: ConsumerLagHigh
        expr: |
          sum by (group) (
            kafka_consumer_group_lag
          ) > 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.group }} lag > 100k for 10 minutes"

      - alert: ConsumerLagCritical
        expr: |
          sum by (group) (
            kafka_consumer_group_lag
          ) > 1000000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Consumer group {{ $labels.group }} lag > 1M for 5 minutes"

      - alert: ConsumerLagGrowing
        expr: |
          deriv(
            sum by (group) (kafka_consumer_group_lag)[30m:]
          ) > 100
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.group }} lag is growing steadily"

The ConsumerLagGrowing alert is the most useful for early detection — it fires when lag is trending upward, even if the absolute number is still small. This gives you time to investigate before it becomes a production incident.

For a broader view of monitoring streaming infrastructure, see data freshness monitoring.

Quick Reference: The Triage Checklist

  1. Consumer group state → Is it Stable, Rebalancing, or Empty?
  2. Partition lag distribution → Even lag or hot partitions?
  3. Production rate → Did throughput spike on the source side?
  4. Consumer processing time → Where is the time spent (fetch, process, write)?
  5. Rebalance frequency → Are consumers bouncing?
  6. GC pauses → Is the JVM stalling?
  7. Sink health → Is the destination keeping up?
  8. Scale or fix → Based on the bottleneck, choose the right action

When to Use a Managed Pipeline Instead

If you find yourself debugging consumer lag, managing rebalances, and tuning JVM settings regularly, it is worth asking whether the operational cost is justified. Managed streaming platforms like Streamkap handle consumer group management, automatic scaling, and lag monitoring internally — you configure the source and destination, and the platform manages the Kafka consumer layer.

This does not eliminate the need to understand consumer lag (destination bottlenecks can still cause backpressure, for example), but it reduces the operational surface from “everything in this runbook” to “check the destination.”


Ready to stop debugging consumer lag at 3 AM? Streamkap manages the entire streaming pipeline including consumer scaling, offset management, and automatic lag recovery — so you can focus on your data instead of your infrastructure. Start a free trial or learn more about the platform.