Debug Kafka Consumer Lag: A Step-by-Step Runbook
A practical runbook for triaging Kafka consumer lag — from checking group state to identifying slow partitions, rebalances, and sink bottlenecks.
Consumer lag is spiking, the alert fired, and you need to figure out what is wrong. This runbook walks through a systematic triage process: check the group state, identify which partitions are lagging, determine if the bottleneck is the consumer or the sink, and decide whether to scale or fix.
Each step includes the exact commands to run and what the output tells you. Bookmark this for the next 3 AM page.
For background on what consumer lag is and general strategies, see Kafka consumer lag: causes, debugging, and fixes. This runbook assumes you already understand the basics and need to triage a live incident.
Step 1: Check Consumer Group State
The first question is whether consumers are even running and connected.
kafka-consumer-groups.sh \
--bootstrap-server $KAFKA_BOOTSTRAP \
--describe \
--group my-cdc-sink
What to look for in the output:
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
my-cdc-sink cdc.orders 0 4582910 4583200 290 consumer-1-abc123 /10.0.1.5 consumer-1
my-cdc-sink cdc.orders 1 4581000 4583195 2195 consumer-2-def456 /10.0.1.6 consumer-2
my-cdc-sink cdc.orders 2 4400000 4583180 183180 consumer-3-ghi789 /10.0.1.7 consumer-3
my-cdc-sink cdc.orders 3 4583100 4583210 110 consumer-1-abc123 /10.0.1.5 consumer-1
Interpreting the Output
Check GROUP STATE first (shown at the top of the output):
| State | Meaning | Action |
|---|---|---|
| Stable | Consumers are connected and assigned | Proceed to step 2 |
| Rebalancing | Consumers are being reassigned | Wait for the rebalance to complete, then check if a consumer is crashing (step 5) |
| Empty | No consumers connected | Your consumer process is down; check container/process health |
| Dead | Group metadata has been cleaned up | Consumer group may have expired or been deleted |
If the state is Empty or Dead, skip the rest of this runbook and go straight to your consumer process: check if the pod/container is running, look at its logs, and restart it.
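If you only need the state, `--describe --state` prints it without the per-partition table. The helper below is a sketch that maps Kafka's state names to the actions in the table above (note that Kafka reports a rebalance as `PreparingRebalance` or `CompletingRebalance`); the commented awk extraction is an assumption about your client version's column layout, so verify it against your own output.

```shell
# Map Kafka's group-state names to the next triage step.
state_action() {
  case "$1" in
    Stable)
      echo "Proceed to step 2" ;;
    PreparingRebalance|CompletingRebalance)
      echo "Wait for the rebalance, then check step 5" ;;
    Empty)
      echo "Consumer process is down - check container/process health" ;;
    Dead)
      echo "Group metadata cleaned up - group expired or deleted" ;;
    *)
      echo "Unknown state: $1" ;;
  esac
}

# Extracting STATE from the --state output is an assumption about your
# client version's column layout - check it against your own output:
# STATE=$(kafka-consumer-groups.sh --bootstrap-server $KAFKA_BOOTSTRAP \
#   --describe --group my-cdc-sink --state | awk 'NR==2 {print $(NF-1)}')
# state_action "$STATE"
```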
Step 2: Identify Which Partitions Are Lagging
Look at the LAG column in the output from step 1.
Scenario A: All partitions have roughly equal lag
This means the entire consumer group is behind. Common causes:
- Production rate increased (check with step 3)
- Consumer processing slowed down (check with step 4)
- Recent rebalance caused a gap (check with step 5)
Scenario B: One or two partitions have much higher lag than others
This is a hot partition. In the example above, partition 2 has 183,180 lag while others are under 3,000. This means:
- The partition key is skewed (one key produces far more events than others)
- OR the consumer assigned to that partition is sick (slow, GC pausing, network issues)
Check which consumer owns the lagging partition (CONSUMER-ID column) and investigate that specific instance.
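The step-1 output can also be filtered mechanically. A sketch, assuming LAG is the sixth column and CONSUMER-ID the seventh as in the sample above; the 10,000 default threshold is arbitrary:

```shell
# Print partitions whose lag exceeds a threshold, with their owners.
# Reads `kafka-consumer-groups.sh --describe` output on stdin.
hot_partitions() {
  threshold="${1:-10000}"
  awk -v t="$threshold" \
    'NR>1 && $6 ~ /^[0-9]+$/ && $6+0 > t {print "partition", $3, "lag", $6, "owner", $7}'
}

# kafka-consumer-groups.sh --bootstrap-server $KAFKA_BOOTSTRAP \
#   --describe --group my-cdc-sink | hot_partitions 10000
```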
Quick Lag Summary
# Total lag across all partitions for a group
kafka-consumer-groups.sh \
--bootstrap-server $KAFKA_BOOTSTRAP \
--describe \
--group my-cdc-sink 2>/dev/null \
| awk 'NR>1 && $6 ~ /^[0-9]+$/ {sum += $6; count++} END {
print "Total lag:", sum, "across", count, "partitions"
}'
Step 3: Check the Production Rate
Is the lag caused by consumers slowing down, or by producers speeding up?
# Check the topic's message rate using kafka-run-class
# (newer Kafka versions accept --bootstrap-server in place of --broker-list)
kafka-run-class.sh kafka.tools.GetOffsetShell \
--broker-list $KAFKA_BOOTSTRAP \
--topic cdc.orders \
--time -1 > /tmp/offsets_now.txt
sleep 60
kafka-run-class.sh kafka.tools.GetOffsetShell \
--broker-list $KAFKA_BOOTSTRAP \
--topic cdc.orders \
--time -1 > /tmp/offsets_later.txt
# Compare to get events/second per partition
paste /tmp/offsets_now.txt /tmp/offsets_later.txt \
| awk -F'[:\t]' '{
rate = ($6 - $3) / 60;
total += rate;
print $1 ":" $2, rate, "msgs/sec"
} END {print "Total:", total, "msgs/sec"}'
If the production rate is significantly higher than normal, the source database is generating more changes than usual. This could be a bulk data load, a batch job, a traffic spike, or catch-up after a CDC pipeline restart.
If the production rate is normal but lag is growing, the problem is on the consumer side.
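To confirm the consumer side, measure the group's actual consume rate the same way: snapshot the committed offsets twice and diff them. If this rate is below the produce rate from the check above, lag will keep growing. A sketch assuming the `--describe` layout from step 1 (CURRENT-OFFSET in the fourth column); the file paths and 60-second window are arbitrary:

```shell
# Consume rate from two --describe snapshots taken N seconds apart.
consume_rate() {
  awk -v secs="$3" '
    NR==FNR { if ($4 ~ /^[0-9]+$/) before[$2":"$3] = $4; next }
    $4 ~ /^[0-9]+$/ { after[$2":"$3] = $4 }
    END {
      for (p in after) total += after[p] - before[p]
      printf "Consumed %d msgs in %ds (%.1f msgs/sec)\n", total, secs, total/secs
    }' "$1" "$2"
}

# kafka-consumer-groups.sh --bootstrap-server $KAFKA_BOOTSTRAP \
#   --describe --group my-cdc-sink > /tmp/group_now.txt
# sleep 60
# kafka-consumer-groups.sh --bootstrap-server $KAFKA_BOOTSTRAP \
#   --describe --group my-cdc-sink > /tmp/group_later.txt
# consume_rate /tmp/group_now.txt /tmp/group_later.txt 60
```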
Step 4: Profile Consumer Processing Time
The consumer spends time on three things: fetching messages from Kafka, processing/transforming them, and writing to the destination. You need to know which step is slow.
Check Consumer Metrics via JMX
If your consumer exposes JMX metrics (standard for Java/JVM consumers):
# Key consumer metrics to check
# records-consumed-rate: messages consumed per second
# records-lag-max: maximum lag across assigned partitions
# fetch-latency-avg: average time to fetch from broker
# commit-latency-avg: average time to commit offsets
If you use Prometheus + JMX Exporter, query:
# Consumer throughput
rate(kafka_consumer_fetch_manager_records_consumed_total{group="my-cdc-sink"}[5m])
# Fetch latency (time spent waiting for Kafka)
kafka_consumer_fetch_manager_fetch_latency_avg{group="my-cdc-sink"}
# Processing time per record (if instrumented)
rate(app_record_processing_seconds_sum{group="my-cdc-sink"}[5m])
/ rate(app_record_processing_seconds_count{group="my-cdc-sink"}[5m])
Common Consumer-Side Bottlenecks
| Symptom | Likely Cause | Fix |
|---|---|---|
| Low CPU, high fetch latency | Consumer is idle, waiting on Kafka | Check network, broker health |
| Low CPU, high processing time | Blocking calls in processing (HTTP, DB lookups) | Make calls async, batch lookups, add caching |
| High CPU, low throughput | Expensive serialization/deserialization | Profile, optimize serde, use a faster format |
| CPU spikes followed by pauses | GC pressure | Tune JVM (step 6) |
| Steady processing, but lag still grows | Throughput < production rate | Need more consumers (step 8) |
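The table above can be turned into a rough decision helper. A sketch with purely illustrative thresholds (the 40/70/100/50 cutoffs are assumptions, not canonical values); feed it the CPU percentage, average fetch latency, and per-record processing time you read off your dashboards:

```shell
# Rough triage of where the consumer is spending its time.
# Args: CPU %, avg fetch latency (ms), avg per-record processing time (ms).
classify_bottleneck() {
  cpu="$1"; fetch_ms="$2"; proc_ms="$3"
  if [ "$cpu" -lt 40 ] && [ "$fetch_ms" -gt 100 ]; then
    echo "Low CPU, high fetch latency: check network and broker health"
  elif [ "$cpu" -lt 40 ] && [ "$proc_ms" -gt 50 ]; then
    echo "Low CPU, high processing time: look for blocking I/O in processing"
  elif [ "$cpu" -gt 70 ]; then
    echo "High CPU: profile serde and processing, or scale out (step 8)"
  else
    echo "No obvious local bottleneck: check rebalances (step 5) and the sink (step 7)"
  fi
}

# classify_bottleneck 20 200 10
```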
Step 5: Check for Rebalances
Rebalances stop all consumption while partitions are reassigned. Frequent rebalances are a common cause of growing lag.
# Check consumer group membership changes
# Look for JoinGroup and LeaveGroup in broker logs
# Or check consumer logs for:
grep -i "rebalanc" /var/log/my-consumer/app.log | tail -20
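To tell chronic rebalancing from a one-off, bucket the matching log lines by hour. A sketch assuming ISO-style timestamps (e.g. `2024-05-01T03:12:45`) at the start of each line; adjust the `substr` for your log format:

```shell
# Count rebalance-related log lines per hour; chronic rebalancing shows up
# as nonzero counts in every bucket.
rebalances_per_hour() {
  grep -i "rebalanc" \
    | awk '{hour = substr($1, 1, 13); count[hour]++}
           END {for (h in count) print h, count[h]}' \
    | sort
}

# rebalances_per_hour < /var/log/my-consumer/app.log
```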
Common Rebalance Triggers
1. Consumer exceeding max.poll.interval.ms
If processing a batch of records takes longer than max.poll.interval.ms (default: 5 minutes), the broker considers the consumer dead and triggers a rebalance.
# Increase if processing is legitimately slow
max.poll.interval.ms=600000
# OR reduce records per poll to process faster
max.poll.records=100
2. Session timeout
If the consumer doesn’t send a heartbeat within session.timeout.ms (default: 45 seconds in newer clients), the broker kicks it out.
# Increase if network is flaky
session.timeout.ms=60000
heartbeat.interval.ms=20000
3. Consumer crashes and restarts
Check your container orchestrator (Kubernetes, ECS) for OOMKilled or CrashLoopBackOff events:
# Kubernetes
kubectl get pods -l app=my-cdc-consumer --watch
kubectl describe pod my-cdc-consumer-xyz | grep -A5 "Last State"
# Check for OOM kills
kubectl logs my-cdc-consumer-xyz --previous | tail -50
4. Rolling deployments
Deploying new consumer versions causes a rebalance for each instance that restarts. Use Kafka’s group.instance.id (static group membership) to minimize rebalance impact:
# Assign a stable ID to each consumer instance
group.instance.id=consumer-host-1
session.timeout.ms=60000
With static membership, a consumer that reconnects within the session timeout gets its old partitions back without triggering a full rebalance.
Step 6: Check for GC Pauses
Long GC pauses cause the consumer to miss heartbeats, triggering rebalances. Even short pauses add latency.
Check GC Logs
# If using G1GC with unified logging (JDK 11+)
# Look for pauses > 200ms
grep "Pause" gc.log | awk '{
for(i=1;i<=NF;i++) if($i ~ /ms$/) {
gsub("ms","",$i);
if($i+0 > 200) print $0
}
}'
# Quick summary of GC pause durations
grep "Pause" gc.log | grep -oP '[0-9.]+ms' | sort -rn | head -20
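To summarize instead of eyeballing the sorted list, the same extraction can feed a count/max/mean. A sketch assuming the JDK 11+ unified-logging format used above:

```shell
# Count / max / mean of GC pause durations from a unified-logging gc.log.
gc_pause_summary() {
  grep "Pause" \
    | grep -oE '[0-9.]+ms' \
    | tr -d 'ms' \
    | awk '{sum += $1; n++; if ($1 > max) max = $1}
           END {if (n) printf "pauses: %d  max: %.1fms  mean: %.1fms\n", n, max, sum/n}'
}

# gc_pause_summary < gc.log
```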
Common GC Fixes
| Problem | Fix |
|---|---|
| Long full GC pauses (> 5s) | Reduce heap size or switch to G1/ZGC |
| Frequent young GC | Increase young gen size or reduce allocation rate |
| Heap growing over time | Memory leak — profile with jmap/async-profiler |
| Large heap with rare but long GC | Switch to ZGC (sub-ms pauses regardless of heap size) |
# JVM flags for better GC behavior
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:G1HeapRegionSize=16m
-Xms4g -Xmx4g # Fixed heap size avoids resize pauses
# For JDK 17+, ZGC is production-ready
-XX:+UseZGC
-Xms8g -Xmx8g
Step 7: Check the Sink/Destination
If the consumer fetches messages quickly but lag still grows, the bottleneck is likely the destination write.
Common Sink Bottlenecks
Database destinations (PostgreSQL, MySQL):
-- Check for lock contention
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type IS NOT NULL
AND state = 'active';
-- Check connection pool usage
SELECT count(*) FROM pg_stat_activity
WHERE datname = 'destination_db';
Data warehouse destinations (Snowflake, BigQuery, Databricks):
- Check for queued queries or warehouse throttling
- Look at API rate limit responses (429 status codes)
- Verify batch/micro-batch size — too-small batches mean too many API calls
Elasticsearch/OpenSearch:
- Check bulk indexing queue depth
- Look for rejected requests in cluster stats
- Monitor merge activity (heavy merging blocks indexing)
General sink checks:
# Check destination write latency from consumer metrics
# If you instrument your sink connector:
rate(sink_write_duration_seconds_sum[5m])
/ rate(sink_write_duration_seconds_count[5m])
# Check for write errors
rate(sink_write_errors_total[5m])
If the sink is the bottleneck, you have several options:
- Batch writes instead of single-record writes
- Increase connection pool size to the destination
- Scale the destination (bigger warehouse, more Elasticsearch nodes)
- Reduce write frequency by buffering in the consumer
For CDC-specific sink patterns, see CDC destination patterns.
Step 8: Decide — Scale or Fix?
At this point, you know the bottleneck. Here is the decision tree:
When to Scale (Add More Consumers)
- Consumer CPU is consistently > 70% across all instances
- All partitions have roughly equal lag (no hot partitions)
- You have more partitions than consumers (prerequisite for scaling)
- Processing logic is already optimized
# Check number of partitions
kafka-topics.sh \
--bootstrap-server $KAFKA_BOOTSTRAP \
--describe \
--topic cdc.orders \
| head -1
# If partitions = consumers, you need more partitions first
kafka-topics.sh \
--bootstrap-server $KAFKA_BOOTSTRAP \
--alter \
--topic cdc.orders \
--partitions 24
Important: Increasing partitions changes the key-to-partition mapping: records with the same key can land on a different partition after the change, which breaks per-key ordering across the boundary. If your pipeline depends on per-key ordering (common in CDC), verify that your partition key still routes correctly before altering the topic.
When to Fix (Optimize Processing)
- Consumer CPU is low (< 40%) but throughput is also low → blocking I/O
- One consumer is much slower than others → investigate that instance
- GC pauses dominate the timeline → tune JVM
- Sink is the bottleneck → optimize writes or scale the destination
- Hot partition → fix partition key distribution
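To confirm a skewed key before changing anything, sample recent keys and count the distribution; a single dominant key confirms the hot partition. The console-consumer flags below are standard options, but the sample size and timeout are arbitrary:

```shell
# Count the most frequent keys in a sample of records.
key_distribution() { sort | uniq -c | sort -rn | head -10; }

# Sample ~1000 recent keys and feed them in:
# kafka-console-consumer.sh --bootstrap-server $KAFKA_BOOTSTRAP \
#   --topic cdc.orders --property print.key=true --property print.value=false \
#   --max-messages 1000 --timeout-ms 30000 | key_distribution
```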
When to Do Both
If you are in an active incident with growing lag:
- Immediate: Scale consumers (if partitions allow) to stop the bleeding
- Short-term: Optimize the identified bottleneck
- Long-term: Set up monitoring and alerts so you catch it earlier next time
Setting Up Lag Monitoring
After resolving the incident, set up monitoring to catch lag before it becomes an emergency:
# Prometheus alerting rules
groups:
- name: kafka-consumer-lag
rules:
- alert: ConsumerLagHigh
expr: |
sum by (group) (
kafka_consumer_group_lag
) > 100000
for: 10m
labels:
severity: warning
annotations:
summary: "Consumer group {{ $labels.group }} lag > 100k for 10 minutes"
- alert: ConsumerLagCritical
expr: |
sum by (group) (
kafka_consumer_group_lag
) > 1000000
for: 5m
labels:
severity: critical
annotations:
summary: "Consumer group {{ $labels.group }} lag > 1M for 5 minutes"
- alert: ConsumerLagGrowing
expr: |
deriv(
sum by (group) (kafka_consumer_group_lag)[30m:]
) > 100
for: 15m
labels:
severity: warning
annotations:
summary: "Consumer group {{ $labels.group }} lag is growing steadily"
The ConsumerLagGrowing alert is the most useful for early detection — it fires when lag is trending upward, even if the absolute number is still small. This gives you time to investigate before it becomes a production incident.
For a broader view of monitoring streaming infrastructure, see data freshness monitoring.
Quick Reference: The Triage Checklist
- Consumer group state → Is it Stable, Rebalancing, or Empty?
- Partition lag distribution → Even lag or hot partitions?
- Production rate → Did throughput spike on the source side?
- Consumer processing time → Where is the time spent (fetch, process, write)?
- Rebalance frequency → Are consumers bouncing?
- GC pauses → Is the JVM stalling?
- Sink health → Is the destination keeping up?
- Scale or fix → Based on the bottleneck, choose the right action
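The first two checklist items can be collapsed into a single snapshot pass. A sketch over the `--describe` output, assuming the column layout from step 1 (PARTITION in column 3, LAG in column 6):

```shell
# Total lag plus the single worst partition from one --describe snapshot.
triage_snapshot() {
  awk 'NR>1 && $6 ~ /^[0-9]+$/ {
         sum += $6; n++
         if ($6 + 0 > worst) { worst = $6; worst_p = $3 }
       }
       END {
         printf "total lag %d over %d partitions; worst: partition %s (%d)\n",
                sum, n, worst_p, worst
       }'
}

# kafka-consumer-groups.sh --bootstrap-server $KAFKA_BOOTSTRAP \
#   --describe --group my-cdc-sink | triage_snapshot
```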
When to Use a Managed Pipeline Instead
If you find yourself debugging consumer lag, managing rebalances, and tuning JVM settings regularly, it is worth asking whether the operational cost is justified. Managed streaming platforms like Streamkap handle consumer group management, automatic scaling, and lag monitoring internally — you configure the source and destination, and the platform manages the Kafka consumer layer.
This does not eliminate the need to understand consumer lag (destination bottlenecks can still cause backpressure, for example), but it reduces the operational surface from “everything in this runbook” to “check the destination.”
Ready to stop debugging consumer lag at 3 AM? Streamkap manages the entire streaming pipeline including consumer scaling, offset management, and automatic lag recovery — so you can focus on your data instead of your infrastructure. Start a free trial or learn more about the platform.