Debug Kafka Consumer Lag: A Step-by-Step Runbook
A practical runbook for triaging Kafka consumer lag — from checking group state to identifying slow partitions, rebalances, and sink bottlenecks.
Consumer lag is spiking, the alert fired, and you need to figure out what is wrong. This runbook walks through a systematic triage process: check the group state, identify which partitions are lagging, determine if the bottleneck is the consumer or the sink, and decide whether to scale or fix.
Each step includes the exact commands to run and what the output tells you. Bookmark this for the next 3 AM page.
For background on what consumer lag is and general strategies, see Kafka consumer lag: causes, debugging, and fixes. This runbook assumes you already understand the basics and need to triage a live incident.
Step 1: Check Consumer Group State
The first question is whether consumers are even running and connected.
kafka-consumer-groups.sh \
--bootstrap-server $KAFKA_BOOTSTRAP \
--describe \
--group my-cdc-sink
What to look for in the output:
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
my-cdc-sink cdc.orders 0 4582910 4583200 290 consumer-1-abc123 /10.0.1.5 consumer-1
my-cdc-sink cdc.orders 1 4581000 4583195 2195 consumer-2-def456 /10.0.1.6 consumer-2
my-cdc-sink cdc.orders 2 4400000 4583180 183180 consumer-3-ghi789 /10.0.1.7 consumer-3
my-cdc-sink cdc.orders 3 4583100 4583210 110 consumer-1-abc123 /10.0.1.5 consumer-1
Interpreting the Output
Check GROUP STATE first (shown at the top of the output):
| State | Meaning | Action |
|---|---|---|
| Stable | Consumers are connected and assigned | Proceed to step 2 |
| Rebalancing | Consumers are being reassigned | Wait for the rebalance to complete, then check if a consumer is crashing (step 5) |
| Empty | No consumers connected | Your consumer process is down; check container/process health |
| Dead | Group metadata has been cleaned up | Consumer group may have expired or been deleted |
If the state is Empty or Dead, skip the rest of this runbook and go straight to your consumer process: check if the pod/container is running, look at its logs, and restart it.
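If you only need the state, `--describe --state` prints it without the per-partition table. The helper below is a sketch that maps Kafka's state names to the actions in the table above (note that Kafka reports a rebalance as `PreparingRebalance` or `CompletingRebalance`); the commented awk extraction is an assumption about your client version's column layout, so verify it against your own output.

```shell
# Map Kafka's group-state names to the next triage step.
state_action() {
  case "$1" in
    Stable)
      echo "Proceed to step 2" ;;
    PreparingRebalance|CompletingRebalance)
      echo "Wait for the rebalance, then check step 5" ;;
    Empty)
      echo "Consumer process is down - check container/process health" ;;
    Dead)
      echo "Group metadata cleaned up - group expired or deleted" ;;
    *)
      echo "Unknown state: $1" ;;
  esac
}

# Extracting STATE from the --state output is an assumption about your
# client version's column layout - check it against your own output:
# STATE=$(kafka-consumer-groups.sh --bootstrap-server $KAFKA_BOOTSTRAP \
#   --describe --group my-cdc-sink --state | awk 'NR==2 {print $(NF-1)}')
# state_action "$STATE"
```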
Step 2: Identify Which Partitions Are Lagging
Look at the LAG column in the output from step 1.
Scenario A: All partitions have roughly equal lag
This means the entire consumer group is behind. Common causes:
- Production rate increased (check with step 3)
- Consumer processing slowed down (check with step 4)
- Recent rebalance caused a gap (check with step 5)
Scenario B: One or two partitions have much higher lag than others
This is a hot partition. In the example above, partition 2 has 183,180 lag while others are under 3,000. This means:
- The partition key is skewed (one key produces far more events than others)
- OR the consumer assigned to that partition is sick (slow, GC pausing, network issues)
Check which consumer owns the lagging partition (CONSUMER-ID column) and investigate that specific instance.
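The step-1 output can also be filtered mechanically. A sketch, assuming LAG is the sixth column and CONSUMER-ID the seventh as in the sample above; the 10,000 default threshold is arbitrary:

```shell
# Print partitions whose lag exceeds a threshold, with their owners.
# Reads `kafka-consumer-groups.sh --describe` output on stdin.
hot_partitions() {
  threshold="${1:-10000}"
  awk -v t="$threshold" \
    'NR>1 && $6 ~ /^[0-9]+$/ && $6+0 > t {print "partition", $3, "lag", $6, "owner", $7}'
}

# kafka-consumer-groups.sh --bootstrap-server $KAFKA_BOOTSTRAP \
#   --describe --group my-cdc-sink | hot_partitions 10000
```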
Quick Lag Summary
# Total lag across all partitions for a group
kafka-consumer-groups.sh \
--bootstrap-server $KAFKA_BOOTSTRAP \
--describe \
--group my-cdc-sink 2>/dev/null \
| awk 'NR>1 && $6 ~ /^[0-9]+$/ {sum += $6; count++} END {
print "Total lag:", sum, "across", count, "partitions"
}'
Step 3: Check the Production Rate
Is the lag caused by consumers slowing down, or by producers speeding up?
# Check the topic's message rate using kafka-run-class
# (newer Kafka versions accept --bootstrap-server in place of --broker-list)
kafka-run-class.sh kafka.tools.GetOffsetShell \
--broker-list $KAFKA_BOOTSTRAP \
--topic cdc.orders \
--time -1 > /tmp/offsets_now.txt
sleep 60
kafka-run-class.sh kafka.tools.GetOffsetShell \
--broker-list $KAFKA_BOOTSTRAP \
--topic cdc.orders \
--time -1 > /tmp/offsets_later.txt
# Compare to get events/second per partition
paste /tmp/offsets_now.txt /tmp/offsets_later.txt \
| awk -F'[:\t]' '{
rate = ($6 - $3) / 60;
total += rate;
print $1 ":" $2, rate, "msgs/sec"
} END {print "Total:", total, "msgs/sec"}'
If the production rate is significantly higher than normal, the source database is generating more changes than usual. This could be a bulk data load, a batch job, a traffic spike, or catch-up after a CDC pipeline restart.
If the production rate is normal but lag is growing, the problem is on the consumer side.
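To confirm the consumer side, measure the group's actual consume rate the same way: snapshot the committed offsets twice and diff them. If this rate is below the produce rate from the check above, lag will keep growing. A sketch assuming the `--describe` layout from step 1 (CURRENT-OFFSET in the fourth column); the file paths and 60-second window are arbitrary:

```shell
# Consume rate from two --describe snapshots taken N seconds apart.
consume_rate() {
  awk -v secs="$3" '
    NR==FNR { if ($4 ~ /^[0-9]+$/) before[$2":"$3] = $4; next }
    $4 ~ /^[0-9]+$/ { after[$2":"$3] = $4 }
    END {
      for (p in after) total += after[p] - before[p]
      printf "Consumed %d msgs in %ds (%.1f msgs/sec)\n", total, secs, total/secs
    }' "$1" "$2"
}

# kafka-consumer-groups.sh --bootstrap-server $KAFKA_BOOTSTRAP \
#   --describe --group my-cdc-sink > /tmp/group_now.txt
# sleep 60
# kafka-consumer-groups.sh --bootstrap-server $KAFKA_BOOTSTRAP \
#   --describe --group my-cdc-sink > /tmp/group_later.txt
# consume_rate /tmp/group_now.txt /tmp/group_later.txt 60
```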
Step 4: Profile Consumer Processing Time
The consumer spends time on three things: fetching messages from Kafka, processing/transforming them, and writing to the destination. You need to know which step is slow.
Check Consumer Metrics via JMX
If your consumer exposes JMX metrics (standard for Java/JVM consumers):
# Key consumer metrics to check
# records-consumed-rate: messages consumed per second
# records-lag-max: maximum lag across assigned partitions
# fetch-latency-avg: average time to fetch from broker
# commit-latency-avg: average time to commit offsets
If you use Prometheus + JMX Exporter, query:
# Consumer throughput
rate(kafka_consumer_fetch_manager_records_consumed_total{group="my-cdc-sink"}[5m])
# Fetch latency (time spent waiting for Kafka)
kafka_consumer_fetch_manager_fetch_latency_avg{group="my-cdc-sink"}
# Processing time per record (if instrumented)
rate(app_record_processing_seconds_sum{group="my-cdc-sink"}[5m])
/ rate(app_record_processing_seconds_count{group="my-cdc-sink"}[5m])
Common Consumer-Side Bottlenecks
| Symptom | Likely Cause | Fix |
|---|---|---|
| Low CPU, high fetch latency | Consumer is idle, waiting on Kafka | Check network, broker health |
| Low CPU, high processing time | Blocking calls in processing (HTTP, DB lookups) | Make calls async, batch lookups, add caching |
| High CPU, low throughput | Expensive serialization/deserialization | Profile, optimize serde, use a faster format |
| CPU spikes followed by pauses | GC pressure | Tune JVM (step 6) |
| Steady processing, but lag still grows | Throughput < production rate | Need more consumers (step 8) |
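The table above can be turned into a rough decision helper. A sketch with purely illustrative thresholds (the 40/70/100/50 cutoffs are assumptions, not canonical values); feed it the CPU percentage, average fetch latency, and per-record processing time you read off your dashboards:

```shell
# Rough triage of where the consumer is spending its time.
# Args: CPU %, avg fetch latency (ms), avg per-record processing time (ms).
classify_bottleneck() {
  cpu="$1"; fetch_ms="$2"; proc_ms="$3"
  if [ "$cpu" -lt 40 ] && [ "$fetch_ms" -gt 100 ]; then
    echo "Low CPU, high fetch latency: check network and broker health"
  elif [ "$cpu" -lt 40 ] && [ "$proc_ms" -gt 50 ]; then
    echo "Low CPU, high processing time: look for blocking I/O in processing"
  elif [ "$cpu" -gt 70 ]; then
    echo "High CPU: profile serde and processing, or scale out (step 8)"
  else
    echo "No obvious local bottleneck: check rebalances (step 5) and the sink (step 7)"
  fi
}

# classify_bottleneck 20 200 10
```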
Step 5: Check for Rebalances
Rebalances stop all consumption while partitions are reassigned. Frequent rebalances are a common cause of growing lag.
# Check consumer group membership changes
# Look for JoinGroup and LeaveGroup in broker logs
# Or check consumer logs for:
grep -i "rebalanc" /var/log/my-consumer/app.log | tail -20
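To tell chronic rebalancing from a one-off, bucket the matching log lines by hour. A sketch assuming ISO-style timestamps (e.g. `2024-05-01T03:12:45`) at the start of each line; adjust the `substr` for your log format:

```shell
# Count rebalance-related log lines per hour; chronic rebalancing shows up
# as nonzero counts in every bucket.
rebalances_per_hour() {
  grep -i "rebalanc" \
    | awk '{hour = substr($1, 1, 13); count[hour]++}
           END {for (h in count) print h, count[h]}' \
    | sort
}

# rebalances_per_hour < /var/log/my-consumer/app.log
```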
Common Rebalance Triggers
1. Consumer exceeding max.poll.interval.ms
If processing a batch of records takes longer than max.poll.interval.ms (default: 5 minutes), the broker considers the consumer dead and triggers a rebalance.
# Increase if processing is legitimately slow
max.poll.interval.ms=600000
# OR reduce records per poll to process faster
max.poll.records=100
2. Session timeout
If the consumer doesn’t send a heartbeat within session.timeout.ms (default: 45 seconds in newer clients), the broker kicks it out.
# Increase if network is flaky
session.timeout.ms=60000
heartbeat.interval.ms=20000
3. Consumer crashes and restarts
Check your container orchestrator (Kubernetes, ECS) for OOMKilled or CrashLoopBackOff events:
# Kubernetes
kubectl get pods -l app=my-cdc-consumer --watch
kubectl describe pod my-cdc-consumer-xyz | grep -A5 "Last State"
# Check for OOM kills
kubectl logs my-cdc-consumer-xyz --previous | tail -50
4. Rolling deployments
Deploying new consumer versions causes a rebalance for each instance that restarts. Use Kafka’s group.instance.id (static group membership) to minimize rebalance impact:
# Assign a stable ID to each consumer instance
group.instance.id=consumer-host-1
session.timeout.ms=60000
With static membership, a consumer that reconnects within the session timeout gets its old partitions back without triggering a full rebalance.
Step 6: Check for GC Pauses
Long GC pauses cause the consumer to miss heartbeats, triggering rebalances. Even short pauses add latency.
Check GC Logs
# If using G1GC with unified logging (JDK 11+)
# Look for pauses > 200ms
grep "Pause" gc.log | awk '{
for(i=1;i<=NF;i++) if($i ~ /ms$/) {
gsub("ms","",$i);
if($i+0 > 200) print $0
}
}'
# Quick summary of GC pause durations
grep "Pause" gc.log | grep -oP '[0-9.]+ms' | sort -rn | head -20
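To summarize instead of eyeballing the sorted list, the same extraction can feed a count/max/mean. A sketch assuming the JDK 11+ unified-logging format used above:

```shell
# Count / max / mean of GC pause durations from a unified-logging gc.log.
gc_pause_summary() {
  grep "Pause" \
    | grep -oE '[0-9.]+ms' \
    | tr -d 'ms' \
    | awk '{sum += $1; n++; if ($1 > max) max = $1}
           END {if (n) printf "pauses: %d  max: %.1fms  mean: %.1fms\n", n, max, sum/n}'
}

# gc_pause_summary < gc.log
```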
Common GC Fixes
| Problem | Fix |
|---|---|
| Long full GC pauses (> 5s) | Reduce heap size or switch to G1/ZGC |
| Frequent young GC | Increase young gen size or reduce allocation rate |
| Heap growing over time | Memory leak — profile with jmap/async-profiler |
| Large heap with rare but long GC | Switch to ZGC (sub-ms pauses regardless of heap size) |
# JVM flags for better GC behavior
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:G1HeapRegionSize=16m
-Xms4g -Xmx4g # Fixed heap size avoids resize pauses
# For JDK 17+, ZGC is production-ready
-XX:+UseZGC
-Xms8g -Xmx8g
Step 7: Check the Sink/Destination
If the consumer fetches messages quickly but lag still grows, the bottleneck is likely the destination write.
Common Sink Bottlenecks
Database destinations (PostgreSQL, MySQL):
-- Check for lock contention
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type IS NOT NULL
AND state = 'active';
-- Check connection pool usage
SELECT count(*) FROM pg_stat_activity
WHERE datname = 'destination_db';
Data warehouse destinations (Snowflake, BigQuery, Databricks):
- Check for queued queries or warehouse throttling
- Look at API rate limit responses (429 status codes)
- Verify batch/micro-batch size — too-small batches mean too many API calls
Elasticsearch/OpenSearch:
- Check bulk indexing queue depth
- Look for rejected requests in cluster stats
- Monitor merge activity (heavy merging blocks indexing)
General sink checks:
# Check destination write latency from consumer metrics
# If you instrument your sink connector:
rate(sink_write_duration_seconds_sum[5m])
/ rate(sink_write_duration_seconds_count[5m])
# Check for write errors
rate(sink_write_errors_total[5m])
If the sink is the bottleneck, you have several options:
- Batch writes instead of single-record writes
- Increase connection pool size to the destination
- Scale the destination (bigger warehouse, more Elasticsearch nodes)
- Reduce write frequency by buffering in the consumer
For CDC-specific sink patterns, see CDC destination patterns.
Step 8: Decide — Scale or Fix?
At this point, you know the bottleneck. Here is the decision tree:
When to Scale (Add More Consumers)
- Consumer CPU is consistently > 70% across all instances
- All partitions have roughly equal lag (no hot partitions)
- You have more partitions than consumers (prerequisite for scaling)
- Processing logic is already optimized
# Check number of partitions
kafka-topics.sh \
--bootstrap-server $KAFKA_BOOTSTRAP \
--describe \
--topic cdc.orders \
| head -1
# If partitions = consumers, you need more partitions first
kafka-topics.sh \
--bootstrap-server $KAFKA_BOOTSTRAP \
--alter \
--topic cdc.orders \
--partitions 24
Important: Increasing partitions changes the key-to-partition mapping: records with the same key can land on a different partition after the change, which breaks per-key ordering across the boundary. If your pipeline depends on per-key ordering (common in CDC), verify that your partition key still routes correctly before altering the topic.
When to Fix (Optimize Processing)
- Consumer CPU is low (< 40%) but throughput is also low → blocking I/O
- One consumer is much slower than others → investigate that instance
- GC pauses dominate the timeline → tune JVM
- Sink is the bottleneck → optimize writes or scale the destination
- Hot partition → fix partition key distribution
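To confirm a skewed key before changing anything, sample recent keys and count the distribution; a single dominant key confirms the hot partition. The console-consumer flags below are standard options, but the sample size and timeout are arbitrary:

```shell
# Count the most frequent keys in a sample of records.
key_distribution() { sort | uniq -c | sort -rn | head -10; }

# Sample ~1000 recent keys and feed them in:
# kafka-console-consumer.sh --bootstrap-server $KAFKA_BOOTSTRAP \
#   --topic cdc.orders --property print.key=true --property print.value=false \
#   --max-messages 1000 --timeout-ms 30000 | key_distribution
```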
When to Do Both
If you are in an active incident with growing lag:
- Immediate: Scale consumers (if partitions allow) to stop the bleeding
- Short-term: Optimize the identified bottleneck
- Long-term: Set up monitoring and alerts so you catch it earlier next time
Setting Up Lag Monitoring
After resolving the incident, set up monitoring to catch lag before it becomes an emergency:
# Prometheus alerting rules
groups:
- name: kafka-consumer-lag
rules:
- alert: ConsumerLagHigh
expr: |
sum by (group) (
kafka_consumer_group_lag
) > 100000
for: 10m
labels:
severity: warning
annotations:
summary: "Consumer group {{ $labels.group }} lag > 100k for 10 minutes"
- alert: ConsumerLagCritical
expr: |
sum by (group) (
kafka_consumer_group_lag
) > 1000000
for: 5m
labels:
severity: critical
annotations:
summary: "Consumer group {{ $labels.group }} lag > 1M for 5 minutes"
- alert: ConsumerLagGrowing
expr: |
deriv(
sum by (group) (kafka_consumer_group_lag)[30m:]
) > 100
for: 15m
labels:
severity: warning
annotations:
summary: "Consumer group {{ $labels.group }} lag is growing steadily"
The ConsumerLagGrowing alert is the most useful for early detection — it fires when lag is trending upward, even if the absolute number is still small. This gives you time to investigate before it becomes a production incident.
For a broader view of monitoring streaming infrastructure, see data freshness monitoring.
Quick Reference: The Triage Checklist
- Consumer group state → Is it Stable, Rebalancing, or Empty?
- Partition lag distribution → Even lag or hot partitions?
- Production rate → Did throughput spike on the source side?
- Consumer processing time → Where is the time spent (fetch, process, write)?
- Rebalance frequency → Are consumers bouncing?
- GC pauses → Is the JVM stalling?
- Sink health → Is the destination keeping up?
- Scale or fix → Based on the bottleneck, choose the right action
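The first two checklist items can be collapsed into a single snapshot pass. A sketch over the `--describe` output, assuming the column layout from step 1 (PARTITION in column 3, LAG in column 6):

```shell
# Total lag plus the single worst partition from one --describe snapshot.
triage_snapshot() {
  awk 'NR>1 && $6 ~ /^[0-9]+$/ {
         sum += $6; n++
         if ($6 + 0 > worst) { worst = $6; worst_p = $3 }
       }
       END {
         printf "total lag %d over %d partitions; worst: partition %s (%d)\n",
                sum, n, worst_p, worst
       }'
}

# kafka-consumer-groups.sh --bootstrap-server $KAFKA_BOOTSTRAP \
#   --describe --group my-cdc-sink | triage_snapshot
```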
When to Use a Managed Pipeline Instead
If you find yourself debugging consumer lag, managing rebalances, and tuning JVM settings regularly, it is worth asking whether the operational cost is justified. Managed streaming platforms like Streamkap handle consumer group management, automatic scaling, and lag monitoring internally — you configure the source and destination, and the platform manages the Kafka consumer layer.
This does not eliminate the need to understand consumer lag (destination bottlenecks can still cause backpressure, for example), but it reduces the operational surface from “everything in this runbook” to “check the destination.”
Ready to stop debugging consumer lag at 3 AM? Streamkap manages the entire streaming pipeline including consumer scaling, offset management, and automatic lag recovery — so you can focus on your data instead of your infrastructure. Start a free trial or learn more about the platform.