Backpressure in Stream Processing: What It Is and How to Handle It
Learn what backpressure means in streaming pipelines, how to detect it, and practical strategies for handling it in Kafka, Flink, and CDC pipelines without losing data.
Every streaming pipeline has a speed limit, and it is set by the slowest component. When a downstream operator cannot keep up with the rate of incoming data, the system has three choices: drop records, buffer indefinitely until memory runs out, or push back against the upstream producer. That third option is backpressure, and understanding it is essential for anyone building or operating real-time data pipelines.
This guide explains how backpressure works in practice, how to detect it, what causes it, and what you can do about it.
How Backpressure Works
Backpressure is a flow control mechanism. When a consumer falls behind, it signals - directly or indirectly - that the producer should slow down. The goal is to prevent data loss and memory exhaustion by matching the entire pipeline’s throughput to the capacity of the slowest stage.
Credit-Based Flow Control in Flink
Apache Flink uses a credit-based flow control system between task managers. Each downstream task grants “credits” to its upstream task, where one credit corresponds to one network buffer. The upstream task can only send data when it holds credits. When the downstream task’s input buffers fill up because it is processing slowly, it stops granting credits. This causes the upstream task’s output buffers to fill, which in turn prevents it from processing more records. The backpressure propagates all the way to the source operator.
Source → Map → Window → Sink
←credits← ←credits← ←credits←
If the Sink is slow, credits stop flowing right-to-left. The Window operator’s output buffers fill, then the Map operator’s, and finally the Source stops reading from Kafka.
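The mechanism can be illustrated with a toy simulation (a deliberately simplified model, not Flink's actual implementation): each stage grants one credit per free input buffer, a stage may only send downstream while it holds credits, and so a slow sink eventually throttles the source.

```python
from collections import deque

BUFFERS = 4  # input buffers per stage; one credit is granted per free buffer

class Stage:
    def __init__(self, name, per_tick):
        self.name = name
        self.inbox = deque()
        self.per_tick = per_tick  # records this stage can handle per tick

    @property
    def credits(self):
        return BUFFERS - len(self.inbox)  # free buffers = credits granted upstream

def tick(source_rate, stages):
    # The source may emit only while the first stage grants credits.
    sent = min(source_rate, stages[0].credits)
    stages[0].inbox.extend([None] * sent)
    # Each stage forwards downstream only while the next stage grants credits.
    for up, down in zip(stages, stages[1:]):
        n = min(up.per_tick, len(up.inbox), down.credits)
        for _ in range(n):
            up.inbox.popleft()
            down.inbox.append(None)
    # The sink drains its inbox at its own (slow) rate.
    sink = stages[-1]
    for _ in range(min(sink.per_tick, len(sink.inbox))):
        sink.inbox.popleft()
    return sent

stages = [Stage("map", 4), Stage("window", 4), Stage("sink", 1)]
for _ in range(20):
    emitted = tick(source_rate=4, stages=stages)
# Once every buffer fills, the source is throttled to the sink's rate: 1 record/tick.
```

No buffer can overflow, because nothing is ever sent without a credit; that is the point of the design.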
Consumer Lag in Kafka
Kafka handles backpressure differently because it decouples producers from consumers through persistent, partitioned logs. There is no direct signal from consumer to producer. Instead, backpressure manifests as consumer lag - the growing offset difference between the latest produced message and the latest consumed message. The producer continues writing at full speed while the consumer falls further behind.
# Check consumer lag for a consumer group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group my-pipeline-group
This decoupling is a strength (producers are never slowed) and a weakness (lag can grow unnoticed until retention expires and data is lost).
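As a sketch, lag is just arithmetic over two sets of offsets. The offset values below are made up for illustration; in practice you would fetch them from the AdminClient or a metrics exporter:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag = latest produced offset - latest committed offset."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

# Illustrative offsets for a two-partition topic:
log_end = {0: 120_500, 1: 118_200}
committed = {0: 120_000, 1: 95_000}
lag = consumer_lag(log_end, committed)   # {0: 500, 1: 23200}
total_lag = sum(lag.values())            # 23700
```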
Detecting Backpressure
Early detection prevents small problems from cascading into pipeline failures.
Flink Web UI
The Flink dashboard provides real-time backpressure status for every operator:
- OK - Output buffers are rarely full. The operator is keeping up.
- LOW - Occasional buffer saturation. Worth monitoring.
- HIGH - Output buffers are consistently full. This operator is being slowed by a downstream bottleneck.
The backpressure tab samples the task’s stack traces to determine whether it is spending time waiting on output buffers (isBackPressured) or actively processing (isBusy).
Kafka Consumer Lag
Monitor lag with built-in tools or metrics exporters:
# One-time lag check
kafka-consumer-groups.sh --describe --group cdc-pipeline
# Key metrics to export to your monitoring system:
# - records-lag-max (per partition)
# - records-lag-avg (per partition)
# - fetch-rate (consumer-level)
A lag that grows linearly indicates a sustained throughput mismatch. A lag that spikes and then recovers indicates a transient burst. A lag that grows exponentially suggests a cascading failure.
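These three patterns can be told apart programmatically. A rough sketch, with illustrative (untuned) heuristics, over lag samples taken at a fixed interval:

```python
def classify_lag_trend(samples):
    """samples: lag measurements taken at a fixed interval, oldest first."""
    if len(samples) < 3:
        return "insufficient data"
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    if samples[-1] <= samples[0]:
        return "transient: lag recovered"
    if all(later > earlier > 0 for earlier, later in zip(deltas, deltas[1:])):
        return "accelerating: possible cascading failure"
    if all(d > 0 for d in deltas):
        return "sustained growth: throughput mismatch"
    return "mixed: keep monitoring"

classify_lag_trend([100, 4_000, 2_500, 90])      # spike that drained
classify_lag_trend([100, 3_100, 6_100, 9_100])   # steady linear growth
classify_lag_trend([100, 300, 900, 2_700])       # each delta larger than the last
```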
Latency Metrics
End-to-end latency - the time between a record being produced at the source and arriving at the destination - is the most user-visible symptom of backpressure. If your CDC pipeline normally delivers changes in under 5 seconds and latency climbs to 30 seconds, backpressure is the likely cause even if you have not looked at Flink or Kafka metrics yet.
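A minimal sketch of the idea: stamp each record with a produce-side timestamp and compare at the destination. The field name and SLA value here are assumptions:

```python
import time

def stamp(record):
    """Source side: attach a wall-clock timestamp when the change is captured."""
    record["produced_at"] = time.time()
    return record

def observe_latency(record, sla_seconds=5.0):
    """Destination side: end-to-end latency = arrival time - produce time."""
    latency = time.time() - record["produced_at"]
    if latency > sla_seconds:
        print(f"latency {latency:.1f}s exceeds {sla_seconds}s SLA; suspect backpressure")
    return latency

r = stamp({"op": "update", "id": 42})
latency = observe_latency(r)  # near zero here; climbs when the pipeline backs up
```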
Common Causes of Backpressure
Slow Sinks
The most frequent cause. Destination systems have finite write capacity:
- A Snowflake warehouse is queued behind other queries
- BigQuery hits streaming insert quota limits
- A PostgreSQL destination has lock contention on the target table
- An Elasticsearch cluster is running a heavy merge
Expensive Transformations
Complex processing logic per record adds up:
- Joins against external systems (HTTP lookups, database enrichment)
- Large window aggregations holding significant state
- CPU-intensive serialization or format conversion
- User-defined functions with unoptimized code
Skewed Partitions
If one Kafka partition receives disproportionate traffic (a hot key), the task processing that partition becomes the bottleneck while other tasks sit idle.
Resource Constraints
Insufficient memory causes frequent garbage collection pauses. Insufficient CPU causes processing to fall behind. Network bandwidth saturation slows data transfer between operators or to external systems.
Flink Backpressure Diagnosis
Flink exposes two critical metrics per operator that help you pinpoint the bottleneck:
- isBackPressured - The operator is waiting to send output because downstream buffers are full. This operator is a victim, not the cause.
- isBusy - The operator is actively processing records or waiting on external I/O. A high busy ratio combined with low backpressure means this operator is the bottleneck.
Finding the Bottleneck Operator
Walk the operator chain from source to sink:
Operator | BackPressured | Busy
----------------|---------------|------
Source (Kafka) | HIGH | LOW ← victim, waiting to send
Deserialize | HIGH | LOW ← victim
Filter/Transform| HIGH | LOW ← victim
Sink (Snowflake)| LOW | HIGH ← BOTTLENECK
The pattern is clear: follow the chain until you find the first operator that is BUSY but not backpressured. That is where the pipeline is choking.
In the Flink Web UI, click on any operator and open the BackPressure tab. If it shows HIGH, move downstream. When you find the operator showing OK backpressure but high busy time, you have found your target.
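That walk is mechanical enough to script. A sketch, using the sample values from the table above:

```python
def find_bottleneck(chain):
    """Return the first operator (in source->sink order) that is busy but not backpressured."""
    for op in chain:
        if op["busy"] == "HIGH" and op["backpressured"] == "LOW":
            return op["name"]
    return None

chain = [
    {"name": "Source (Kafka)",   "backpressured": "HIGH", "busy": "LOW"},
    {"name": "Deserialize",      "backpressured": "HIGH", "busy": "LOW"},
    {"name": "Filter/Transform", "backpressured": "HIGH", "busy": "LOW"},
    {"name": "Sink (Snowflake)", "backpressured": "LOW",  "busy": "HIGH"},
]
find_bottleneck(chain)  # "Sink (Snowflake)"
```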
Common Flink Bottleneck Patterns
- Sink operator busy - Destination cannot absorb writes fast enough. Increase sink parallelism, batch writes, or scale the destination.
- Window operator busy - State access is slow, often due to RocksDB compaction or large state size. Tune RocksDB settings or reduce window size.
- Deserialization operator busy - Schema resolution or format parsing is expensive. Cache schemas, use a faster serialization format.
Kafka-Level Backpressure
While Kafka itself does not enforce backpressure on producers, managing consumer-side throughput is critical.
Consumer Lag Growth
Sustained lag growth means your consumer’s processing rate is below the production rate:
Production rate: 10,000 records/sec
Consumption rate: 7,000 records/sec
Lag growth: 3,000 records/sec
Consumer time-lag growth: 0.3 seconds per second (consuming at 70% of the production rate)
Time until a 24-hour retention period expires unread records: 24 h / 0.3 ≈ 80 hours
If lag exceeds your topic’s retention period, records are deleted before being consumed. Data is permanently lost.
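A back-of-envelope sketch of when that loss begins, using the example rates: the consumer's time-lag grows at (1 - consume/produce) seconds per second, and records start expiring once the time-lag reaches the retention period.

```python
def hours_until_data_loss(produce_rate, consume_rate, retention_hours):
    if consume_rate >= produce_rate:
        return float("inf")  # the consumer keeps up; lag never reaches retention
    time_lag_growth = 1 - consume_rate / produce_rate  # seconds of lag per second
    return retention_hours / time_lag_growth

hours_until_data_loss(10_000, 7_000, retention_hours=24)  # ~80 hours
```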
Partition Imbalance
Check per-partition lag. If one partition has significantly higher lag than others, you likely have a hot key problem:
kafka-consumer-groups.sh --describe --group my-group
# Output:
# TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
# events 0 1000000 1000500 500
# events 1 1000000 1000200 200
# events 2 1000000 1005000 5000 ← hot partition
Solutions include repartitioning with a different key or adding a pre-processing step to distribute hot keys across partitions.
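A quick way to flag hot partitions from per-partition lag is to compare each partition against the group median. The multiplier here is an assumption to tune for your workload:

```python
def hot_partitions(lag_by_partition, factor=5):
    """Flag partitions whose lag exceeds factor x the median partition lag."""
    lags = sorted(lag_by_partition.values())
    median = lags[len(lags) // 2]
    return [p for p, lag in lag_by_partition.items()
            if median > 0 and lag > factor * median]

# The per-partition lags from the example output above:
hot_partitions({0: 500, 1: 200, 2: 5_000})  # [2]
```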
Fetch Configuration Tuning
Consumer fetch settings directly affect throughput:
# Increase max bytes per fetch to reduce round trips
fetch.max.bytes=52428800
max.partition.fetch.bytes=10485760
# Increase max poll records for batch efficiency
max.poll.records=1000
# Reduce fetch wait time for lower latency
fetch.max.wait.ms=100
Larger fetch sizes improve throughput at the cost of memory. Smaller wait times improve latency at the cost of more network requests.
Handling Strategies
Scale Out the Bottleneck
If the sink is the bottleneck, increase its parallelism. In Flink, set the sink operator’s parallelism independently:
stream
    .map(new TransformFunction())
    .setParallelism(4)
    .addSink(new SnowflakeSink())
    .setParallelism(8); // double the sink parallelism
For Kafka consumers, increase the number of partitions and consumer instances (consumer parallelism cannot exceed partition count).
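A rough sizing sketch for either case: divide peak throughput by the per-task throughput you have measured, with some headroom. Both inputs are numbers you must measure, not defaults:

```python
import math

def required_parallelism(peak_rate, per_task_rate, headroom=1.2):
    """Tasks needed to absorb peak throughput, with 20% headroom by default."""
    return math.ceil(peak_rate * headroom / per_task_rate)

# e.g. 20,000 records/sec peak against a sink task measured at 3,000 records/sec:
required_parallelism(peak_rate=20_000, per_task_rate=3_000)  # 8
```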
Optimize Transformations
Profile your processing logic. Common wins include:
- Cache external lookups - If you enrich records by calling an API, cache results with a TTL.
- Reduce serialization - Avoid repeated serialize/deserialize cycles between operators.
- Pre-aggregate - If a downstream system is slow, reduce the volume it receives by aggregating upstream.
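The first tactic, caching external lookups with a TTL, can be sketched like this; cached_lookup and its fetch parameter are hypothetical names standing in for your enrichment call:

```python
import time

_cache = {}  # key -> (cached_at, value)

def cached_lookup(key, fetch, ttl_seconds=300):
    """Return a cached enrichment result if it is still fresh; otherwise call fetch."""
    now = time.time()
    hit = _cache.get(key)
    if hit is not None and now - hit[0] < ttl_seconds:
        return hit[1]            # fresh hit: no remote call
    value = fetch(key)           # miss or expired: one remote call, then cache
    _cache[key] = (now, value)
    return value
```

With a hot key, repeated records hit the cache instead of the external system, and the TTL bounds how stale the enrichment can get.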
Batch Writes to the Destination
Many destinations perform significantly better with batched writes:
# Instead of single-row inserts:
for record in batch:
    execute("INSERT INTO table VALUES (?)", record)
# Use bulk operations:
execute_batch("INSERT INTO table VALUES (?)", batch, batch_size=5000)
Snowflake, BigQuery, and Elasticsearch all show 10-50x throughput improvements with properly sized batches.
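In a streaming sink you rarely receive a ready-made batch, so a common pattern is a buffer that flushes on size or age, whichever comes first. A sketch, where flush_fn stands in for your bulk write (execute_batch, an Elasticsearch _bulk request, and so on):

```python
import time

class WriteBuffer:
    """Accumulate records and flush on batch size or age, whichever comes first."""

    def __init__(self, flush_fn, max_records=5_000, max_age_seconds=5.0):
        self.flush_fn = flush_fn
        self.max_records = max_records
        self.max_age = max_age_seconds
        self.records = []
        self.oldest = None

    def add(self, record):
        if not self.records:
            self.oldest = time.time()  # age of the batch = age of its first record
        self.records.append(record)
        if (len(self.records) >= self.max_records
                or time.time() - self.oldest >= self.max_age):
            self.flush()

    def flush(self):
        if self.records:
            self.flush_fn(self.records)  # one bulk write instead of N single inserts
            self.records = []
```

The age limit matters as much as the size limit: without it, a quiet period leaves the last partial batch stranded in memory.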
Add Buffering
Use Kafka as an explicit buffer between pipeline stages. If your transformation is fast but your sink is slow, write transformed records to an intermediate topic and let the sink consumer read at its own pace. This isolates the slow component from the rest of the pipeline.
Practical Example: CDC Pipeline with a Destination Bottleneck
Consider a common scenario: a PostgreSQL source with CDC streaming changes through Kafka to a Snowflake destination.
During normal operation, the pipeline handles 2,000 changes per second with sub-10-second end-to-end latency. Then a batch job at the source performs a bulk update affecting 5 million rows.
- Kafka absorbs the initial burst - The producer writes all 5 million change events to Kafka within minutes. Kafka’s retention provides a buffer.
- Snowflake cannot keep up - The warehouse processes bulk inserts at 3,000 rows per second, but changes are arriving at 20,000 per second. Consumer lag begins growing.
- Flink shows backpressure - The sink operator shows HIGH busy time. All upstream operators show HIGH backpressure.
- Resolution - Increase Snowflake warehouse size from X-Small to Medium for the duration of the backfill, increase sink batch size from 1,000 to 10,000 rows, and add a second sink task. Throughput rises to 25,000 rows per second. Lag drains within 15 minutes.
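The drain time here is predictable arithmetic. A sketch with the example's numbers, assuming roughly 4.25 million events are backlogged by the end of the burst (5 million arriving at 20,000/sec while the sink drains 3,000/sec):

```python
def drain_minutes(backlog, sink_rate, arrival_rate):
    if sink_rate <= arrival_rate:
        return float("inf")  # lag keeps growing; the scale-up was insufficient
    return backlog / (sink_rate - arrival_rate) / 60

# After scaling: sink at 25,000/sec, steady arrivals back at ~2,000/sec:
drain_minutes(4_250_000, sink_rate=25_000, arrival_rate=2_000)  # ~3 minutes
```

The remainder of the quoted 15 minutes is the time to detect the problem and resize the warehouse, not the drain itself.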
Streamkap’s managed CDC platform handles this scenario by monitoring consumer lag and sink throughput, automatically adjusting batch sizes and scaling processing capacity when it detects sustained backpressure.
Preventing Backpressure
Capacity Planning
Know your steady-state and peak throughput requirements. Size your destination for peak, not average. A warehouse that handles normal load comfortably may buckle during end-of-day batch processing or promotional events.
Load Testing
Simulate peak volumes before they hit production:
# Generate synthetic load to test pipeline capacity
kafka-producer-perf-test.sh \
--topic test-topic \
--num-records 1000000 \
--record-size 1024 \
--throughput 50000 \
--producer-props bootstrap.servers=localhost:9092
Measure end-to-end latency and consumer lag under sustained load. If lag grows, you know your ceiling.
Auto-Scaling
Configure your pipeline to scale horizontally in response to lag growth. This means having enough Kafka partitions to allow additional consumers, using auto-scaling sink infrastructure (serverless warehouses, elastic clusters), and monitoring lag thresholds that trigger scale-up actions.
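A sketch of the lag check such a trigger might run on a schedule; the threshold, and the idea of requiring several consecutive breaches so a transient spike does not trigger a scale-up, are assumptions to tune:

```python
def should_scale_up(lag_samples, threshold, sustained_checks=3):
    """True only when lag has stayed above threshold for the last N checks."""
    recent = lag_samples[-sustained_checks:]
    return len(recent) == sustained_checks and all(l > threshold for l in recent)

should_scale_up([10_000, 90_000, 120_000, 150_000], threshold=50_000)  # True
should_scale_up([10_000, 90_000, 4_000], threshold=50_000)             # False
```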
When Backpressure Is Acceptable
Not all backpressure requires intervention. It is a healthy response in several situations:
- Transient spikes - A burst of activity that resolves within minutes. Kafka’s buffering absorbs the spike, lag drains naturally, and no action is needed.
- Planned backfills - Initial data loads or historical replays will naturally cause temporary backpressure. As long as progress is steady and you are not at risk of exceeding retention, let it run.
- Graceful degradation - During a partial outage (destination maintenance, network issues), backpressure prevents data loss by slowing the pipeline rather than failing. Once the issue resolves, the pipeline catches up from the Kafka buffer.
The threshold for action depends on your latency requirements. If your SLA allows 5 minutes of latency and a spike causes 2 minutes, monitoring is sufficient. If your use case demands sub-second delivery, any sustained backpressure requires immediate response.
The key is distinguishing between backpressure as a temporary safety mechanism and backpressure as a symptom of a fundamental capacity mismatch. The former resolves on its own. The latter requires architectural changes - more parallelism, faster destinations, or reduced data volume through filtering and aggregation upstream.