Backpressure in Stream Processing: What It Is and How to Handle It
Learn what backpressure means in streaming pipelines, how to detect it, and practical strategies for handling it in Kafka, Flink, and CDC pipelines without losing data.
Every streaming pipeline has a speed limit, and it is set by the slowest component. When a downstream operator cannot keep up with the rate of incoming data, the system has three choices: drop records, buffer indefinitely until memory runs out, or push back against the upstream producer. That third option is backpressure, and understanding it is essential for anyone building or operating real-time data pipelines.
This guide explains how backpressure works in practice, how to detect it, what causes it, and what you can do about it.
How Backpressure Works
Backpressure is a flow control mechanism. When a consumer falls behind, it signals - directly or indirectly - that the producer should slow down. The goal is to prevent data loss and memory exhaustion by matching the entire pipeline’s throughput to the capacity of the slowest stage.
Credit-Based Flow Control in Flink
Apache Flink uses a credit-based flow control system between task managers. Each downstream task grants “credits” to its upstream task, where one credit corresponds to one network buffer. The upstream task can only send data when it holds credits. When the downstream task’s input buffers fill up because it is processing slowly, it stops granting credits. This causes the upstream task’s output buffers to fill, which in turn prevents it from processing more records. The backpressure propagates all the way to the source operator.
Source → Map → Window → Sink
←credits← ←credits← ←credits←
If the Sink is slow, credits stop flowing right-to-left. The Window operator’s output buffers fill, then the Map operator’s, and finally the Source stops reading from Kafka.
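The mechanism can be illustrated with a toy simulation (a deliberately simplified model, not Flink's actual implementation): each stage grants one credit per free input buffer, a stage may only send downstream while it holds credits, and so a slow sink eventually throttles the source.

```python
from collections import deque

BUFFERS = 4  # input buffers per stage; one credit is granted per free buffer

class Stage:
    def __init__(self, name, per_tick):
        self.name = name
        self.inbox = deque()
        self.per_tick = per_tick  # records this stage can handle per tick

    @property
    def credits(self):
        return BUFFERS - len(self.inbox)  # free buffers = credits granted upstream

def tick(source_rate, stages):
    # The source may emit only while the first stage grants credits.
    sent = min(source_rate, stages[0].credits)
    stages[0].inbox.extend([None] * sent)
    # Each stage forwards downstream only while the next stage grants credits.
    for up, down in zip(stages, stages[1:]):
        n = min(up.per_tick, len(up.inbox), down.credits)
        for _ in range(n):
            up.inbox.popleft()
            down.inbox.append(None)
    # The sink drains its inbox at its own (slow) rate.
    sink = stages[-1]
    for _ in range(min(sink.per_tick, len(sink.inbox))):
        sink.inbox.popleft()
    return sent

stages = [Stage("map", 4), Stage("window", 4), Stage("sink", 1)]
for _ in range(20):
    emitted = tick(source_rate=4, stages=stages)
# Once every buffer fills, the source is throttled to the sink's rate: 1 record/tick.
```

No buffer can overflow, because nothing is ever sent without a credit; that is the point of the design.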
Consumer Lag in Kafka
Kafka handles backpressure differently because it decouples producers from consumers through persistent, partitioned logs. There is no direct signal from consumer to producer. Instead, backpressure manifests as consumer lag - the growing offset difference between the latest produced message and the latest consumed message. The producer continues writing at full speed while the consumer falls further behind.
# Check consumer lag for a consumer group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group my-pipeline-group
This decoupling is a strength (producers are never slowed) and a weakness (lag can grow unnoticed until retention expires and data is lost).
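As a sketch, lag is just arithmetic over two sets of offsets. The offset values below are made up for illustration; in practice you would fetch them from the AdminClient or a metrics exporter:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag = latest produced offset - latest committed offset."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

# Illustrative offsets for a two-partition topic:
log_end = {0: 120_500, 1: 118_200}
committed = {0: 120_000, 1: 95_000}
lag = consumer_lag(log_end, committed)   # {0: 500, 1: 23200}
total_lag = sum(lag.values())            # 23700
```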
Detecting Backpressure
Early detection prevents small problems from cascading into pipeline failures.
Flink Web UI
The Flink dashboard provides real-time backpressure status for every operator:
- OK - Output buffers are rarely full. The operator is keeping up.
- LOW - Occasional buffer saturation. Worth monitoring.
- HIGH - Output buffers are consistently full. This operator is being slowed by a downstream bottleneck.
The backpressure tab samples the task’s stack traces to determine whether it is spending time waiting on output buffers (isBackPressured) or actively processing (isBusy).
Kafka Consumer Lag
Monitor lag with built-in tools or metrics exporters:
# One-time lag check
kafka-consumer-groups.sh --describe --group cdc-pipeline
# Key metrics to export to your monitoring system:
# - records-lag-max (per partition)
# - records-lag-avg (per partition)
# - fetch-rate (consumer-level)
A lag that grows linearly indicates a sustained throughput mismatch. A lag that spikes and then recovers indicates a transient burst. A lag that grows exponentially suggests a cascading failure.
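These three patterns can be told apart programmatically. A rough sketch, with illustrative (untuned) heuristics, over lag samples taken at a fixed interval:

```python
def classify_lag_trend(samples):
    """samples: lag measurements taken at a fixed interval, oldest first."""
    if len(samples) < 3:
        return "insufficient data"
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    if samples[-1] <= samples[0]:
        return "transient: lag recovered"
    if all(later > earlier > 0 for earlier, later in zip(deltas, deltas[1:])):
        return "accelerating: possible cascading failure"
    if all(d > 0 for d in deltas):
        return "sustained growth: throughput mismatch"
    return "mixed: keep monitoring"

classify_lag_trend([100, 4_000, 2_500, 90])      # spike that drained
classify_lag_trend([100, 3_100, 6_100, 9_100])   # steady linear growth
classify_lag_trend([100, 300, 900, 2_700])       # each delta larger than the last
```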
Latency Metrics
End-to-end latency - the time between a record being produced at the source and arriving at the destination - is the most user-visible symptom of backpressure. If your CDC pipeline normally delivers changes in under 5 seconds and latency climbs to 30 seconds, backpressure is the likely cause even if you have not looked at Flink or Kafka metrics yet.
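A minimal sketch of the idea: stamp each record with a produce-side timestamp and compare at the destination. The field name and SLA value here are assumptions:

```python
import time

def stamp(record):
    """Source side: attach a wall-clock timestamp when the change is captured."""
    record["produced_at"] = time.time()
    return record

def observe_latency(record, sla_seconds=5.0):
    """Destination side: end-to-end latency = arrival time - produce time."""
    latency = time.time() - record["produced_at"]
    if latency > sla_seconds:
        print(f"latency {latency:.1f}s exceeds {sla_seconds}s SLA; suspect backpressure")
    return latency

r = stamp({"op": "update", "id": 42})
latency = observe_latency(r)  # near zero here; climbs when the pipeline backs up
```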
Common Causes of Backpressure
Slow Sinks
The most frequent cause. Destination systems have finite write capacity:
- A Snowflake warehouse is queued behind other queries
- BigQuery hits streaming insert quota limits
- A PostgreSQL destination has lock contention on the target table
- An Elasticsearch cluster is running a heavy merge
Expensive Transformations
Complex processing logic per record adds up:
- Joins against external systems (HTTP lookups, database enrichment)
- Large window aggregations holding significant state
- CPU-intensive serialization or format conversion
- User-defined functions with unoptimized code
Skewed Partitions
If one Kafka partition receives disproportionate traffic (a hot key), the task processing that partition becomes the bottleneck while other tasks sit idle.
Resource Constraints
Insufficient memory causes frequent garbage collection pauses. Insufficient CPU causes processing to fall behind. Network bandwidth saturation slows data transfer between operators or to external systems.
Flink Backpressure Diagnosis
Flink exposes two critical metrics per operator that help you pinpoint the bottleneck:
- isBackPressured - The operator is waiting to send output because downstream buffers are full. This operator is a victim, not the cause.
- isBusy - The operator is actively processing records or waiting on external I/O. A high busy ratio combined with low backpressure means this operator is the bottleneck.
Finding the Bottleneck Operator
Walk the operator chain from source to sink:
Operator | BackPressured | Busy
----------------|---------------|------
Source (Kafka) | HIGH | LOW ← victim, waiting to send
Deserialize | HIGH | LOW ← victim
Filter/Transform| HIGH | LOW ← victim
Sink (Snowflake)| LOW | HIGH ← BOTTLENECK
The pattern is clear: follow the chain until you find the first operator that is BUSY but not backpressured. That is where the pipeline is choking.
In the Flink Web UI, click on any operator and open the BackPressure tab. If it shows HIGH, move downstream. When you find the operator showing OK backpressure but high busy time, you have found your target.
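That walk is mechanical enough to script. A sketch, using the sample values from the table above:

```python
def find_bottleneck(chain):
    """Return the first operator (in source->sink order) that is busy but not backpressured."""
    for op in chain:
        if op["busy"] == "HIGH" and op["backpressured"] == "LOW":
            return op["name"]
    return None

chain = [
    {"name": "Source (Kafka)",   "backpressured": "HIGH", "busy": "LOW"},
    {"name": "Deserialize",      "backpressured": "HIGH", "busy": "LOW"},
    {"name": "Filter/Transform", "backpressured": "HIGH", "busy": "LOW"},
    {"name": "Sink (Snowflake)", "backpressured": "LOW",  "busy": "HIGH"},
]
find_bottleneck(chain)  # "Sink (Snowflake)"
```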
Common Flink Bottleneck Patterns
- Sink operator busy - Destination cannot absorb writes fast enough. Increase sink parallelism, batch writes, or scale the destination.
- Window operator busy - State access is slow, often due to RocksDB compaction or large state size. Tune RocksDB settings or reduce window size.
- Deserialization operator busy - Schema resolution or format parsing is expensive. Cache schemas, use a faster serialization format.
Kafka-Level Backpressure
While Kafka itself does not enforce backpressure on producers, managing consumer-side throughput is critical.
Consumer Lag Growth
Sustained lag growth means your consumer’s processing rate is below the production rate:
Production rate: 10,000 records/sec
Consumption rate: 7,000 records/sec
Lag growth: 3,000 records/sec
Consumer time-lag growth: 0.3 seconds per second (consuming at 70% of the production rate)
Time until a 24-hour retention period expires unread records: 24 h / 0.3 ≈ 80 hours
If lag exceeds your topic’s retention period, records are deleted before being consumed. Data is permanently lost.
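A back-of-envelope sketch of when that loss begins, using the example rates: the consumer's time-lag grows at (1 - consume/produce) seconds per second, and records start expiring once the time-lag reaches the retention period.

```python
def hours_until_data_loss(produce_rate, consume_rate, retention_hours):
    if consume_rate >= produce_rate:
        return float("inf")  # the consumer keeps up; lag never reaches retention
    time_lag_growth = 1 - consume_rate / produce_rate  # seconds of lag per second
    return retention_hours / time_lag_growth

hours_until_data_loss(10_000, 7_000, retention_hours=24)  # ~80 hours
```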
Partition Imbalance
Check per-partition lag. If one partition has significantly higher lag than others, you likely have a hot key problem:
kafka-consumer-groups.sh --describe --group my-group
# Output:
# TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
# events 0 1000000 1000500 500
# events 1 1000000 1000200 200
# events 2 1000000 1005000 5000 ← hot partition
Solutions include repartitioning with a different key or adding a pre-processing step to distribute hot keys across partitions.
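A quick way to flag hot partitions from per-partition lag is to compare each partition against the group median. The multiplier here is an assumption to tune for your workload:

```python
def hot_partitions(lag_by_partition, factor=5):
    """Flag partitions whose lag exceeds factor x the median partition lag."""
    lags = sorted(lag_by_partition.values())
    median = lags[len(lags) // 2]
    return [p for p, lag in lag_by_partition.items()
            if median > 0 and lag > factor * median]

# The per-partition lags from the example output above:
hot_partitions({0: 500, 1: 200, 2: 5_000})  # [2]
```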
Fetch Configuration Tuning
Consumer fetch settings directly affect throughput:
# Increase max bytes per fetch to reduce round trips
fetch.max.bytes=52428800
max.partition.fetch.bytes=10485760
# Increase max poll records for batch efficiency
max.poll.records=1000
# Reduce fetch wait time for lower latency
fetch.max.wait.ms=100
Larger fetch sizes improve throughput at the cost of memory. Smaller wait times improve latency at the cost of more network requests.
Handling Strategies
Scale Out the Bottleneck
If the sink is the bottleneck, increase its parallelism. In Flink, set the sink operator’s parallelism independently:
stream
    .map(new TransformFunction())
    .setParallelism(4)
    .addSink(new SnowflakeSink())
    .setParallelism(8); // double the sink parallelism
For Kafka consumers, increase the number of partitions and consumer instances (consumer parallelism cannot exceed partition count).
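A rough sizing sketch for either case: divide peak throughput by the per-task throughput you have measured, with some headroom. Both inputs are numbers you must measure, not defaults:

```python
import math

def required_parallelism(peak_rate, per_task_rate, headroom=1.2):
    """Tasks needed to absorb peak throughput, with 20% headroom by default."""
    return math.ceil(peak_rate * headroom / per_task_rate)

# e.g. 20,000 records/sec peak against a sink task measured at 3,000 records/sec:
required_parallelism(peak_rate=20_000, per_task_rate=3_000)  # 8
```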
Optimize Transformations
Profile your processing logic. Common wins include:
- Cache external lookups - If you enrich records by calling an API, cache results with a TTL.
- Reduce serialization - Avoid repeated serialize/deserialize cycles between operators.
- Pre-aggregate - If a downstream system is slow, reduce the volume it receives by aggregating upstream.
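The first tactic, caching external lookups with a TTL, can be sketched like this; cached_lookup and its fetch parameter are hypothetical names standing in for your enrichment call:

```python
import time

_cache = {}  # key -> (cached_at, value)

def cached_lookup(key, fetch, ttl_seconds=300):
    """Return a cached enrichment result if it is still fresh; otherwise call fetch."""
    now = time.time()
    hit = _cache.get(key)
    if hit is not None and now - hit[0] < ttl_seconds:
        return hit[1]            # fresh hit: no remote call
    value = fetch(key)           # miss or expired: one remote call, then cache
    _cache[key] = (now, value)
    return value
```

With a hot key, repeated records hit the cache instead of the external system, and the TTL bounds how stale the enrichment can get.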
Batch Writes to the Destination
Many destinations perform significantly better with batched writes:
# Instead of single-row inserts:
for record in batch:
    execute("INSERT INTO table VALUES (?)", record)
# Use bulk operations:
execute_batch("INSERT INTO table VALUES (?)", batch, batch_size=5000)
Snowflake, BigQuery, and Elasticsearch all show 10-50x throughput improvements with properly sized batches.
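In a streaming sink you rarely receive a ready-made batch, so a common pattern is a buffer that flushes on size or age, whichever comes first. A sketch, where flush_fn stands in for your bulk write (execute_batch, an Elasticsearch _bulk request, and so on):

```python
import time

class WriteBuffer:
    """Accumulate records and flush on batch size or age, whichever comes first."""

    def __init__(self, flush_fn, max_records=5_000, max_age_seconds=5.0):
        self.flush_fn = flush_fn
        self.max_records = max_records
        self.max_age = max_age_seconds
        self.records = []
        self.oldest = None

    def add(self, record):
        if not self.records:
            self.oldest = time.time()  # age of the batch = age of its first record
        self.records.append(record)
        if (len(self.records) >= self.max_records
                or time.time() - self.oldest >= self.max_age):
            self.flush()

    def flush(self):
        if self.records:
            self.flush_fn(self.records)  # one bulk write instead of N single inserts
            self.records = []
```

The age limit matters as much as the size limit: without it, a quiet period leaves the last partial batch stranded in memory.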
Add Buffering
Use Kafka as an explicit buffer between pipeline stages. If your transformation is fast but your sink is slow, write transformed records to an intermediate topic and let the sink consumer read at its own pace. This isolates the slow component from the rest of the pipeline.
Practical Example: CDC Pipeline with a Destination Bottleneck
Consider a common scenario: a PostgreSQL source with CDC streaming changes through Kafka to a Snowflake destination.
During normal operation, the pipeline handles 2,000 changes per second with sub-10-second end-to-end latency. Then a batch job at the source performs a bulk update affecting 5 million rows.
- Kafka absorbs the initial burst - The producer writes all 5 million change events to Kafka within minutes. Kafka’s retention provides a buffer.
- Snowflake cannot keep up - The warehouse processes bulk inserts at 3,000 rows per second, but changes are arriving at 20,000 per second. Consumer lag begins growing.
- Flink shows backpressure - The sink operator shows HIGH busy time. All upstream operators show HIGH backpressure.
- Resolution - Increase Snowflake warehouse size from X-Small to Medium for the duration of the backfill, increase sink batch size from 1,000 to 10,000 rows, and add a second sink task. Throughput rises to 25,000 rows per second. Lag drains within 15 minutes.
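The drain time here is predictable arithmetic. A sketch with the example's numbers, assuming roughly 4.25 million events are backlogged by the end of the burst (5 million arriving at 20,000/sec while the sink drains 3,000/sec):

```python
def drain_minutes(backlog, sink_rate, arrival_rate):
    if sink_rate <= arrival_rate:
        return float("inf")  # lag keeps growing; the scale-up was insufficient
    return backlog / (sink_rate - arrival_rate) / 60

# After scaling: sink at 25,000/sec, steady arrivals back at ~2,000/sec:
drain_minutes(4_250_000, sink_rate=25_000, arrival_rate=2_000)  # ~3 minutes
```

The remainder of the quoted 15 minutes is the time to detect the problem and resize the warehouse, not the drain itself.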
Streamkap’s managed CDC platform handles this scenario by monitoring consumer lag and sink throughput, automatically adjusting batch sizes and scaling processing capacity when it detects sustained backpressure.
Preventing Backpressure
Capacity Planning
Know your steady-state and peak throughput requirements. Size your destination for peak, not average. A warehouse that handles normal load comfortably may buckle during end-of-day batch processing or promotional events.
Load Testing
Simulate peak volumes before they hit production:
# Generate synthetic load to test pipeline capacity
kafka-producer-perf-test.sh \
--topic test-topic \
--num-records 1000000 \
--record-size 1024 \
--throughput 50000 \
--producer-props bootstrap.servers=localhost:9092
Measure end-to-end latency and consumer lag under sustained load. If lag grows, you know your ceiling.
Auto-Scaling
Configure your pipeline to scale horizontally in response to lag growth. This means having enough Kafka partitions to allow additional consumers, using auto-scaling sink infrastructure (serverless warehouses, elastic clusters), and monitoring lag thresholds that trigger scale-up actions.
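A sketch of the lag check such a trigger might run on a schedule; the threshold, and the idea of requiring several consecutive breaches so a transient spike does not trigger a scale-up, are assumptions to tune:

```python
def should_scale_up(lag_samples, threshold, sustained_checks=3):
    """True only when lag has stayed above threshold for the last N checks."""
    recent = lag_samples[-sustained_checks:]
    return len(recent) == sustained_checks and all(l > threshold for l in recent)

should_scale_up([10_000, 90_000, 120_000, 150_000], threshold=50_000)  # True
should_scale_up([10_000, 90_000, 4_000], threshold=50_000)             # False
```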
When Backpressure Is Acceptable
Not all backpressure requires intervention. It is a healthy response in several situations:
- Transient spikes - A burst of activity that resolves within minutes. Kafka’s buffering absorbs the spike, lag drains naturally, and no action is needed.
- Planned backfills - Initial data loads or historical replays will naturally cause temporary backpressure. As long as progress is steady and you are not at risk of exceeding retention, let it run.
- Graceful degradation - During a partial outage (destination maintenance, network issues), backpressure prevents data loss by slowing the pipeline rather than failing. Once the issue resolves, the pipeline catches up from the Kafka buffer.
The threshold for action depends on your latency requirements. If your SLA allows 5 minutes of latency and a spike causes 2 minutes, monitoring is sufficient. If your use case demands sub-second delivery, any sustained backpressure requires immediate response.
The key is distinguishing between backpressure as a temporary safety mechanism and backpressure as a symptom of a fundamental capacity mismatch. The former resolves on its own. The latter requires architectural changes - more parallelism, faster destinations, or reduced data volume through filtering and aggregation upstream.