
Engineering

February 25, 2026

11 min read

Stream Processing Anti-Patterns: 10 Mistakes to Avoid

Common stream processing mistakes that cause production outages, data loss, and performance problems. Learn what not to do with Kafka, Flink, and real-time pipelines.

TL;DR:

  • Unbounded state from missing time bounds on joins and aggregations is the number one production killer.
  • Treating streams like batch by buffering everything defeats the purpose of stream processing.
  • Ignoring backpressure signals leads to cascading failures across the pipeline.
  • Skipping schema evolution planning guarantees painful breaking changes down the road.

I have spent the better part of a decade building, breaking, and fixing stream processing pipelines. The patterns that cause failures in production are remarkably consistent. Different teams, different tech stacks, different industries - but the same mistakes show up again and again.

This article covers the ten anti-patterns I have seen cause the most damage. Some of these will seem obvious in hindsight. That is the nature of production incidents - they always look obvious after the postmortem. The goal here is to help you spot them before they wake you up at 3 AM.

1. Unbounded State Growth

This is the single most common killer of stream processing jobs. It happens when you write a stateful operation - a join, an aggregation, a deduplication window - without setting time bounds on how long state is retained.

What it looks like in practice: Your Flink or Kafka Streams job runs perfectly for days, maybe even weeks. Checkpoint sizes creep up slowly. Then one morning, the job crashes with an OutOfMemoryError or checkpoints start timing out. You restart it, and it crashes faster because it tries to restore all that accumulated state.

Why it happens: It is easy to write a join between two streams and forget to specify a time window. The framework dutifully holds onto every event from both sides, waiting for a match that might never come. For aggregations, developers often build running totals without considering that the key space grows forever.

How to fix it:

  • Always define time bounds on joins. In Flink SQL, use INTERVAL clauses. In the DataStream API, use windowed joins or set state TTL.
  • Configure state TTL (time-to-live) as a safety net. Even if your logic should clean up state, TTL catches the cases you missed.
  • Monitor checkpoint sizes over time. A steadily growing checkpoint is a ticking time bomb.
  • Use RocksDB state backend for large state so you are spilling to disk instead of holding everything in the JVM heap.

-- Bad: unbounded join, state grows forever
SELECT * FROM orders o
JOIN payments p ON o.order_id = p.order_id;

-- Good: time-bounded join, state is bounded
SELECT * FROM orders o
JOIN payments p ON o.order_id = p.order_id
  AND p.event_time BETWEEN o.event_time
  AND o.event_time + INTERVAL '1' HOUR;
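The safety-net TTL works on a simple principle: every state entry remembers when it was last written, and anything older than the deadline is treated as gone. In Flink you configure this with StateTtlConfig on a state descriptor; the stdlib sketch below illustrates the mechanic only (the class and method names here are mine, not Flink's API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative TTL-bounded keyed state: entries expire after ttlMillis,
// so state cannot grow without bound for keys that never see a match.
public class TtlState<K, V> {
    private final long ttlMillis;
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Long> writeTimes = new HashMap<>();

    public TtlState(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    public void put(K key, V value, long nowMillis) {
        values.put(key, value);
        writeTimes.put(key, nowMillis);
    }

    // Expired entries are cleaned up lazily on read, mirroring the
    // read-time visibility check a TTL-enabled state backend performs.
    public V get(K key, long nowMillis) {
        Long written = writeTimes.get(key);
        if (written == null) return null;
        if (nowMillis - written > ttlMillis) {
            values.remove(key);
            writeTimes.remove(key);
            return null;
        }
        return values.get(key);
    }
}
```

The point of the sketch: even if your join logic forgets a cleanup path, an entry written once and never matched disappears after the TTL instead of living in the checkpoint forever.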

2. Treating Streams Like Batch

This anti-pattern shows up when teams migrate from batch ETL to streaming but bring their batch mindset with them. They collect events into a buffer, wait until they have “enough,” then process them all at once.

What it looks like: Your streaming job has a massive tumbling window - say 15 minutes or an hour - that collects everything and then fires once. Or worse, there is a custom accumulator that holds events in memory until some arbitrary threshold is hit before flushing them downstream.

Why it is a problem: You have built a batch job with extra steps. You are paying for always-on compute but only getting periodic results. You also create a single point of failure: if the job crashes mid-window, you lose everything that was buffered. And your downstream consumers get data in bursts instead of a steady flow, which causes its own problems with resource spikes.

How to fix it:

  • Process events individually or in small micro-batches (seconds, not minutes).
  • If you need aggregations over time, use sliding or session windows and emit partial results with early firing triggers.
  • When loading into a data warehouse, use small frequent batches rather than one large dump. Platforms like Streamkap handle this for you by continuously streaming CDC data to your destination with low latency, eliminating the temptation to buffer.
  • If you catch yourself saying “we need to wait for all the data,” ask whether you really do, or whether incremental updates would work.
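The "small frequent batches" advice usually boils down to flushing on whichever bound is hit first: a small record count or a short time limit. A minimal sketch of that pattern (the class name and the 3-record/1-second bounds are illustrative, not from any framework):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative micro-batcher: flushes on a small count OR a short time
// bound, whichever comes first, so downstream sees a steady flow of
// small writes instead of one giant periodic dump.
public class MicroBatcher<T> {
    private final int maxBatch;
    private final long maxWaitMillis;
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();
    private long batchStart = -1;

    public MicroBatcher(int maxBatch, long maxWaitMillis, Consumer<List<T>> sink) {
        this.maxBatch = maxBatch;
        this.maxWaitMillis = maxWaitMillis;
        this.sink = sink;
    }

    public void add(T event, long nowMillis) {
        if (buffer.isEmpty()) batchStart = nowMillis;
        buffer.add(event);
        // Flush on count or age; a crash loses at most one small batch.
        if (buffer.size() >= maxBatch || nowMillis - batchStart >= maxWaitMillis) {
            sink.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

Compare the blast radius: crash a 15-minute window and you lose 15 minutes of buffered work; crash this and you lose at most a second's worth.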

3. Ignoring Backpressure

Backpressure is the mechanism by which a slow consumer tells upstream producers to slow down. Ignoring it is like ignoring the oil pressure light on your car dashboard. Things work fine until they do not, and when they fail, they fail catastrophically.

What it looks like: Your Kafka consumer lag grows steadily. Your Flink job’s backpressure metrics show operators at 100% utilization, but the source keeps reading at full speed because there is an async boundary in between. Eventually, internal buffers fill up, memory is exhausted, and the job dies or starts dropping events.

Why it is dangerous: Backpressure signals exist for a reason. When a sink cannot keep up - maybe your database is under load, or an HTTP endpoint is slow - the correct behavior is to slow down the entire pipeline. If you break this signal by, say, dumping events into an unbounded in-memory queue, you are trading a slow pipeline for a dead pipeline.

How to fix it:

  • In Flink, backpressure propagates automatically through the network buffers. Do not fight it. If you see persistent backpressure, fix the bottleneck instead of adding buffering in front of it.
  • Use bounded queues for any async operations. When the queue is full, block the producer instead of growing the queue.
  • Set sensible timeout and retry policies on external calls. Infinite retries with no backoff will grind everything to a halt.
  • Monitor consumer lag in Kafka as an early warning. If lag is growing, you need more consumer instances, faster processing, or a conversation with whoever owns the slow downstream system.
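The bounded-queue advice above is worth seeing concretely. Java's ArrayBlockingQueue gives you both behaviors out of the box: a blocking put that propagates backpressure, and a non-blocking offer that returns false so the caller can pause its source. A small sketch (the wrapper class is illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Bounded hand-off to an async worker. When the queue fills up, the
// producer is forced to slow down instead of memory growing until the
// process dies.
public class BoundedHandoff {
    private final BlockingQueue<String> queue;

    public BoundedHandoff(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Blocking submit: backpressure propagates to the caller by default.
    public void submitBlocking(String event) throws InterruptedException {
        queue.put(event);
    }

    // Non-blocking submit: returns false when full so the caller can
    // pause its source rather than buffer in unbounded memory.
    public boolean trySubmit(String event) {
        return queue.offer(event);
    }

    public String take() throws InterruptedException {
        return queue.take();
    }
}
```

The anti-pattern version of this is a plain LinkedList or an unbounded LinkedBlockingQueue: same code shape, but the "full" signal never fires and the failure mode becomes an OutOfMemoryError instead of a slow pipeline.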

4. No Schema Evolution Plan

Schemas change. Fields get added, renamed, deprecated. Data types change. If you have not planned for schema evolution, your first schema change will be your first production incident.

What it looks like: Someone adds a new field to a Kafka topic. Half the consumers crash because they are using strict deserialization that rejects unknown fields. Or a field changes from int to long, and downstream aggregations silently produce wrong results because of type coercion quirks.

Why it happens: In batch, schema changes are painful but manageable - you update the schema, reprocess the data, and move on. In streaming, you have a continuous flow of data in the old format mixed with data in the new format. Producers and consumers deploy at different times. You cannot just “reprocess everything” without significant effort.

How to fix it:

  • Use a schema registry (Confluent Schema Registry, AWS Glue, etc.) from day one. It enforces compatibility rules and tracks versions.
  • Default to backward-compatible changes: add optional fields, do not remove or rename existing ones.
  • Write your processing logic to tolerate missing fields and unknown fields. Never assume the schema is exactly what you expect.
  • Test schema changes in staging before production. Deploy consumers first (they can handle the new schema), then deploy producers.
  • For breaking changes, use a new topic rather than trying to migrate in place.
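Tolerating missing and unknown fields is mostly a matter of how you read events. With Jackson, for example, disabling FAIL_ON_UNKNOWN_PROPERTIES is the standard knob for unknown fields. The stdlib sketch below shows the same discipline without any library (the class and field names are illustrative): required fields fail fast with context, optional fields fall back to defaults, and extra fields are simply ignored.

```java
import java.util.Map;

// Tolerant event reader: unknown fields are ignored, missing optional
// fields fall back to defaults, and only truly required fields fail fast.
public class TolerantEvent {
    private final Map<String, String> fields;

    public TolerantEvent(Map<String, String> fields) {
        this.fields = fields;
    }

    // Required field: fail loudly with context, not with an NPE later.
    public String require(String name) {
        String v = fields.get(name);
        if (v == null) {
            throw new IllegalArgumentException("missing required field: " + name);
        }
        return v;
    }

    // Optional field: schema evolution adds fields over time, and events
    // written by older producers will not carry them.
    public String optional(String name, String defaultValue) {
        return fields.getOrDefault(name, defaultValue);
    }
}
```

This is exactly the consumer behavior that makes the "deploy consumers first" rollout order safe: old events without the new field still parse, and new events with extra fields do not crash anything.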

5. Tight Coupling Between Producer and Consumer

When the producer and consumer are developed by the same team, it is tempting to treat the stream like a function call - the producer knows exactly what the consumer expects, and vice versa. This works until a second consumer appears, or the teams split, or you need to replay historical data.

What it looks like: The consumer relies on events arriving in a specific order from a specific producer. The event format is defined in a shared code library rather than a schema. Adding a new consumer requires changes to the producer. Replaying data from an earlier offset breaks the consumer because it depends on the producer’s current behavior.

Why it is a problem: Tight coupling negates one of the primary benefits of event streaming - decoupling producers from consumers so they can evolve independently. When you couple them, you create a distributed monolith that is harder to operate than a regular monolith.

How to fix it:

  • Define events as contracts, not implementation details. Use schemas (Avro, Protobuf, JSON Schema) that are versioned independently of the code.
  • Design events to be self-contained. A consumer should be able to process an event without knowing what other events the producer has sent recently.
  • Avoid relying on event ordering across partitions. If you need ordering, partition by the relevant key so that related events land in the same partition.
  • Test consumers against recorded event streams, not live producers. This forces your consumer to work with the data as-is rather than relying on assumptions about the producer.
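The "partition by the relevant key" advice works because partition assignment is a pure function of the key: same key, same partition, every time. Kafka's default partitioner uses murmur2 over the serialized key; the sketch below uses hashCode instead, so treat it as an illustration of the idea rather than Kafka's exact algorithm:

```java
// Illustrative keyed partitioning: events with the same key always land
// in the same partition, preserving per-key ordering without any
// cross-partition guarantees. (Kafka's default partitioner hashes the
// serialized key with murmur2, not hashCode; this is a sketch.)
public class KeyedPartitioner {
    public static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the result is always non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

The corollary is the coupling trap in reverse: a consumer can rely on per-key ordering within a partition, but any assumption about ordering across keys or across partitions is an assumption about producer behavior that will eventually break.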

6. Not Testing with Realistic Data Volumes

Your stream processing job works great in dev with 100 events per second. Production runs at 50,000 events per second with occasional spikes to 200,000. These are different universes.

What it looks like: The job passes all unit and integration tests. It runs fine in staging with a trickle of synthetic data. On launch day, it immediately falls behind. Checkpoints fail because state is too large. The sink cannot handle the write throughput. Serialization overhead that was invisible at low volumes now dominates CPU usage.

Why it happens: Realistic load testing for streaming is hard. You need sustained throughput over hours, not just a burst. You need realistic key distributions (some keys are hot, most are cold). You need realistic event sizes and shapes. Most teams skip this because it is expensive and time-consuming to set up.

How to fix it:

  • Build a load testing environment that mirrors production topology: same number of partitions, same parallelism, same sink configuration.
  • Use production data (anonymized if necessary) for load tests, not synthetic data. Synthetic data tends to have uniform key distribution, which hides hot-key problems.
  • Run load tests for hours, not minutes. Many streaming problems only manifest after state has accumulated or caches have filled up.
  • Test your failure scenarios under load: What happens when a TaskManager dies? What happens when the sink goes down for 5 minutes? How long does recovery take?
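If you cannot use anonymized production data, at least skew your synthetic keys. A minimal load-test key generator with a hot head and a cold tail might look like this (the 90/10 split and key counts are illustrative assumptions, not a recommendation):

```java
import java.util.Random;

// Skewed key generator for load tests: a handful of hot keys receive
// most of the traffic, mimicking the production distributions that
// uniform synthetic data hides.
public class SkewedKeys {
    private final Random random;

    public SkewedKeys(long seed) {
        this.random = new Random(seed);
    }

    public String nextKey() {
        // 90% of events hit one of 10 hot keys; 10% hit a long cold tail.
        if (random.nextDouble() < 0.9) {
            return "hot-" + random.nextInt(10);
        }
        return "cold-" + random.nextInt(1_000_000);
    }
}
```

Run your job against this distribution and the hot-partition, hot-state-key, and skewed-checkpoint problems that uniform data hides tend to show up within hours instead of on launch day.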

7. Ignoring Event Time in Favor of Processing Time

Processing time is when your system sees the event. Event time is when the event actually happened. Using processing time for anything time-sensitive is a recipe for incorrect results.

What it looks like: Your hourly revenue dashboard shows a spike every time the pipeline recovers from a brief outage. That is because all the events that arrived during the outage get processed in a burst, and since you are using processing time, they all land in the “current” window instead of the windows where they belong.

Why it matters: Events arrive late. Pipelines restart. Consumers fall behind and catch up. In all of these cases, processing time diverges from reality. If you are computing aggregations, joining streams, or doing any kind of time-based reasoning, processing time will give you wrong answers that look right most of the time - which is worse than wrong answers that look wrong.

How to fix it:

  • Use event time for all time-based operations. In older Flink releases this meant calling setStreamTimeCharacteristic(TimeCharacteristic.EventTime); since Flink 1.12, event time is the default. Either way, the switch itself is trivial but has deep implications.
  • Implement watermark strategies that match your data's lateness profile. Start with WatermarkStrategy.forBoundedOutOfOrderness (the successor to the deprecated BoundedOutOfOrdernessTimestampExtractor) and tune the allowed out-of-orderness based on observed data.
  • Handle late events explicitly. Decide up front: do you drop them, route them to a side output, or update previously emitted results?
  • Include a timestamp in every event at the source. If the event does not carry its own timestamp, you have already lost. Services like Streamkap that capture changes directly from database transaction logs preserve the original event timestamps, which gives you accurate event-time processing downstream.
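The dashboard-spike example comes down to one line of arithmetic: which window an event belongs to. Under event time, the window is derived from the event's own timestamp, so a replayed backlog lands in the right historical buckets instead of piling into "now". A sketch of tumbling-window assignment (the class name is mine; the formula is the standard aligned-window calculation):

```java
// Event-time window assignment: an event lands in the window its own
// timestamp belongs to, regardless of when the pipeline processes it.
// This is why replaying a backlog does not spike the current window.
public class EventTimeWindows {
    // Start of the aligned tumbling window containing eventTimeMillis.
    public static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return eventTimeMillis - (eventTimeMillis % windowSizeMillis);
    }
}
```

Swap eventTimeMillis for System.currentTimeMillis() and you have the processing-time version, which assigns every event processed during a catch-up burst to the same current window. That substitution is the whole anti-pattern.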

8. Over-Partitioning or Under-Partitioning

Partition count in Kafka (or parallelism in Flink) determines how much you can scale. Get it wrong in either direction and you pay the price.

What it looks like with too few partitions: You cannot add more consumer instances because the number of consumers is limited by the number of partitions. One slow partition holds up the entire consumer group. You hit per-partition throughput limits.

What it looks like with too many partitions: Broker metadata overhead grows. Leader elections take forever during rolling restarts. End-to-end latency increases because each partition has its own buffer that needs to fill before flushing. You have thousands of partitions but most are nearly empty.

How to fix it:

  • Start with a partition count based on your target throughput divided by the throughput you can achieve per partition. For Kafka, a single partition can typically handle 10-50 MB/s depending on message size and replication.
  • Remember that you can increase partition count later, but you cannot decrease it without creating a new topic.
  • For Flink parallelism, match the number of source operators to the number of Kafka partitions. Having more parallel source instances than partitions just wastes resources.
  • Watch for hot keys. If one partition gets disproportionate traffic, rethink your partitioning strategy rather than adding more partitions.
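The sizing rule from the first bullet is simple enough to write down. A sketch of the arithmetic, with an explicit headroom factor for spikes and growth (the numbers in the test are illustrative, not benchmarks):

```java
// Partition sizing sketch: target throughput divided by a conservative
// per-partition throughput, with headroom for spikes and future growth.
public class PartitionSizing {
    public static int partitionsFor(double targetMBps,
                                    double perPartitionMBps,
                                    double headroomFactor) {
        return (int) Math.ceil(targetMBps / perPartitionMBps * headroomFactor);
    }
}
```

Because you can add partitions later but never remove them, it pays to be honest about the per-partition figure: measure it on your hardware with your message sizes rather than trusting a rule of thumb.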

9. No Dead Letter Queue for Bad Data

In batch processing, a bad record fails the job, you fix it, and rerun. In streaming, a bad record can block an entire partition forever if you do not handle it. A dead letter queue (DLQ) is not optional - it is a survival mechanism.

What it looks like without a DLQ: A single malformed JSON event causes your deserializer to throw an exception. The consumer retries, hits the same event, throws again. The consumer is stuck in a crash loop. That partition stops making progress. Meanwhile, good events pile up behind the bad one, and lag grows until someone notices.

Why it happens: Teams often skip error handling during initial development because the data is clean in testing. Then production introduces events with null fields, unexpected types, encoding issues, or truncated payloads.

How to fix it:

  • Route events that fail deserialization or processing to a dead letter topic. Include the original event bytes, the error message, the stack trace, and a timestamp.
  • Set up alerts on the DLQ. If events are landing there, you need to know about it quickly.
  • Build tooling to replay events from the DLQ after fixing the bug. This is your recovery path.
  • Decide on a policy: how many DLQ events per minute is acceptable before you consider the pipeline unhealthy? One bad event per hour is noise. A thousand per minute means something is fundamentally broken.

// Simple DLQ pattern in Kafka Streams: parse failures are published to a
// dead letter topic with the original payload preserved, then dropped
.mapValues(value -> {
    try {
        return parse(value);
    } catch (Exception e) {
        // Keep the original bytes as the record value and carry the error
        // message in a header, so replay tooling can reread the payload as-is
        ProducerRecord<String, String> dlqRecord =
            new ProducerRecord<>("my-app.dlq", value);
        dlqRecord.headers().add("error",
            String.valueOf(e.getMessage()).getBytes(StandardCharsets.UTF_8));
        dlqProducer.send(dlqRecord);
        return null; // filtered out below
    }
})
.filter((key, value) -> value != null)

10. Skipping Monitoring and Alerting

Building a streaming pipeline without monitoring is like driving at night with the headlights off. You will find out about problems when you hit them, which in production means when your users or your boss tells you.

What it looks like: The pipeline has been silently dropping 5% of events for three days because a sink timeout was misconfigured. Nobody noticed because there are no alerts on throughput, no end-to-end data quality checks, and the only monitoring is a Grafana dashboard that nobody looks at.

What you need to monitor:

  • Consumer lag: The most basic streaming metric. If lag is growing, your pipeline is falling behind.
  • Throughput (in and out): Compare events entering the pipeline to events leaving it. A mismatch means data is being lost or duplicated.
  • Error rates: Deserialization failures, processing exceptions, sink write failures. All of these should be tracked and alerted on.
  • Checkpoint duration and size (Flink): Growing checkpoint times indicate growing state or I/O bottlenecks.
  • End-to-end latency: Measure the time from when an event is produced to when it is fully processed and written to the sink.

How to fix it:

  • Instrument your pipeline from day one. Use Prometheus, Datadog, or whatever your team already uses. Do not build your own metrics system.
  • Set alerts on consumer lag, error rate spikes, and throughput drops. These three alone will catch the majority of production issues.
  • Build end-to-end health checks that send a synthetic event through the entire pipeline and verify it arrives at the destination. If your pipeline is managed by a platform like Streamkap, you get built-in monitoring and alerting on pipeline health, lag, and throughput without having to wire it all up yourself.
  • Review your dashboards weekly. Trends matter more than point-in-time values. A slowly growing checkpoint size is not an emergency today, but it will be next month.
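The throughput in/out comparison deserves one concrete formulation, because it is the check that would have caught the silent 5% drop in the example above. Events entering the pipeline should equal events leaving it plus events parked in the DLQ; anything else is loss or duplication. A sketch (the class name and tolerance are illustrative):

```java
// End-to-end conservation check: in = out + DLQ, within tolerance.
// A persistent mismatch means silent loss (positive) or duplication
// (negative), both of which warrant an alert.
public class ThroughputCheck {
    public static boolean healthy(long eventsIn, long eventsOut,
                                  long dlqEvents, double tolerance) {
        if (eventsIn == 0) return true;
        long accounted = eventsOut + dlqEvents;
        double lossRate = (double) (eventsIn - accounted) / eventsIn;
        return Math.abs(lossRate) <= tolerance;
    }
}
```

Run it over aligned time windows (counts from the same hour at source and sink) rather than raw lifetime counters, otherwise in-flight events will make the check permanently noisy.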

The Pattern Behind the Patterns

If you look at these ten anti-patterns, a theme emerges: most of them come from applying batch thinking to a streaming world. Unbounded state, buffering everything, ignoring time semantics, skipping error handling for individual records - these are all habits that work in batch and break in streaming.

The shift to stream processing is not just a technology change. It is a mindset change. You have to think about unbounded datasets, continuous operation, and graceful degradation. You have to plan for the fact that your job will run for months without a restart, that data will arrive late, and that schemas will change under you.

Start by addressing the anti-patterns that are easiest to detect: add monitoring and alerting, set up a dead letter queue, and put time bounds on your stateful operations. These three changes alone will prevent the majority of streaming production incidents I have seen. Then work through the rest as you mature your pipeline. The investment pays off every time you sleep through the night instead of debugging a 3 AM outage.