Stream Processing for Data Engineers: What You Need to Know
A practical guide to stream processing for data engineers moving from batch to real-time. Covers the mental model shift, key concepts like event time, watermarks, windows, and state, and when streaming actually beats batch.
Most data engineers learn their craft in a batch-first world. You write a SQL query, schedule it to run every hour (or every night), and the results land in a table that analysts query the next morning. This model works well for many use cases, but it falls apart when the business needs answers in seconds rather than hours. Stream processing fills that gap, and understanding it is becoming a required skill for any data engineer working with modern data stacks.
This guide covers the mental model shift from batch to streaming, breaks down the key concepts you need to know, and helps you figure out when streaming is actually the right tool for the job.
The Mental Model Shift
In batch processing, you think in terms of bounded datasets. A table has a known number of rows. A file has a beginning and an end. You read the whole thing, transform it, and write the output. The input is finite, and your job finishes.
In stream processing, the input is unbounded. Records keep arriving, and your job never finishes. There is no “read the whole thing” because the whole thing does not exist yet. Instead, you process each record (or small group of records) as it arrives and produce output incrementally.
This distinction sounds simple, but it changes how you think about nearly everything:
- Completeness. In batch, you know you have all the data before you process it. In streaming, you never have all the data. You have to decide when you have “enough” data to produce a result.
- Ordering. In batch, you can sort your input any way you want. In streaming, data arrives in whatever order the network and source systems produce, which is often not the order it was generated.
- Reprocessing. In batch, rerunning a job is trivial. Point it at the same input files and run it again. In streaming, reprocessing means replaying data from a log (like Kafka) and managing the state that your job has accumulated.
- Failure recovery. A batch job that fails can be restarted from scratch. A streaming job that fails needs to resume from exactly where it left off, with its in-memory state intact, or you risk duplicate or missing output.
Once you internalize this shift, the concepts that follow will make a lot more sense.
Event Time vs. Processing Time
This is the single most important concept in stream processing, and getting it wrong leads to silently incorrect results.
Processing time is the wall-clock time when your stream processor handles a record. It is simple to work with because it is always available and always moving forward. But it is unreliable. Network delays, consumer restarts, and backpressure all cause records to be processed at a different time than when they were created.
Event time is the timestamp embedded in the record itself, representing when the event actually happened at the source. An order placed at 14:32:07 has that timestamp baked in, regardless of whether your stream processor handles it at 14:32:08 or 14:35:00.
Consider a simple example. You want to count the number of orders per minute. If you use processing time, a consumer restart that causes a 3-minute backlog will suddenly dump thousands of records into the “current” minute, producing a massive spike in your count that never actually happened. If you use event time, those records are attributed to the minute they were actually placed, and your counts stay correct.
The rule is straightforward: use event time whenever your source data has a reliable timestamp. Processing time is only appropriate when you genuinely do not care when the event occurred, only when you saw it.
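The order-counting example above is easy to demonstrate with a toy sketch (timestamps and the `minute_bucket` helper are invented for illustration): the same three orders produce very different per-minute counts depending on which clock you bucket by.

```python
from collections import Counter

def minute_bucket(ts: float) -> int:
    """Truncate a Unix-style timestamp to the start of its minute."""
    return int(ts // 60) * 60

# Three orders, each tagged with the time it was created at the source.
# Pretend a consumer restart delayed processing: the whole backlog is
# drained at wall-clock time t=1200.
orders = [1000.0, 1010.0, 1130.0]  # event times
processing_now = 1200.0            # wall-clock time when each is handled

event_time_counts = Counter(minute_bucket(t) for t in orders)
processing_time_counts = Counter(minute_bucket(processing_now) for _ in orders)

# Event time spreads the orders across the minutes they really happened;
# processing time dumps all three into the "current" minute as a phantom spike.
```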
Watermarks: Deciding When Data Is “Complete Enough”
If data always arrived in order and on time, you would not need watermarks. But in the real world, data arrives late. A mobile app might buffer events and send them in a burst when the device reconnects. A microservice might retry a failed event minutes after the original attempt.
A watermark is a signal that says, “I believe I have received all events with an event time earlier than this point.” It is a heuristic, not a guarantee. When the watermark advances past time T, the stream processor assumes it will not see any more events with an event time before T and can finalize any computations that depend on data up to time T.
In practice, watermarks are generated with a configured delay, often called bounded out-of-orderness. If you set the delay to 30 seconds, the watermark trails the maximum event time seen so far by 30 seconds, so a window is not finalized until 30 seconds of event time have passed beyond its end. Late arrivals within that buffer are included in the computation. Arrivals that show up after the watermark has passed are either dropped or sent to a side output for separate handling. (Some frameworks, such as Flink, also offer a separate allowed lateness setting on windows, which keeps window state around so that genuinely late records can trigger updated results.)
Choosing the right lateness threshold is an engineering tradeoff:
- Too short and you drop legitimate late data, producing incomplete results.
- Too long and you hold results in memory longer, increasing latency and resource usage.
For most operational use cases, a lateness of 10 to 60 seconds strikes a good balance. For analytics where precision matters more than speed, you might extend this to several minutes. The key is to measure how late your data actually arrives and set the threshold based on observed behavior, not guesswork.
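The mechanics fit in a few lines of Python (an illustrative sketch; the class name and `delay_s` parameter are invented for this example): the watermark trails the largest event time seen so far by the configured delay, and any event behind the watermark counts as late.

```python
class BoundedLatenessWatermark:
    """Toy watermark generator: the watermark trails the maximum
    observed event time by a fixed delay."""

    def __init__(self, delay_s: float):
        self.delay_s = delay_s
        self.max_event_time = float("-inf")

    def observe(self, event_time: float) -> float:
        """Advance on each incoming event; return the new watermark."""
        self.max_event_time = max(self.max_event_time, event_time)
        return self.watermark()

    def watermark(self) -> float:
        return self.max_event_time - self.delay_s

    def is_late(self, event_time: float) -> bool:
        """An event is late if it falls behind the current watermark."""
        return event_time < self.watermark()
```

With a 30-second delay, an event at t=125 arriving after an event at t=160 has already been seen is late (the watermark has advanced to 130), while an event at t=135 is still within the buffer.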
Windowing: Grouping Unbounded Data
In batch SQL, you group data with GROUP BY and the engine processes the entire dataset. In streaming, you need a way to slice the unbounded stream into finite chunks you can aggregate over. That is what windows do.
Tumbling Windows
A tumbling window is a fixed-size, non-overlapping time interval. Every event belongs to exactly one window.
Window 1: [00:00, 00:05)
Window 2: [00:05, 00:10)
Window 3: [00:10, 00:15)
Use tumbling windows when you want regular, non-overlapping summaries. “Count of orders per 5-minute interval” is a classic tumbling window use case.
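Assigning an event to its tumbling window is pure arithmetic, which a short sketch makes concrete (the function name is illustrative):

```python
def tumbling_window(event_time: float, size_s: float) -> tuple[float, float]:
    """Return the single [start, end) tumbling window containing the event."""
    start = (event_time // size_s) * size_s
    return (start, start + size_s)

# With 5-minute (300 s) windows, an order at t=412 lands in [300, 600).
```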
Sliding (Hopping) Windows
A sliding window has a fixed size but advances by a smaller step, so windows overlap. A 10-minute window that slides every 1 minute produces a new result every minute, each covering the last 10 minutes of data.
Use sliding windows when you want a moving average or a rolling count. “Average response time over the last 10 minutes, updated every minute” is a natural fit.
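Because sliding windows overlap, a single event belongs to several of them. The assignment logic can be sketched as follows (assuming, as is typical, that the window size is a multiple of the slide):

```python
def sliding_windows(event_time: float, size_s: float,
                    slide_s: float) -> list[tuple[float, float]]:
    """All [start, end) windows of width size_s, advancing by slide_s,
    that contain the event."""
    # The last window to start at or before the event:
    start = (event_time // slide_s) * slide_s
    windows = []
    # Walk backwards: a window contains the event while start > event_time - size.
    while start > event_time - size_s:
        windows.append((start, start + size_s))
        start -= slide_s
    return sorted(windows)

# A 10-minute window sliding every minute: each event appears in 10 windows.
```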
Session Windows
Session windows group events that are close together in time, with a configurable gap duration. If no event arrives within the gap, the window closes. The next event starts a new window.
Session windows are ideal for user activity analysis. If a user clicks around your site for 8 minutes, goes idle for 15 minutes, and then comes back, a session window with a 10-minute gap would produce two separate sessions.
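Sessionization over a sorted list of event times is a simple fold (a batch-flavored sketch to show the gap rule; real streaming engines do this incrementally and merge windows as events arrive):

```python
def sessionize(event_times: list[float], gap: float) -> list[list[float]]:
    """Group sorted event times into sessions: a new session starts
    whenever the gap to the previous event exceeds `gap`."""
    sessions: list[list[float]] = []
    current: list[float] = []
    for t in event_times:
        if current and t - current[-1] > gap:
            sessions.append(current)  # gap exceeded: close the session
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

# Clicks at minutes 0, 3, 8, then idle, then clicks at 23 and 25,
# with a 10-minute gap -> two sessions, matching the example above.
```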
Global Windows
A global window collects all events for a given key into a single window that never closes on its own. You pair it with a custom trigger (e.g., “fire after every 100 events” or “fire when a specific event type arrives”). Global windows are useful for accumulating state indefinitely but require careful management to avoid unbounded memory growth.
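A count-based trigger on a global window can be sketched as a per-key buffer that fires every N events (class and method names are invented for illustration):

```python
class CountTriggerWindow:
    """Global window with a count trigger: buffer events per key and
    fire (return the batch and reset) every n events. Without the
    reset, the buffer would grow without bound."""

    def __init__(self, n: int):
        self.n = n
        self.buffers: dict = {}

    def add(self, key, event):
        """Returns the buffered batch when the trigger fires, else None."""
        buf = self.buffers.setdefault(key, [])
        buf.append(event)
        if len(buf) == self.n:
            self.buffers[key] = []  # discard state once the trigger fires
            return buf
        return None
```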
State Management: The Hard Part
Stateless stream processing is easy. Filter a record, transform a field, route an event. No memory required between records.
The hard part is stateful processing: aggregations, joins, deduplication, pattern matching. All of these require the processor to remember something about the records it has already seen. And because a streaming job runs indefinitely, that state can grow very large.
What State Looks Like
State is typically a key-value store embedded in the stream processor. For a “count orders per customer” aggregation, the state might be a map from customer_id to count. Every time a new order arrives, the processor looks up the current count, increments it, and writes it back.
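In sketch form (names are illustrative; in a real framework this state lives in a fault-tolerant backend rather than a plain dict):

```python
from collections import defaultdict

class OrderCounter:
    """'Count orders per customer' with keyed state as an in-memory map."""

    def __init__(self):
        self.counts = defaultdict(int)  # customer_id -> running count

    def process(self, order: dict) -> int:
        """Read the current count, increment it, write it back, emit it."""
        self.counts[order["customer_id"]] += 1
        return self.counts[order["customer_id"]]
```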
In Apache Flink, state is managed by the framework and backed by RocksDB (an embedded key-value store) for large state sizes. Kafka Streams uses a similar approach with its state stores, backed by RocksDB and changelog topics in Kafka.
Checkpointing and Fault Tolerance
If a stream processor crashes, its in-memory state is lost. To recover without reprocessing the entire stream from the beginning, the processor periodically takes checkpoints: snapshots of all state, along with the current position in the input stream.
When the processor restarts, it loads the most recent checkpoint and resumes processing from the corresponding position. This gives you exactly-once state semantics: each input record affects the processor's state exactly once, even in the presence of failures. (End-to-end exactly-once output additionally requires transactional or idempotent sinks, because results emitted between the last checkpoint and the crash may be produced again on replay.)
Checkpoint frequency is another tradeoff. Frequent checkpoints mean faster recovery but higher I/O overhead during normal operation. A checkpoint interval of 30 seconds to 2 minutes is common for production workloads. Flink’s incremental checkpointing reduces overhead by only saving state that changed since the last checkpoint.
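The core idea, snapshot the state together with the input position so both can be restored consistently, fits in a few lines (an in-memory toy with invented names; real checkpoints are written to durable storage like S3 or HDFS):

```python
import copy

class CheckpointingConsumer:
    """Toy checkpoint/restore: every `checkpoint_every` records, snapshot
    the state *and* the offset together, so recovery resumes from a
    consistent point."""

    def __init__(self, checkpoint_every: int):
        self.checkpoint_every = checkpoint_every
        self.state: dict = {}       # e.g. counts per key
        self.offset = 0             # position in the input log
        self._checkpoint = None     # (state snapshot, offset)

    def process(self, records: list):
        for record in records[self.offset:]:
            self.state[record] = self.state.get(record, 0) + 1
            self.offset += 1
            if self.offset % self.checkpoint_every == 0:
                self._checkpoint = (copy.deepcopy(self.state), self.offset)

    def recover(self):
        """After a crash: fall back to the last consistent snapshot."""
        if self._checkpoint is not None:
            self.state = copy.deepcopy(self._checkpoint[0])
            self.offset = self._checkpoint[1]
        else:
            self.state, self.offset = {}, 0
```

After a simulated crash, `recover()` restores both the counts and the offset, so replaying the log neither drops nor double-counts records.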
State Cleanup
State that grows without bound will eventually exhaust your memory or disk. You need a strategy for cleaning up state that is no longer needed. Common approaches include:
- State TTL (time-to-live). Automatically expire state entries that have not been updated within a configured period. Good for deduplication and sessionization.
- Window-based cleanup. When a window closes and its result is emitted, the associated state can be discarded.
- Manual cleanup. In some cases, you need application logic to decide when state is no longer relevant.
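State TTL applied to deduplication can be sketched like this (lazy whole-map expiry for clarity; real state backends expire entries incrementally, and the injectable `clock` is just for testability):

```python
import time

class TTLDeduplicator:
    """Remember seen event IDs, expiring entries not touched within
    ttl_s so the state stays bounded."""

    def __init__(self, ttl_s: float, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.last_seen: dict = {}  # event_id -> last update time

    def is_duplicate(self, event_id: str) -> bool:
        now = self.clock()
        # Lazily drop anything past its TTL (a real backend does this
        # incrementally, not by rebuilding the whole map).
        self.last_seen = {k: t for k, t in self.last_seen.items()
                          if now - t <= self.ttl_s}
        seen = event_id in self.last_seen
        self.last_seen[event_id] = now
        return seen
```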
When Streaming Beats Batch
Streaming is not universally better than batch. It is more complex to build, harder to debug, and more expensive to operate. The question is always whether the value of fresh data exceeds the additional cost.
Here are the situations where the answer is clearly yes:
Fraud and Anomaly Detection
A batch job that runs hourly will detect fraud an hour after it happens. For payment fraud, that is an hour of unauthorized transactions. Stream processing lets you score each transaction as it occurs and block suspicious ones in real time. The difference between sub-second and hourly detection can be measured in dollars.
Operational Monitoring and Alerting
If your application is throwing errors at an increasing rate, you want to know now, not when the next batch job runs. Streaming pipelines that aggregate error rates over sliding windows can trigger alerts within seconds.
Real-Time Personalization
Recommendation engines that update based on a user’s current session behavior (what they are browsing, what they added to their cart) outperform models trained only on historical batch data. Streaming enables features like “customers who are currently viewing this also looked at…” which require processing events as they happen.
CDC-Powered Data Replication
Change data capture (CDC) is inherently a streaming pattern. Row-level changes flow continuously from the source database. Applying these changes to a destination in real time keeps your warehouse, cache, or search index fresh. Streamkap uses CDC with stream processing to deliver data to destinations like Snowflake, BigQuery, and ClickHouse with sub-minute latency. Trying to replicate the same freshness with batch ETL means running jobs every few minutes, which is fragile and expensive.
Multi-System Synchronization
When multiple systems need to react to the same event (an order is placed, so the inventory system decreases stock, the notification system sends an email, and the analytics system updates a dashboard), streaming gives you a single source of events that each system consumes independently. Batch approaches require each system to poll for changes on its own schedule, leading to inconsistency.
When Batch Is Still the Right Choice
Streaming adds complexity, and that complexity needs to earn its keep. Stick with batch when:
- Data freshness is not valuable. If your analysts look at yesterday’s data every morning and that is perfectly fine, a nightly batch job is simpler and cheaper.
- Your processing logic requires the full dataset. Some computations, like training a machine learning model on all historical data, are inherently batch operations.
- You are joining data from many slow-moving sources. If your sources update once a day, streaming is just batch with extra steps.
- Your team does not have streaming experience. The operational cost of debugging a stuck Flink job at 2 AM is real. If the freshness gain is marginal, the complexity is not worth it.
Getting Started: A Practical Path
If you are a data engineer starting with stream processing, here is a progression that has worked well for many teams:
1. Start with a Managed CDC Pipeline
Before you write any stream processing code, get comfortable with real-time data flow. Set up a CDC pipeline from your primary database to your warehouse using a managed platform like Streamkap. This gets real-time data into your stack without requiring you to manage Kafka, Debezium, or Flink yourself. You will build intuition for event ordering, schema evolution, and data freshness.
2. Add Simple Transformations
Once data is flowing, add lightweight transformations. Filter out test data, rename columns, mask PII fields. Many CDC platforms (including Streamkap) support single message transforms (SMTs) that run in the pipeline without a separate processing framework.
3. Learn a Stream Processing Framework
When you need more than simple transforms (aggregations, joins, windowed computations), pick a framework and go deep. Apache Flink is the strongest general-purpose option. Its SQL API lets you write streaming queries that look similar to batch SQL, which flattens the learning curve for data engineers.
-- Flink SQL: count orders per customer in 5-minute tumbling windows
SELECT
  customer_id,
  TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start,
  COUNT(*) AS order_count
FROM orders
GROUP BY
  customer_id,
  TUMBLE(event_time, INTERVAL '5' MINUTE);
Kafka Streams is another solid choice if your data is already in Kafka and you want a lightweight library that embeds in your Java or Kotlin application without a separate cluster.
4. Build Observability from Day One
Streaming jobs are long-running, and problems compound over time. Instrument your pipelines from the start:
- Consumer lag. How far behind is your processor from the latest event? Rising lag means your job cannot keep up.
- Checkpoint duration and size. Increasing checkpoint times indicate growing state or I/O problems.
- Event throughput. Records in and out per second, broken down by source and destination.
- Watermark progress. If watermarks stop advancing, your windows are stuck and no output is being produced.
- Backpressure. In Flink, backpressure metrics tell you which operator is the bottleneck.
Set alerts on these metrics early. A streaming pipeline that silently falls behind is worse than a batch job that visibly fails.
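Consumer lag, the first metric above, is just the gap between the log end and your committed position, per partition (a sketch; in Kafka you would fetch end offsets from the broker and committed offsets from the consumer group):

```python
def consumer_lag(end_offsets: dict[int, int],
                 committed: dict[int, int]) -> dict[int, int]:
    """Per-partition lag: how far the committed position trails the
    log end. A partition with no committed offset counts from zero."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

# Alert when any partition's lag keeps rising across successive samples,
# not on a single absolute value, since normal bursts cause brief spikes.
```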
5. Plan for Schema Evolution
Schemas change. Columns get added, types get altered, tables get renamed. In batch processing, you update the query and rerun. In streaming, the job is running continuously and needs to handle the new schema without stopping.
Use a schema registry (Confluent Schema Registry or AWS Glue Schema Registry) to manage schema versions. Configure your serializers to use backward-compatible schema evolution so that new fields have defaults and removed fields are handled gracefully. Streamkap handles schema evolution automatically, propagating column additions and type changes to the destination without pipeline restarts.
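The backward-compatible read path amounts to filling fields the producer did not send with the reader schema's defaults and dropping fields the reader no longer knows, which a simplified sketch shows (a toy version of what Avro schema resolution does for you; the function name and field names are invented):

```python
def apply_reader_schema(record: dict, reader_defaults: dict) -> dict:
    """Resolve a record against the reader's schema: missing fields get
    their defaults, unknown fields are ignored."""
    return {field: record.get(field, default)
            for field, default in reader_defaults.items()}

# An old producer omits `country`; a renamed `legacy_col` is ignored.
```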
The Bigger Picture
Stream processing is not a replacement for batch. It is a different tool for a different class of problems. The modern data stack increasingly requires both: batch for historical analysis and model training, streaming for operational intelligence and real-time replication.
The good news for data engineers is that the tooling has matured significantly. You no longer need to build everything from scratch. Managed platforms handle the infrastructure, Flink SQL brings streaming closer to the SQL you already know, and CDC tools like those from Streamkap make it easy to get real-time data flowing without a months-long infrastructure project.
Start small, build intuition, and invest in the areas where fresh data has clear business value. That is the pragmatic path from batch-first to streaming-capable.