Stream Processing Architecture: Patterns for Real-Time Data Systems
An architect's guide to stream processing patterns - Lambda, Kappa, event sourcing, CQRS, materialized views, and exactly-once semantics - with decision frameworks for each.
The phrase “real-time data system” covers a spectrum from operational databases refreshed every few seconds to sub-millisecond event processing pipelines. The architectural patterns that work at one end of this spectrum fail at the other. Senior engineers and architects designing real-time systems must understand not just what patterns exist, but when each pattern applies - and what it costs.
This guide covers the six most important architectural patterns for stream processing systems, with decision frameworks, tradeoffs, and the questions you need to answer before choosing.
Setting the Context: What Is Stream Processing?
Stream processing is the continuous computation of results over an unbounded sequence of events. Unlike batch processing - which operates on a bounded, complete dataset - stream processing must produce results incrementally as events arrive, without waiting for a “final” dataset.
This creates a set of fundamental tensions that every streaming architecture must resolve:
- Completeness vs. latency: Waiting longer produces more accurate results; acting faster requires accepting incomplete information
- Consistency vs. availability: Guaranteeing exactly-once processing requires coordination that adds latency and reduces availability
- Complexity vs. capability: More powerful processing semantics require more complex infrastructure
Every pattern in this guide represents a different resolution of these tensions.
Pattern 1: Lambda Architecture
Lambda architecture, formalized by Nathan Marz circa 2012, addresses a core problem: stream processing at the time was not accurate enough for analytical workloads, but batch processing was too slow for operational use cases. Lambda’s solution: run both.
Structure
Raw Events ─┬─► Batch Layer (Hadoop, Spark)
            │        │
            │        ▼
            │   Batch Views (complete, accurate, slow) ──┐
            │                                            ├─► Serving Layer ──► Query
            │                                            │   (merge at query time)
            └─► Speed Layer (Storm, Spark Streaming)     │
                     │                                   │
                     ▼                                   │
                Real-time Views (incomplete, fast) ──────┘
Batch layer: Processes the complete event history. Results are accurate because they operate on a full, immutable dataset. Latency: hours.
Speed layer: Processes only the most recent events not yet covered by the batch layer. Results are approximate or incomplete. Latency: seconds.
Serving layer: Merges results from both layers at query time. The query returns the batch result for old data and the speed-layer result for recent data.
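The query-time merge can be sketched in a few lines. This is a hypothetical illustration, not code from any Lambda framework: the batch view holds accurate counts up to a cutoff, the speed view holds counts for events after it, and the serving layer sums the two per key.

```python
# Hypothetical sketch of Lambda's query-time merge. All names are
# illustrative; real serving layers (e.g. over Cassandra or HBase)
# follow the same shape.

def merge_views(batch_view, speed_view, key):
    """Return the batch count plus the speed-layer count for a key."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

# Batch view: accurate counts over the full history up to the cutoff.
batch_view = {"us-east": 10_000, "eu": 4_200}
# Speed view: counts for events that arrived after the cutoff.
speed_view = {"us-east": 37, "eu": 5, "apac": 2}

print(merge_views(batch_view, speed_view, "us-east"))  # 10037
```

Note that a key seen only by the speed layer ("apac" above) still resolves correctly, because each layer defaults to zero for keys it has not covered.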
When Lambda Makes Sense
- When your stream processing logic cannot be expressed incrementally (global top-N rankings, exact distinct counts, ML inference over full history)
- When batch accuracy requirements are regulatory or contractual
- When the team has existing investment in a batch processing platform
Why Lambda Falls Out of Favor
The fundamental problem with Lambda is code duplication. You write every computation twice: once in the batch framework (Spark, Hive), once in the stream framework (Storm, Kafka Streams). These implementations inevitably diverge over time. Bug fixes applied to one path are not automatically applied to the other. The “query time merge” logic adds additional complexity.
Modern stream processors have largely eliminated the accuracy gap that motivated Lambda. Apache Flink with event-time processing, watermarks, and replayable Kafka topics can produce batch-accurate results from a streaming computation. For most new systems in 2026, Lambda is an antipattern.
Pattern 2: Kappa Architecture
Kappa architecture, proposed by Jay Kreps in 2014, eliminates Lambda’s batch layer entirely. The insight: if you have a replayable log of immutable events (Kafka, a durable message queue), you can always reprocess history by replaying from the beginning. There is no need for a separate batch layer.
Structure
Immutable Event Log (Kafka)
│
│ Stream Processing (Flink, Kafka Streams)
▼
Serving Layer ──► Query
For reprocessing (bug fix, backfill, algorithm change), start a new consumer group from the beginning of the log:
Immutable Event Log (Kafka, long retention)
├── Consumer Group A (current version, reading from offset N)
└── Consumer Group B (new version, replaying from offset 0)
│ (when B catches up to A, switch over)
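The reprocessing flow above can be sketched with an in-memory list standing in for the Kafka log. This is a minimal illustration under that assumption, not Kafka client code: two processing functions replay the same immutable log, and traffic switches to the new version once it has caught up.

```python
# Minimal sketch of Kappa reprocessing. The list stands in for a Kafka
# topic; the two functions stand in for consumer groups A and B.

log = [{"amount": a} for a in (10, 20, 30, 40)]  # append-only event log

def process_v1(events):
    """Current version of the job."""
    return sum(e["amount"] for e in events)

def process_v2(events):
    """New version, e.g. after a bug fix that excludes small amounts."""
    return sum(e["amount"] for e in events if e["amount"] >= 20)

# Consumer group A: current version, has consumed the whole log.
state_a = process_v1(log)
# Consumer group B: new version, replays from offset 0.
state_b = process_v2(log)

# Once B catches up to A's offset, queries switch to B's state.
print(state_a, state_b)  # 100 90
```

The sketch also shows why determinism matters: if `process_v2` depended on wall-clock time or external lookups, the replayed state would not be reproducible.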
Kappa Requirements
Kappa architecture makes a strong assumption: the event log is complete and replayable. This requires:
- Long retention in Kafka (days to months, not hours)
- Immutable events: never mutate past events; append-only
- Deterministic processing: same input produces same output on replay
- Stream-capable logic: all business logic can be expressed as incremental computation
Lambda vs. Kappa Decision Matrix
| Consideration | Lambda | Kappa |
|---|---|---|
| Processing logic incrementally expressible | Optional | Required |
| Duplicate code maintenance | Yes | No |
| Historical reprocessing | Fast (batch) | Possible (replay, slower) |
| Operational complexity | High (two layers) | Medium (one layer) |
| Recommended for new systems (2026) | Only if needed | Default choice |
Pattern 3: Event Sourcing
Event sourcing inverts the traditional relationship between state and history. In a standard CRUD database, you store current state and discard the operations that produced it. In event sourcing, you store the operations (events) as the primary record, and derive current state by replaying them.
Structure
Command ──► Event Store (append-only log of immutable events)
│
│ Projection (replay events to build state)
▼
Current State (derived, rebuildable)
A bank account in event sourcing:
Events (the source of truth):
- AccountOpened(id=101, initial_balance=0)
- MoneyDeposited(id=101, amount=500, ts=2024-01-10)
- MoneyWithdrawn(id=101, amount=200, ts=2024-01-15)
- MoneyDeposited(id=101, amount=1000, ts=2024-01-20)
Current state (derived by replaying events):
balance = 0 + 500 - 200 + 1000 = 1300
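The bank-account example can be replayed in code. The event names mirror the list above; the fold over the event list is the projection that derives current state.

```python
# The article's bank-account events as (type, payload) tuples.
events = [
    ("AccountOpened",  {"id": 101, "initial_balance": 0}),
    ("MoneyDeposited", {"id": 101, "amount": 500}),
    ("MoneyWithdrawn", {"id": 101, "amount": 200}),
    ("MoneyDeposited", {"id": 101, "amount": 1000}),
]

def project_balance(events):
    """Replay events in order to derive the current balance."""
    balance = 0
    for kind, data in events:
        if kind == "AccountOpened":
            balance = data["initial_balance"]
        elif kind == "MoneyDeposited":
            balance += data["amount"]
        elif kind == "MoneyWithdrawn":
            balance -= data["amount"]
    return balance

print(project_balance(events))  # 1300
```

A temporal query falls out for free: replaying only the events up to January 12 (`events[:2]` here) yields the balance as of that date, 500.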
Benefits of Event Sourcing
Complete audit trail: Every state change is recorded with timestamp and context. This is not a separate audit log bolted on - it is the primary data model.
Temporal queries: “What was the account balance on January 12?” - replay events up to that date.
Event replay: Debug, test, and rebuild downstream systems by replaying the event log.
Decoupled downstream consumers: Any number of projections can be built from the same event stream without modifying the source application.
When Event Sourcing Adds Complexity Without Benefit
Event sourcing is not appropriate for all domains:
- High-frequency updates to the same entity: Rebuilding state by replaying 10 million events for a single account is impractical without snapshotting
- Simple CRUD applications: The added complexity is not justified if audit trails and temporal queries are not requirements
- Teams unfamiliar with the pattern: Event sourcing requires a significant mental model shift; the operational complexity is real
Snapshotting addresses the replay performance problem: periodically save a snapshot of current state, so replay only needs to cover events since the last snapshot.
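A sketch of that snapshotting scheme, with a list standing in for the event store: persist `(offset, state)` periodically, then rebuild by loading the snapshot and replaying only the tail of the log.

```python
# Illustrative snapshotting sketch: the names and storage format are
# assumptions, not any specific event-store API.

def replay(state, events):
    """Fold a slice of the log into the running state."""
    for e in events:
        state += e["amount"]
    return state

log = [{"amount": i} for i in range(1, 1001)]  # 1000 deposit events

# Snapshot taken after event 900: offset plus the state at that point.
snapshot = {"offset": 900, "state": replay(0, log[:900])}

# Rebuild: start from the snapshot, replay only the last 100 events...
fast = replay(snapshot["state"], log[snapshot["offset"]:])
# ...instead of replaying all 1000 from scratch.
full = replay(0, log)

print(fast == full)  # True
```

The rebuild touches 100 events instead of 1000; for the "10 million events per account" case in the text, the same idea turns an impractical replay into a bounded one.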
Event Sourcing vs. CDC
Both capture state changes as events. The distinction:
- Event sourcing: The application writes events as its primary state store
- CDC: Infrastructure captures events from an existing database’s transaction log
CDC gives event-sourcing-like benefits (replayable changelog) to applications not designed with event sourcing. Change data capture pipelines can serve as the streaming foundation for downstream event-sourcing patterns.
Pattern 4: CQRS (Command Query Responsibility Segregation)
CQRS separates the write model (command side) from the read model (query side). Instead of a single database serving both reads and writes, the write path and read path use different, optimized stores.
Structure
Command ──► Write Model (normalized, ACID, optimized for writes)
│
│ Event / CDC stream
▼
Read Models (denormalized, optimized for specific query patterns)
├── Read Model A: User profile view (key-value store)
├── Read Model B: Search index (Elasticsearch)
└── Read Model C: Analytics aggregate (ClickHouse / Snowflake)
CQRS in Practice
A product catalog service using CQRS:
| Layer | Store | Purpose |
|---|---|---|
| Write model | PostgreSQL (normalized) | ACID writes, foreign key constraints |
| Read model: API | Redis (denormalized JSON) | Sub-millisecond product lookups |
| Read model: Search | Elasticsearch | Full-text search, faceted filtering |
| Read model: Analytics | ClickHouse | Category-level aggregations, pricing trends |
The read models are maintained by CDC pipelines that stream changes from PostgreSQL to each downstream store. When a product is updated in PostgreSQL, the change propagates to Redis, Elasticsearch, and ClickHouse within seconds.
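The propagation step can be sketched as a projection function: one change event from the write model is applied to read models shaped for their query patterns. Plain dicts stand in for Redis and Elasticsearch here; the event shape is illustrative, not a specific CDC format.

```python
# Hypothetical CDC event from the write model (PostgreSQL).
change = {"op": "update", "table": "products",
          "row": {"id": 7, "name": "Widget", "category": "tools",
                  "price": 9.99}}

kv_store, search_index = {}, {}  # stand-ins for Redis / Elasticsearch

def apply_to_read_models(event):
    """Project one change event into each read model."""
    row = event["row"]
    # Key-value read model: denormalized document keyed by product id.
    kv_store[f"product:{row['id']}"] = row
    # Search read model: tokenized name mapped to matching product ids.
    for token in row["name"].lower().split():
        search_index.setdefault(token, set()).add(row["id"])

apply_to_read_models(change)
print(kv_store["product:7"]["price"], sorted(search_index["widget"]))
```

Each read model gets its own projection of the same event, which is exactly the "separate pipeline per read model" cost listed below.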
CQRS Tradeoffs
Benefits:
- Read models optimized for specific access patterns - no index tuning compromises
- Write path isolated from read load
- Independent scaling of read and write tiers
- Natural fit for event sourcing (events on the write side fan out to read models)
Costs:
- Eventual consistency: read models lag behind the write model (typically seconds)
- Operational complexity: multiple stores to monitor, backup, and keep synchronized
- Code complexity: each read model requires a separate projection/synchronization pipeline
CQRS is most valuable when read and write patterns are highly divergent - high-volume reads with complex filtering, and transactional writes with integrity constraints. It is overkill for applications with simple, symmetric read/write patterns.
Pattern 5: Materialized Views
A materialized view is a pre-computed, stored result of a query. In streaming systems, materialized views are continuously updated as new events arrive - providing the query performance of a pre-aggregated snapshot with the freshness of a real-time stream.
Static vs. Streaming Materialized Views
| Type | Updated | Freshness | Example |
|---|---|---|---|
| Database materialized view | On schedule or on demand | Minutes to hours | Nightly summary tables |
| Streaming materialized view | Continuously, as events arrive | Seconds | Real-time dashboard aggregates |
Streaming Materialized View Patterns
Windowed aggregation: Count events within a time window.
-- Flink SQL: tumbling 1-hour order count by region
SELECT
region,
TUMBLE_START(event_time, INTERVAL '1' HOUR) AS window_start,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue
FROM orders
GROUP BY
region,
TUMBLE(event_time, INTERVAL '1' HOUR)
Changelog view (upsert stream): Maintain a current-state view of a mutable entity.
CDC events (inserts, updates, deletes) ──► Flink stateful processing
│
▼
Current-state table (latest row per key)
│
▼
Redis / ClickHouse / Snowflake
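The stateful step in that diagram reduces to "keep the latest row per key, drop keys on delete." A minimal sketch of that fold, with a dict standing in for the processor's keyed state:

```python
# Illustrative changelog (upsert) fold. The event shape is an assumption,
# not a specific CDC format.

def apply_changelog(events):
    """Fold CDC events into a current-state table: latest row per key."""
    table = {}
    for e in events:
        if e["op"] == "delete":
            table.pop(e["key"], None)
        else:  # inserts and updates are both upserts
            table[e["key"]] = e["row"]
    return table

events = [
    {"op": "insert", "key": 1, "row": {"status": "new"}},
    {"op": "update", "key": 1, "row": {"status": "shipped"}},
    {"op": "insert", "key": 2, "row": {"status": "new"}},
    {"op": "delete", "key": 2},
]

print(apply_changelog(events))  # {1: {'status': 'shipped'}}
```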
Denormalized join view: Pre-join two tables into a single flattened record.
This is the real-time enrichment pattern applied to produce a persistent materialized view rather than an enriched event stream.
Where Materialized Views Live
| Destination | Use Case |
|---|---|
| Redis | Sub-millisecond key lookups; operational read models |
| ClickHouse | Real-time analytics with second-range query latency |
| Elasticsearch | Search and filtering over current state |
| Snowflake / BigQuery | Pre-aggregated warehouse tables for BI tools |
| RocksDB (embedded in Flink) | Internal stream processor state |
Pattern 6: Fan-Out and Fan-In
Fan-out distributes a single stream to multiple consumers or downstream topics. Fan-in merges multiple streams into a single consolidated stream.
Fan-Out
Source stream (CDC events)
│
├──► Topic: events.raw (archive, compliance)
├──► Topic: events.enriched (ML features, analytics)
├──► Topic: events.filtered (privacy-scrubbed, for third parties)
└──► Topic: events.aggregated (pre-summed metrics)
Fan-out is implemented by:
- Multiple consumer groups reading the same topic independently (simplest; consumers own their own derived state)
- Stream processor with multiple sinks (one Flink job writes to multiple output topics)
- Kafka Connect with multiple sink connectors (each connector reads the same topic and writes to a different destination)
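The second variant (one stream processor, multiple sinks) can be sketched as a dispatch table: each output topic gets its own transformation of the same input event. The topic names follow the diagram above; the transforms are illustrative.

```python
# Sketch of stream-processor fan-out: one event, several derived outputs.

def fan_out(event, sinks):
    """Apply each sink's transform and collect per-topic outputs."""
    return {topic: transform(event) for topic, transform in sinks.items()}

sinks = {
    "events.raw":      lambda e: e,                            # archive as-is
    "events.filtered": lambda e: {k: v for k, v in e.items()
                                  if k != "email"},            # scrub PII
    "events.enriched": lambda e: {**e, "region": "us-east"},   # add a feature
}

event = {"user": 42, "email": "a@b.c", "amount": 10}
out = fan_out(event, sinks)
print(sorted(out))  # ['events.enriched', 'events.filtered', 'events.raw']
```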
Fan-In (Stream Merging)
Topic A: orders.us_east ──┐
Topic B: orders.us_west ──┤── Union ──► Topic: orders.global
Topic C: orders.eu ──┘
Fan-in is used to consolidate events from multiple regional or partitioned sources into a single global stream. Flink’s DataStream.union() or the SQL UNION ALL operator handles this. Key concern: event-time alignment - events from different sources may arrive with different clock skews, requiring watermark strategies that account for the slowest upstream partition.
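The watermark rule behind that concern is simple to state in code: the merged stream can only advance event time as far as its slowest input. A sketch, with per-source watermarks as plain timestamps:

```python
# Sketch of watermark propagation through a union. Source names follow
# the fan-in diagram; timestamps are illustrative epoch seconds.

def merged_watermark(source_watermarks):
    """Event-time watermark of a union is the minimum over its inputs."""
    return min(source_watermarks.values())

watermarks = {
    "orders.us_east": 1_700_000_060,
    "orders.us_west": 1_700_000_055,
    "orders.eu":      1_700_000_010,  # lagging source holds everyone back
}

print(merged_watermark(watermarks))  # 1700000010
```

This is why a single slow or idle partition stalls windowed results for the whole merged stream, and why Flink offers idleness handling for sources that stop emitting.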
Exactly-Once Semantics: Theory and Practice
Exactly-once processing guarantees that each event affects the output exactly once - no duplicates, no omissions - even in the presence of failures and retries. It is the most demanding delivery guarantee in stream processing.
The Three Delivery Guarantees
| Guarantee | Description | Failure behavior |
|---|---|---|
| At-most-once | Each event processed zero or one time | Data loss on failure |
| At-least-once | Each event processed one or more times | Duplicates on failure |
| Exactly-once | Each event processed exactly once | No data loss, no duplicates |
How Exactly-Once Is Achieved
End-to-end exactly-once requires coordination at three levels:
- Source: The source must be replayable (Kafka with committed offsets) and must not advance its offset until processing is committed downstream
- Processor: The stream processor must checkpoint state atomically - Flink’s distributed snapshots (Chandy-Lamport algorithm) achieve this
- Sink: The destination must support either idempotent writes or transactional commits coordinated with the processor’s checkpoint
Flink’s exactly-once mode coordinates checkpoints with Kafka offset commits and transactional sink writes using a two-phase commit protocol. The checkpoint interval (typically 1-30 seconds) determines the granularity of the exactly-once guarantee - and the latency floor.
When Exactly-Once Is Worth the Cost
| Operation | Idempotent? | Recommendation |
|---|---|---|
| Write row with primary key (Snowflake, Postgres) | Yes | At-least-once + idempotent write |
| Increment counter | No | Exactly-once required |
| Send notification / trigger webhook | No | Exactly-once, or deduplication at consumer |
| Compute deterministic aggregate | Yes (if state is checkpointed) | At-least-once sufficient |
| Financial calculation (debit/credit) | No | Exactly-once required |
The practical guidance: use at-least-once delivery with idempotent destination writes as the default. Reserve exactly-once for operations that are genuinely non-idempotent and where duplicate execution has real-world consequences.
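That default is easy to demonstrate. In this sketch (dicts and counters standing in for real sinks), a redelivered event overwrites its previous write in the keyed store but double-counts in the naive counter:

```python
# At-least-once delivery against two sinks: one idempotent (upsert by
# primary key), one not (blind increment). The redelivery is simulated.

def idempotent_write(store, event):
    """Upsert by primary key: redelivery overwrites, never duplicates."""
    store[event["id"]] = event["amount"]

def naive_increment(counter, event):
    """Blind increment: redelivery double-counts."""
    return counter + event["amount"]

delivery = [{"id": "e1", "amount": 5},
            {"id": "e2", "amount": 7},
            {"id": "e2", "amount": 7}]  # e2 redelivered after a retry

store, counter = {}, 0
for e in delivery:
    idempotent_write(store, e)
    counter = naive_increment(counter, e)

print(sum(store.values()), counter)  # 12 19
```

The keyed store lands on the correct total (12) despite the duplicate; the counter drifts to 19, which is the case that genuinely needs exactly-once.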
Architecture Decision Framework
When designing a new stream processing system, work through these decisions in order:
Decision 1: Single-layer or two-layer?
Is your processing logic fully expressible as incremental computation?
- Yes → Kappa (single streaming layer)
- No (global aggregations, ML training, complex algorithms) → Lambda (streaming + batch layers)
Decision 2: Event sourcing or CDC?
Are you building a new application or integrating with existing databases?
- New application, audit trail is a core requirement → Event sourcing
- Existing databases, need streaming without app changes → CDC
Decision 3: Unified read/write model or CQRS?
Are read and write patterns symmetric or divergent?
- Symmetric (simple key lookups, no complex query patterns) → Single store, no CQRS
- Divergent (complex search, analytics, operational APIs) → CQRS with streaming materialized views
Decision 4: Exactly-once or at-least-once?
Are the operations in your pipeline idempotent?
- Yes → At-least-once with idempotent destination writes
- No → Exactly-once (accept higher latency and operational complexity)
Decision 5: Fan-out topology?
How many consumers need the stream?
- One → Direct stream; consumer owns its derived state
- Many, same schema → Multiple consumer groups on the same topic
- Many, different schemas or SLAs → Kafka fan-out with per-consumer derived topics
Putting It Together: A Reference Architecture
A production-grade real-time data platform for a mid-size engineering organization:
Source Systems (PostgreSQL, MySQL, MongoDB)
│
│ CDC (Streamkap)
▼
Kafka (immutable event log, 30-day retention)
│
┌────┴─────┐
│ │
▼ ▼
Flink Direct consumers
(Kappa (microservices reading
processing) raw CDC topics)
│
├──► Materialized views → Redis (operational read models)
├──► Enriched streams → Snowflake (analytics, CQRS read model)
├──► Aggregated metrics → ClickHouse (real-time dashboards)
└──► Search indexes → Elasticsearch (full-text search)
This architecture uses Kappa (no batch layer), CQRS (write in normalized source DBs, read from optimized derived stores), materialized views (pre-computed aggregates in ClickHouse), and fan-out (Kafka feeding both Flink and direct consumers). Delivery is at-least-once with idempotent writes to all destinations.
Picking the Right Architecture
No single stream processing architecture is universally correct. Lambda is operationally expensive but necessary when incremental computation cannot match batch accuracy. Kappa is simpler and increasingly capable but requires a replayable event log and incremental-friendly business logic. Event sourcing and CQRS add power for applications that need complete audit histories and divergent read models, at the cost of significant complexity.
The most common mistake in stream processing architecture is over-engineering - applying Lambda when Kappa suffices, implementing exactly-once when at-least-once with idempotent writes achieves the same result at lower cost. The second most common mistake is under-engineering - choosing at-least-once without verifying that downstream operations are idempotent.
Work through the decision framework above honestly. The right architecture for your system is the one that satisfies your correctness, latency, and operational constraints with the minimum necessary complexity.