
Engineering

February 25, 2026

11 min read

The Kappa Architecture: Simplifying Data Pipelines with Streaming

A practical guide to the Kappa architecture pattern. Learn how replacing batch layers with a single streaming pipeline reduces complexity, and when it works best.

TL;DR:

  • Kappa architecture replaces Lambda's dual batch+streaming layers with a single streaming pipeline.
  • The core idea: treat everything as a stream, including historical reprocessing.
  • This eliminates the code duplication and reconciliation problems of Lambda architecture.
  • Kappa works well when your processing logic is the same for real-time and historical data.

If you have spent any time building data pipelines, you know the pain of maintaining two separate codebases that are supposed to produce the same results: one for batch and one for streaming. The Lambda architecture popularized this dual-layer approach, and while it solved real problems, it introduced new ones. The Kappa architecture is a direct response to that complexity. It asks a simple question: what if you only needed the streaming layer?

The Problem with Lambda

Nathan Marz introduced Lambda architecture to solve a genuine challenge. Batch systems like Hadoop MapReduce were good at processing large historical datasets with accuracy, but they could not deliver results in real time. Stream processors could deliver low-latency results, but they were less mature and harder to make exactly correct. Lambda’s answer was to run both: a batch layer for accuracy and a speed layer for low-latency approximations.

The architecture has three layers:

  • Batch layer - Processes the complete dataset on a schedule (hourly, daily). This is the “source of truth” that periodically recomputes views from scratch.
  • Speed layer - Processes new data in real time, filling the gap between batch runs. Results here are approximate and temporary.
  • Serving layer - Merges batch and speed layer outputs to answer queries.

In theory, this is elegant. In practice, it means you write and maintain the same business logic twice - once in a batch framework (Spark, MapReduce) and once in a streaming framework (Flink, Storm, Kafka Streams). These two implementations need to produce consistent results, which is harder than it sounds. Different execution models, different windowing semantics, different failure modes - all lead to subtle discrepancies that are painful to debug.
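
To make the two-codepath problem concrete, here is an illustrative Python sketch (event shapes and names are hypothetical, not from any real framework) of the same "revenue by category" logic written twice, once batch-style and once streaming-style, as Lambda requires:

```python
# Illustrative only: the same aggregation written twice, as Lambda requires.
from collections import defaultdict

events = [
    {"category": "books", "amount": 12.0},
    {"category": "games", "amount": 30.0},
    {"category": "books", "amount": 8.0},
]

def batch_revenue(all_events):
    """Batch-style: recompute the full aggregate from scratch."""
    totals = defaultdict(float)
    for e in all_events:
        totals[e["category"]] += e["amount"]
    return dict(totals)

class StreamingRevenue:
    """Streaming-style: the same logic as incremental state updates."""
    def __init__(self):
        self.totals = defaultdict(float)

    def on_event(self, e):
        self.totals[e["category"]] += e["amount"]

job = StreamingRevenue()
for e in events:
    job.on_event(e)

# Both code paths must stay in sync forever -- that is the Lambda tax.
assert batch_revenue(events) == dict(job.totals)
```

Even in this toy form, every change to the business rule has to be made twice, and any divergence between the two implementations shows up as inconsistent results downstream.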

You also need a reconciliation process. When the batch layer catches up, the serving layer has to merge batch results with speed layer results and handle the transition. This is operational overhead that scales with the number of pipelines you run.

Jay Kreps and the Case for Simplification

In 2014, Jay Kreps - one of the creators of Apache Kafka - published a blog post called “Questioning the Lambda Architecture.” His argument was straightforward: if stream processing frameworks have matured enough to handle the workloads that batch used to own, why maintain two systems?

Kreps proposed what he called the Kappa architecture. The idea is to replace the batch and speed layers with a single streaming layer, backed by a replayable log (like Kafka). Instead of periodically recomputing everything from scratch with a batch job, you reprocess by replaying events from the log through an updated streaming job.

The key insight is that a replayable log with sufficient retention is your batch layer. If you can read from the beginning of a Kafka topic and replay every event through your streaming job, you get the same result as a batch recomputation - but using the same code, the same framework, and the same operational model.
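
The insight can be sketched in a few lines of Python. This is a toy replayable log (a list plus offsets), not Kafka itself, but it shows how one processing function serves both full replay and tail-only consumption:

```python
# A toy replayable log: a list plus per-consumer read offsets.
# Names are hypothetical; a real system would use Kafka topics.
class Log:
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def read_from(self, offset):
        """Replayable read: any consumer can start at any offset."""
        return self.records[offset:]

log = Log()
for amount in [10, 20, 30]:
    log.append({"amount": amount})

# One streaming function serves both "historical" and "live" processing.
def process(records):
    return sum(r["amount"] for r in records)

full_total = process(log.read_from(0))    # replay from the start == batch result
tail_total = process(log.read_from(2))    # tail-only == speed-layer behavior
assert full_total == 60 and tail_total == 30
```

The same `process` function produces the batch answer when started from offset 0 and the real-time answer when started from the tail, which is exactly the one-codepath property Kappa is after.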

How Kappa Architecture Works

The Kappa architecture has fewer moving parts than Lambda:

  1. Immutable log - All incoming data is appended to a durable, ordered log. Kafka is the canonical choice, but any system that supports ordered, replayable reads works. The log retains data long enough for reprocessing - days, weeks, or indefinitely with tiered storage.

  2. Stream processing layer - A single set of streaming jobs reads from the log, applies your business logic, and writes results to a serving store (a database, a search index, an analytics warehouse - wherever your consumers need the data).

  3. Reprocessing via replay - When you need to change your processing logic (bug fix, schema evolution, new business rule), you deploy a second instance of your streaming job configured to read from the beginning of the log. This new job replays all historical events through the updated logic. Once it has caught up to the present, you redirect consumers to the new output and shut down the old job.

That is the entire architecture. There is no separate batch layer, no reconciliation between batch and real-time outputs, no two-codepath problem.

The Reprocessing Pattern in Detail

Reprocessing is where most people get skeptical about Kappa, so it is worth walking through the mechanics.

Say you have a streaming job that computes daily revenue by product category, reading from a Kafka topic called orders. You discover a bug in how refunds are handled. Here is what you do:

  1. Fix the logic in your streaming job code.
  2. Deploy the fixed job as a new consumer group, configured with auto.offset.reset=earliest (or equivalent). It starts reading from the oldest available offset in the orders topic.
  3. The new job replays all historical order events through the corrected logic, writing results to a new output table (e.g., revenue_by_category_v2).
  4. Monitor progress. Once the new job’s consumer lag drops to near zero, it has caught up to real-time.
  5. Switch consumers to read from revenue_by_category_v2. Shut down the old job. Optionally drop the old table.

This gives you the same outcome as a batch recomputation, but you only had to write and maintain one processing job.
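
The five steps above can be simulated in plain Python. This is a hedged sketch, not real Kafka consumer code: the `orders` topic is a list, consumer lag is computed by hand, and the v1/v2 names mirror the example:

```python
# A toy walkthrough of the reprocessing steps. All names are hypothetical.
from collections import defaultdict

orders = [  # the replayable "orders" topic, oldest offset first
    {"category": "books", "amount": 20.0, "type": "sale"},
    {"category": "books", "amount": 5.0,  "type": "refund"},
    {"category": "games", "amount": 40.0, "type": "sale"},
]

def buggy_logic(totals, e):          # v1: refunds wrongly added to revenue
    totals[e["category"]] += e["amount"]

def fixed_logic(totals, e):          # v2: refunds subtracted
    sign = -1.0 if e["type"] == "refund" else 1.0
    totals[e["category"]] += sign * e["amount"]

def run_job(logic, topic, start_offset=0):
    """Deploy a job as a fresh consumer reading from start_offset."""
    totals, offset = defaultdict(float), start_offset
    for e in topic[start_offset:]:
        logic(totals, e)
        offset += 1
    lag = len(topic) - offset        # consumer lag: 0 means caught up
    return dict(totals), lag

revenue_by_category_v1, _ = run_job(buggy_logic, orders)
revenue_by_category_v2, lag = run_job(fixed_logic, orders)  # replay from 0
assert lag == 0                                  # caught up: safe to switch
assert revenue_by_category_v2["books"] == 15.0   # 20 sale - 5 refund
assert revenue_by_category_v1["books"] == 25.0   # the bug v2 corrects
```

The lag check is the cut-over signal: only once the replaying job has consumed to the end of the topic do you point consumers at the v2 output.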

The practical requirement is that your Kafka topics retain data long enough. With Kafka’s tiered storage (introduced in early access in Kafka 3.6 and declared production-ready in 3.9), you can retain data in object storage like S3 indefinitely without blowing up your broker disk costs. This was one of the original objections to Kappa - that log retention was too expensive - and tiered storage has largely resolved it.

Lambda vs. Kappa: When to Use Each

Kappa is not universally better than Lambda. Each pattern fits different situations.

Kappa works well when:

  • Your batch and streaming logic is fundamentally the same. If you would write the same transformations in both layers, Kappa eliminates the duplication.
  • Your data naturally arrives as events or changes. Transactional databases, clickstreams, IoT sensors, and application logs are all natural fits.
  • Reprocessing volumes are manageable. If replaying a week or month of data through a streaming job completes in a reasonable time, Kappa’s reprocessing model is practical.
  • You want operational simplicity. One framework, one deployment model, one set of monitoring dashboards.

Lambda is still a better fit when:

  • Your batch and streaming logic are fundamentally different. For example, if your batch layer runs large-scale ML model training using Spark MLlib that cannot be expressed as a streaming operator, Lambda’s separation makes sense.
  • You need to reprocess years of data and replaying through a stream processor would take an unreasonable amount of time.
  • Your organization already has heavy investment in batch infrastructure and the migration cost is not justified by the simplification benefits.
  • You need different correctness guarantees for real-time vs. historical results (e.g., approximate real-time dashboards but exact batch reports for compliance).

In practice, many teams start with Lambda because it matches the tools they already know (Spark for batch, a streaming layer bolted on later), and then migrate toward Kappa as their streaming infrastructure matures.

CDC: A Natural Fit for Kappa

Change data capture (CDC) is one of the strongest use cases for Kappa architecture. CDC captures row-level changes from a database - inserts, updates, deletes - and publishes them as an ordered stream of events. This is exactly the kind of data that Kappa is designed to handle.

With CDC, your source database’s transaction log becomes the input to your Kappa pipeline. Every change is captured as an event, written to Kafka, and processed by your streaming jobs. Because CDC events are inherently ordered and represent the complete history of changes, they map perfectly onto Kappa’s replayable log model.

This pattern is especially powerful for:

  • Keeping analytics in sync with operational databases. Instead of nightly batch extracts, CDC streams changes to your warehouse or lake in near real time.
  • Building materialized views. A streaming job reads CDC events and maintains a denormalized, query-optimized view that stays current as the source data changes.
  • Event sourcing. CDC turns your existing relational database into an event source without requiring your application to change how it writes data.
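
A materialized view driven by CDC can be sketched in a few lines. The event shape below loosely follows Debezium-style change events but is simplified and hypothetical:

```python
# A minimal sketch of a CDC-driven materialized view: row-level change
# events (op = insert/update/delete) keep a keyed view current.
def apply_cdc(view, event):
    key = event["key"]
    if event["op"] == "delete":
        view.pop(key, None)     # delete removes the row from the view
    else:
        view[key] = event["row"]  # insert/update carry the full new row

view = {}
changes = [
    {"op": "insert", "key": 1, "row": {"name": "Ada",   "plan": "free"}},
    {"op": "update", "key": 1, "row": {"name": "Ada",   "plan": "pro"}},
    {"op": "insert", "key": 2, "row": {"name": "Grace", "plan": "free"}},
    {"op": "delete", "key": 2, "row": None},
]
for c in changes:               # replaying the full change history
    apply_cdc(view, c)          # reconstructs the exact current state

assert view == {1: {"name": "Ada", "plan": "pro"}}
```

Because the change stream is the complete, ordered history of the table, replaying it from the beginning always rebuilds the current state, which is the Kappa replay property applied to a database.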

Streamkap is built around this pattern. It captures changes from databases like PostgreSQL, MySQL, and MongoDB via CDC, streams them through Kafka, and delivers them to destinations like Snowflake, BigQuery, and ClickHouse. This is Kappa architecture in practice - a single streaming pipeline replacing batch ETL extracts, with the database’s own transaction log as the immutable event source.

The fit between CDC and Kappa is not a coincidence. Jay Kreps’ original motivation for building Kafka at LinkedIn was precisely this problem: capturing database changes as a stream and making them available to downstream consumers in real time.

The Typical Stack: Kafka and Flink

The most common technology stack for Kappa architecture today is Apache Kafka for the log layer and Apache Flink for stream processing.

Kafka’s role:

  • Durable, partitioned, ordered log for all incoming events.
  • Consumer groups allow multiple independent processors to read the same data.
  • Retention policies (time-based, size-based, or tiered storage) control how far back you can replay.
  • Kafka Connect provides a large ecosystem of source and sink connectors for getting data in and out.

Flink’s role:

  • Stateful stream processing with exactly-once semantics.
  • Event-time processing with watermarks, so late-arriving data is handled correctly.
  • Savepoints allow you to snapshot job state, update your code, and resume from the snapshot - useful for Kappa-style reprocessing without replaying from scratch.
  • SQL interface (Flink SQL) lets you express many transformations declaratively, lowering the barrier for analysts and data engineers who prefer SQL over Java or Python.
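
The event-time/watermark idea in the list above can be illustrated with a toy Python simulation (this is the concept, not Flink's API): events carry timestamps, the watermark trails the highest timestamp seen by an allowed-lateness bound, and a window only fires once the watermark passes its end, so late-but-in-bound data is still counted:

```python
# Toy event-time windowing with a watermark (not real Flink APIs).
ALLOWED_LATENESS = 2
WINDOW_END = 10          # the window covers event times [0, 10)

# Events arrive out of order: (event_time, value). The ts-9 event arrives
# after a ts-11 event, i.e. it is "late" in processing order.
arrivals = [(3, "a"), (7, "b"), (11, "w2"), (9, "late"), (13, "w2b")]

window, watermark, fired = [], 0, False
for ts, value in arrivals:
    watermark = max(watermark, ts - ALLOWED_LATENESS)
    if ts < WINDOW_END:
        window.append(value)          # late event still lands in its window
    if not fired and watermark >= WINDOW_END:
        fired = True                  # watermark passed window end: emit

assert fired and window == ["a", "b", "late"]
```

Note how the window does not fire when the ts-11 event arrives (the watermark is still 9), which is what gives the late ts-9 event a chance to be included before results are emitted.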

A basic Kappa pipeline with this stack looks like:

Source DB → CDC Connector → Kafka Topic → Flink Job → Serving Store

For reprocessing, you either replay from the Kafka topic’s earliest offset or restore a Flink savepoint with updated job logic.

One practical consideration: Flink savepoints can make reprocessing faster than replaying from scratch. If your logic change only affects a subset of your transformations, you can take a savepoint of the running job, deploy the updated version from that savepoint, and only reprocess the affected state. This is a significant advantage over pure log replay for large-scale systems.

Limitations and Trade-offs

Kappa architecture is simpler, but it is not without trade-offs. Being honest about these helps you decide if it is the right fit.

Reprocessing speed. Replaying months of events through a streaming job is slower than running a batch job over a columnar file format like Parquet. Batch engines are optimized for scanning large datasets; streaming engines are optimized for processing one event at a time (or in micro-batches). For very large reprocessing jobs, this difference matters.

Log retention costs. Even with tiered storage, retaining months or years of raw events has a cost. You need to plan your retention policies carefully and consider whether you need the full raw events or whether compacted topics (which retain only the latest value per key) are sufficient.
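
Log compaction can be sketched as follows. This is a simplification with hypothetical record shapes: real Kafka compaction runs incrementally on segments and retains tombstones for a configurable period before dropping them, but the end state is the same idea:

```python
# A small sketch of log compaction: a compacted topic keeps only the
# latest value per key, so the log stays bounded while a full replay
# still reconstructs current state.
raw_log = [
    ("user:1", "free"),
    ("user:2", "free"),
    ("user:1", "pro"),    # supersedes the earlier user:1 record
    ("user:2", None),     # tombstone: user:2 was deleted
]

def compact(log):
    latest = {}
    for key, value in log:           # later records win per key
        latest[key] = value
    # keep only the final record per key; drop tombstoned keys entirely
    return [(k, v) for k, v in latest.items() if v is not None]

compacted = compact(raw_log)
assert compacted == [("user:1", "pro")]
assert len(compacted) <= len(raw_log)   # the retention-cost win
```

For pipelines that only need current state per key (as in the CDC materialized-view case), this is often enough, and it caps retention cost regardless of how long the topic lives.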

Operational complexity of reprocessing. The deploy-new-job, wait-for-catchup, switch-consumers pattern is conceptually simple but operationally nuanced. You need monitoring for consumer lag, a strategy for handling the period where two versions of your output exist, and a rollback plan if the new version produces incorrect results.

Not every workload is a stream. Some processing is inherently batch-oriented. Training an ML model on a full dataset snapshot, generating a monthly compliance report, or running a complex graph algorithm - these workloads do not map naturally to stream processing, and forcing them into a streaming framework adds complexity rather than removing it.

State management at scale. Stateful streaming jobs (windowed aggregations, joins, pattern matching) maintain state that can grow large. Managing this state - checkpointing, recovery, rebalancing across partitions - requires careful tuning and monitoring. Flink handles this well, but it is not zero-effort.

Making the Choice

The decision between Lambda and Kappa often comes down to one question: are your batch logic and your streaming logic the same, or different?

If it is the same - if you are doing the same filters, joins, aggregations, and enrichments regardless of whether the data is “historical” or “real-time” - then maintaining two implementations is pure overhead. Kappa removes that overhead.

If your batch layer does something fundamentally different from your streaming layer - complex analytics, ML training, large-scale joins that a stream processor cannot handle efficiently - then Lambda’s separation is justified.

For many modern data engineering teams, especially those building CDC-based pipelines that stream database changes to analytics destinations, Kappa is the natural choice. The data already arrives as a stream of events. The processing logic is the same whether you are catching up on a backlog or processing the latest changes. And the operational simplicity of maintaining one pipeline instead of two pays dividends every time you need to update, debug, or scale your system.

The shift from Lambda to Kappa reflects a broader trend in data engineering: treating real-time as the default, not the exception. As streaming infrastructure - Kafka, Flink, managed platforms like Streamkap - continues to mature, the cases where you genuinely need a separate batch layer keep shrinking. That does not mean batch is dead. It means that for a growing number of workloads, the streaming path is simpler, faster, and good enough.