
Engineering

February 25, 2026

13 min read

Migrating from Batch to Streaming: A Practical Playbook

A step-by-step guide for teams moving from batch ETL to streaming pipelines - covering readiness assessment, parallel running, validation, and common pitfalls.

TL;DR:

  • Not every batch pipeline should be migrated to streaming - start with the 20% of pipelines that deliver 80% of the business value of freshness.
  • Running batch and streaming in parallel before cutting over is the safest migration strategy and the step most teams skip.
  • The initial load problem is the hardest part of migration and must be planned before writing a single line of streaming code.
  • Common pitfalls include migrating everything at once, underestimating the skills gap, and forgetting that streaming has a different cost structure than batch.

Introduction

Batch ETL is the default state for most data teams. Nightly jobs extract from operational databases, transform in a staging layer, load into a warehouse, and refresh dashboards. For years this was acceptable. Today the business tolerance for stale data is collapsing - marketing wants to know which campaigns are working in the last hour, not last night; operations wants to see inventory levels update as warehouse scans happen, not tomorrow morning; finance wants to close the books continuously, not monthly.

The response is a migration from batch to streaming. But migrations done poorly are expensive, destabilizing, and often abandoned halfway through. This playbook is a practical guide for doing it right: assessing what to migrate, running batch and streaming in parallel, validating equivalence, managing the skills transition, and avoiding the traps that stall most teams.


Step 1: Assess Readiness - What to Migrate and When

The first and most important question is not “how do we migrate?” but “which pipelines should we migrate, and in what order?”

The Freshness Value Matrix

Not all pipelines benefit equally from streaming. Evaluate each pipeline across two dimensions:

  • Business value of freshness - How much does the business benefit if this data is 5 minutes fresh instead of 12 hours fresh? Revenue impact? Decision quality? Customer experience?
  • Migration complexity - How difficult is this pipeline to migrate? Consider: number of source systems, join complexity, downstream consumers, SLA sensitivity.

Map each pipeline to this matrix:

| Quadrant | Business Value | Complexity | Action |
| --- | --- | --- | --- |
| High value, low complexity | High | Low | Migrate first - quick wins |
| High value, high complexity | High | High | Migrate second - requires planning |
| Low value, low complexity | Low | Low | Migrate opportunistically |
| Low value, high complexity | Low | High | Do not migrate - not worth the cost |

Most teams discover that 20% of their pipelines account for 80% of the business value of freshness. Start there.
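The matrix above can be sketched as a simple scoring function. The pipeline names, 1–10 scores, and the threshold below are illustrative assumptions, not a prescribed scale:

```python
# Sketch: scoring pipelines into the freshness-value matrix.
# Pipeline names, 1-10 scores, and the threshold are illustrative.

def quadrant(value: int, complexity: int, threshold: int = 5) -> str:
    """Map freshness value and migration complexity scores to an action."""
    if value > threshold and complexity <= threshold:
        return "migrate first"
    if value > threshold:
        return "migrate second"
    if complexity <= threshold:
        return "migrate opportunistically"
    return "do not migrate"

pipelines = [
    ("orders_to_warehouse", 9, 3),  # (name, freshness value, complexity)
    ("inventory_sync",      8, 8),
    ("annual_reporting",    2, 9),
]
for name, value, complexity in pipelines:
    print(f"{name}: {quadrant(value, complexity)}")
```

The point is less the scoring than forcing an explicit ranking: a shared score sheet ends debates about which pipeline goes first.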

Readiness Assessment Checklist

Before migrating any pipeline, confirm:

  • The source system supports CDC or event streaming (has a transaction log, Kafka topic, or webhook)
  • The downstream consumers can tolerate the output format change (streaming vs batch files)
  • Monitoring for pipeline lag is in place, not just job health
  • A rollback plan exists (keeping the batch pipeline active in parallel)
  • Team has basic Kafka and stream processing familiarity
  • Data retention policy for the streaming layer is defined

Step 2: Solve the Initial Load Problem First

This is the hardest part of the migration and the one that causes the most pain when deferred. CDC streaming only captures changes from the moment it starts. Before your streaming pipeline goes live, you need to ensure it has access to the historical state of the source data.

The Three Approaches

Approach 1: Connector-managed snapshot (preferred)

Most production CDC tools handle this automatically. When you connect to a source database, the connector performs a full table scan and emits every existing row as an insert event, then switches to reading the transaction log. Your consumer treats the snapshot records the same as new inserts - no special handling required.

Phase 1: Full snapshot
  connector reads all rows → emits as INSERT events → consumer builds initial state

Phase 2: Incremental CDC
  connector reads binlog/WAL → emits INSERT/UPDATE/DELETE → consumer applies changes

This approach works well for tables up to a few hundred GB. For very large tables (TB scale), the snapshot can take hours and hold a read lock on the source, which may be unacceptable for production databases.
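As a sketch of why "no special handling" works downstream: Debezium-style change events carry an op field ("r" for snapshot reads, "c"/"u"/"d" for live creates, updates, and deletes), so a single code path can apply both phases. The event shape below is a simplified assumption, not the full Debezium envelope:

```python
# Sketch: a consumer that applies Debezium-style change events uniformly.
# Snapshot rows arrive with op "r" and are treated like inserts, so the
# same logic handles both the initial snapshot and the incremental stream.

state = {}  # primary key -> current row

def apply_event(event: dict) -> None:
    op = event["op"]
    if op in ("r", "c", "u"):          # snapshot read, create, update
        row = event["after"]
        state[row["id"]] = row
    elif op == "d":                    # delete
        state.pop(event["before"]["id"], None)

events = [
    {"op": "r", "after": {"id": 1, "status": "shipped"}},   # from snapshot
    {"op": "c", "after": {"id": 2, "status": "pending"}},   # live insert
    {"op": "u", "after": {"id": 1, "status": "returned"}},  # live update
    {"op": "d", "before": {"id": 2}},                       # live delete
]
for e in events:
    apply_event(e)

print(state)  # {1: {'id': 1, 'status': 'returned'}}
```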

Approach 2: Historical backfill via separate batch job

Run a one-time batch export of the source table to the streaming destination (Kafka, S3, warehouse), then enable CDC streaming to pick up from the export point. Requires coordination between the batch export timestamp and the CDC start position in the transaction log - get this wrong and you will have a gap or duplicate records.

Approach 3: Dual-write transition period

Keep the batch pipeline writing to the destination while the streaming pipeline warms up on a shadow table. Once the streaming pipeline has caught up to the current state and validated equivalence, cut over. Adds operational complexity but provides the safest migration path for mission-critical pipelines.

Deduplication at the Seam

Regardless of approach, the boundary between the initial load and the incremental stream creates a deduplication challenge. Events emitted during the snapshot may overlap with CDC events at the tail end of the snapshot. Ensure your consumer is idempotent:

-- Idempotent upsert: last-write-wins by event timestamp
MERGE INTO target_table AS t
USING source_event AS s
ON t.primary_key = s.primary_key
WHEN MATCHED AND s.event_time > t.last_updated THEN
  UPDATE SET ...
WHEN NOT MATCHED THEN
  INSERT ...;

Step 3: Run Batch and Streaming in Parallel

This is the most important step in the migration and the one most teams skip in a rush to decommission the old pipeline. Running both pipelines simultaneously allows you to:

  • Validate that streaming output matches batch output
  • Identify edge cases that your streaming logic does not handle
  • Build confidence before cutting over downstream consumers
  • Maintain a fallback if streaming has problems

Parallel Architecture

Source Database

    ├── Batch Pipeline (existing)
    │       └── Target Table (batch)

    └── CDC Streaming Pipeline (new)
            └── Target Table (streaming)

Validation Job
    ├── Reads from: Target Table (batch)
    ├── Reads from: Target Table (streaming)
    └── Emits: Discrepancy report

Run both pipelines for a minimum of two weeks before cutting over. One week catches daily patterns; two weeks catches weekly patterns (different behavior on weekends, end-of-week batch jobs that affect source data).

What to Validate

A parallel validation job should compare:

| Check | How to Measure | Acceptable Threshold |
| --- | --- | --- |
| Row count equality | COUNT(*) per day partition | 0% difference |
| Key completeness | Missing PKs in streaming vs batch | 0 missing |
| Metric totals | SUM(revenue), COUNT(users) | < 0.01% difference |
| NULL rate by column | COUNT(*) FILTER (WHERE col IS NULL) | Match within 0.1% |
| Max/min values | Boundary checks per column | Exact match |
| Late arrival handling | Events arriving > 1 hour late | Accounted for in streaming |

Document and investigate every discrepancy, no matter how small. A 0.001% row count difference sounds trivial but represents real missing data that someone will eventually notice.


Step 4: Validate Data Equivalence

Data equivalence validation is a discipline, not a one-time check. Automate it and run it continuously during the parallel period.

Building a Validation Framework

A simple validation framework compares the two outputs daily:

def validate_pipeline_equivalence(batch_conn, stream_conn, table, date):
    batch_metrics = query(batch_conn, f"""
        SELECT
            COUNT(*)                    AS row_count,
            COUNT(DISTINCT customer_id) AS unique_customers,
            SUM(revenue)                AS total_revenue,
            MAX(event_time)             AS latest_event
        FROM {table}
        WHERE event_date = '{date}'
    """)

    stream_metrics = query(stream_conn, f"""
        SELECT
            COUNT(*)                    AS row_count,
            COUNT(DISTINCT customer_id) AS unique_customers,
            SUM(revenue)                AS total_revenue,
            MAX(event_time)             AS latest_event
        FROM {table}
        WHERE event_date = '{date}'
    """)

    discrepancies = []
    for metric in ['row_count', 'unique_customers', 'total_revenue']:
        b, s = batch_metrics[metric], stream_metrics[metric]
        if b != s:
            discrepancies.append({
                'metric': metric,
                'batch': b,
                'stream': s,
                'delta_pct': abs(b - s) / b * 100 if b else 100.0
            })

    return discrepancies

Run this daily and alert on any discrepancy above your threshold. Fix the root cause before it accumulates.

Common Equivalence Failures

| Failure | Root Cause | Fix |
| --- | --- | --- |
| Missing rows in streaming | CDC connector missed events during setup | Verify connector offset; re-snapshot if needed |
| Extra rows in streaming | Duplicate events from at-least-once delivery | Add deduplication on primary key + event time |
| Wrong aggregates | Timezone mismatch between batch and streaming | Normalize both to UTC |
| Late rows only in batch | Batch included late data; streaming used strict watermarks | Increase allowed lateness in streaming |
| NULL rate mismatch | Batch applied NULL coalescing; streaming did not | Apply same transformation in streaming layer |
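The deduplication fix for at-least-once delivery can be sketched as a last-write-wins buffer keyed on primary key + event time. The event shape here is an illustrative assumption:

```python
# Sketch: last-write-wins deduplication keyed on primary key + event time.
# Duplicate deliveries and replays of older versions are dropped.

def deduplicate(events):
    latest = {}  # primary key -> newest event seen for that key
    for e in events:
        key, ts = e["pk"], e["event_time"]
        if key not in latest or ts > latest[key]["event_time"]:
            latest[key] = e
    return list(latest.values())

events = [
    {"pk": 1, "event_time": 100, "status": "pending"},
    {"pk": 1, "event_time": 100, "status": "pending"},  # duplicate delivery
    {"pk": 1, "event_time": 105, "status": "shipped"},
    {"pk": 2, "event_time": 101, "status": "pending"},
]
print(deduplicate(events))  # one row per key, newest version wins
```

The same logic is what the SQL MERGE in Step 2 expresses declaratively; in a stream processor it lives in keyed state rather than an in-memory dict.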

Step 5: Manage the Skills Gap

Batch ETL and stream processing require genuinely different mental models. A data engineer who is expert in SQL, dbt, and Airflow may have never worked with Kafka consumer groups, event time, watermarks, or stateful operators. This is not a criticism - it is a structural gap that must be addressed deliberately.

Skills Required for Streaming

| Skill Area | Batch Equivalent | Streaming Concept |
| --- | --- | --- |
| Data movement | SQL INSERT INTO ... SELECT | Kafka consumer groups, offsets |
| Time-based logic | WHERE date = yesterday | Event time, watermarks, allowed lateness |
| Deduplication | SELECT DISTINCT or ROW_NUMBER() | Exactly-once semantics, idempotent sinks |
| Monitoring | Job completion status | Consumer lag, throughput, checkpoint duration |
| Failure recovery | Re-run the job | Checkpoint restore, offset reset |
| Schema management | dbt schema changes | Schema registry, compatibility modes |

Ramp-Up Path

A practical ramp-up for a batch-oriented data engineer:

  1. Week 1–2: Kafka fundamentals - producers, consumers, topics, partitions, consumer groups. Build a toy pipeline that reads from Kafka and writes to S3.
  2. Week 3–4: Stream SQL - write Flink SQL or KSQL queries that filter, aggregate, and join event streams. Focus on tumbling windows and simple enrichment.
  3. Week 5–6: Event time and watermarks - understand why processing time is wrong for analytics, how watermarks work, and how to handle late data.
  4. Week 7–8: Stateful processing - keyed state, exactly-once semantics, checkpointing. Build a pipeline with per-key aggregation.
  5. Month 3: Operate a production pipeline - monitoring consumer lag, responding to incidents, performing upgrades without data loss.

Step 6: Cost Comparison and Surprises

Streaming has a different cost structure than batch. Teams often discover this after migration.

Batch vs Streaming Cost Comparison

| Cost Component | Batch | Streaming |
| --- | --- | --- |
| Compute | Intermittent (runs during job) | Continuous (always running) |
| Storage | Large files, efficient compression | Small messages, more metadata overhead |
| Network egress | One large transfer per cycle | Continuous small transfers |
| Infrastructure | Spin up and down with jobs | Persistent cluster required |
| Operational | Less frequent; issues found at run time | More frequent; issues surface in real time |

For many pipelines, streaming is more expensive on a pure infrastructure basis. The justification is business value: the cost of stale data (missed fraud, slow customer response, wrong inventory decisions) typically exceeds the incremental infrastructure cost.
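A back-of-envelope calculation illustrates the structural difference. The instance rates and runtimes below are assumptions for illustration, not benchmarks:

```python
# Back-of-envelope compute cost comparison. Hourly rates, instance
# sizes, and runtimes are illustrative assumptions, not benchmarks.

batch_hours_per_day = 2       # nightly job runtime on a large instance
batch_rate = 4.00             # $/hour

stream_hours_per_day = 24     # always-on pipeline on a smaller instance
stream_rate = 0.50            # $/hour

batch_monthly = batch_hours_per_day * batch_rate * 30
stream_monthly = stream_hours_per_day * stream_rate * 30

print(f"batch:     ${batch_monthly:.2f}/month")    # $240.00
print(f"streaming: ${stream_monthly:.2f}/month")   # $360.00
```

Even with a much smaller instance, the always-on pipeline can cost more - which is why the justification has to come from the value of freshness, not from infrastructure savings.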

Use managed streaming platforms to control costs. A self-managed Kafka + Flink cluster for a small pipeline can cost more in engineering time than in infrastructure. Managed solutions like Streamkap consolidate the connector, processing, and delivery layers into a single cost center with predictable pricing.


Migration Prioritization Framework

Use this framework to sequence your migration backlog:

Tier 1: Migrate in the Next Quarter

  • Pipelines where data freshness directly impacts revenue (pricing, inventory, campaign spend)
  • Pipelines with simple logic (filter, rename, route) and one source/one destination
  • Pipelines where the downstream consumer already supports streaming input

Tier 2: Migrate in 6–12 Months

  • Pipelines with multiple source joins that require stateful enrichment
  • Pipelines where downstream consumers need to be updated to handle streaming
  • Pipelines with SLA obligations that require careful cutover planning

Tier 3: Do Not Migrate

  • Batch-only sources (files deposited by third parties on a schedule)
  • Pipelines where freshness does not affect any decision (annual regulatory reports, archived data)
  • Pipelines with extremely complex business logic that is deeply embedded in batch SQL and would require a full rewrite

Common Pitfalls

Pitfall 1: Migrating Everything at Once

The most common failure mode. Teams decide to “go streaming” as a platform initiative and attempt to migrate 50 pipelines in parallel. Each migration uncovers issues that require time to resolve. The team is spread thin, nothing ships on time, and confidence in the initiative collapses.

Fix: Migrate one pipeline at a time. Build a repeatable process on the first migration before scaling.

Pitfall 2: Skipping the Parallel Run

Teams run both batch and streaming for a few days, see “roughly the same numbers,” and cut over. Two months later an analyst notices a 2% discrepancy in monthly revenue that traces back to a timezone bug in the streaming pipeline.

Fix: Two-week minimum parallel run with automated daily equivalence checks.

Pitfall 3: Forgetting Downstream Dependencies

Batch pipelines often have downstream consumers that depend on the output being complete (all records for the day available by 6am). Streaming pipelines produce a continuous trickle - “completeness” is no longer defined the same way.

Fix: Audit every downstream consumer before cutover. Decide whether they need a “completeness checkpoint” (a watermark or explicit signal that all events for a time window have arrived) or whether they can be adapted to query incrementally.
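One way to implement a completeness checkpoint is to declare a window complete only once the watermark has passed the window end plus an allowed-lateness margin. A minimal sketch with illustrative timestamps:

```python
# Sketch: a completeness checkpoint derived from a watermark. A window
# is declared complete only once the watermark has passed the window
# end plus an allowed-lateness margin. Timestamps are illustrative.

def window_complete(watermark: float, window_end: float,
                    allowed_lateness: float = 3600.0) -> bool:
    """True once no more events are expected for the window."""
    return watermark >= window_end + allowed_lateness

window_end = 1_700_000_000.0  # hypothetical end of the day's window
print(window_complete(1_700_001_000.0, window_end))  # False - still in margin
print(window_complete(1_700_004_000.0, window_end))  # True - safe to signal
```

The "signal" itself can be as simple as a row in a completeness table that downstream jobs poll before running.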

Pitfall 4: Using Processing Time Instead of Event Time

A team migrates a daily aggregate pipeline to streaming and is delighted by the low latency - until they notice that records arriving late (due to system outages at the source) are silently excluded from the wrong day’s aggregation.

Fix: Always use event time for business aggregations. Set allowed lateness to at least the maximum expected delay from the source system.
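The event-time mechanics can be sketched in plain Python rather than a stream processor. Window size, lateness, and timestamps below are illustrative assumptions:

```python
# Sketch: event-time tumbling windows with allowed lateness.
# The watermark trails the max event time seen; an event is dropped only
# when its window has already been finalized past the lateness margin.

WINDOW = 60            # window size, seconds
ALLOWED_LATENESS = 30  # how far behind the watermark an event may arrive

counts = {}            # window start -> event count
max_seen = 0           # highest event time observed so far

def process(event_time: int) -> None:
    global max_seen
    max_seen = max(max_seen, event_time)
    watermark = max_seen - ALLOWED_LATENESS
    start = event_time - event_time % WINDOW
    if start + WINDOW > watermark:          # window not yet finalized
        counts[start] = counts.get(start, 0) + 1
    # else: too late - drop, or route to a side output for reconciliation

for t in [10, 50, 70, 55, 130, 5]:          # 55 is late but in margin; 5 is too late
    process(t)

print(counts)  # {0: 3, 60: 1, 120: 1}
```

Note that event 55 still lands in the correct window despite arriving after event 70 - that is exactly what a processing-time pipeline gets wrong.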

Pitfall 5: No Monitoring for Lag

Batch pipelines fail visibly - the job does not complete. Streaming pipelines fail silently - the consumer lags behind but continues running. No one notices until the lag is hours or days and the pipeline is effectively batch again.

Fix: Alert on consumer lag, not just job health. A lag threshold of 15 minutes is a reasonable starting point for most pipelines.
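Consumer lag is simply the gap between the log end offset and the consumer group's committed offset, per partition. A sketch with illustrative offsets (in production these numbers come from the Kafka admin API or a lag exporter):

```python
# Sketch: computing consumer lag per partition and alerting on a
# threshold. Offsets here are illustrative; in production they come
# from the Kafka admin API or a lag exporter.

def partition_lag(end_offsets: dict, committed: dict) -> dict:
    """Lag per partition: log end offset minus last committed offset."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

end_offsets = {0: 10_500, 1: 9_800, 2: 11_200}   # latest offset per partition
committed   = {0: 10_400, 1: 9_800, 2:  4_000}   # consumer group's position

lag = partition_lag(end_offsets, committed)
alerts = {p: n for p, n in lag.items() if n > 1_000}
print(lag)      # {0: 100, 1: 0, 2: 7200}
print(alerts)   # {2: 7200}
```

Alerting per partition matters: a single stuck partition (like partition 2 here) can hide behind a healthy-looking average.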


Summary

Migrating from batch to streaming is not a single project - it is an ongoing capability shift. The teams that succeed approach it as a sequence of contained migrations, each one building process and confidence before the next.

The five practices that separate successful migrations from failed ones: (1) choosing the right pipelines to migrate first based on business value versus complexity, (2) solving the initial load problem before enabling CDC, (3) running batch and streaming in parallel for at least two weeks with automated equivalence checks, (4) investing in the skills gap deliberately rather than assuming batch engineers will self-educate, and (5) monitoring consumer lag as a first-class production metric.

The reward for doing this carefully is a data platform that reflects the current state of the business - not the state from 12 hours ago.

For a deeper look at the CDC mechanics that power most streaming migrations, see the guide on Change Data Capture fundamentals. For the data preparation work that follows ingestion, see real-time data preparation for analytics.