Migrating from Batch to Streaming: A Practical Playbook
A step-by-step guide for teams moving from batch ETL to streaming pipelines - covering readiness assessment, parallel running, validation, and common pitfalls.
Introduction
Batch ETL is the default state for most data teams. Nightly jobs extract from operational databases, transform in a staging layer, load into a warehouse, and refresh dashboards. For years this was acceptable. Today the business tolerance for stale data is collapsing: marketing wants to know which campaigns are working in the last hour, not last night; operations wants inventory levels to update as warehouse scans happen, not tomorrow morning; finance wants to close the books continuously, not monthly.
The response is a migration from batch to streaming. But migrations done poorly are expensive, destabilizing, and often abandoned halfway through. This playbook is a practical guide for doing it right: assessing what to migrate, running batch and streaming in parallel, validating equivalence, managing the skills transition, and avoiding the traps that stall most teams.
Step 1: Assess Readiness - What to Migrate and When
The first and most important question is not “how do we migrate?” but “which pipelines should we migrate, and in what order?”
The Freshness Value Matrix
Not all pipelines benefit equally from streaming. Evaluate each pipeline across two dimensions:
- Business value of freshness - How much does the business benefit if this data is 5 minutes fresh instead of 12 hours fresh? Revenue impact? Decision quality? Customer experience?
- Migration complexity - How difficult is this pipeline to migrate? Consider: number of source systems, join complexity, downstream consumers, SLA sensitivity.
Map each pipeline to this matrix:
| Quadrant | Business Value | Complexity | Action |
|---|---|---|---|
| High value, low complexity | High | Low | Migrate first - quick wins |
| High value, high complexity | High | High | Migrate second - requires planning |
| Low value, low complexity | Low | Low | Migrate opportunistically |
| Low value, high complexity | Low | High | Do not migrate - not worth the cost |
Most teams discover that 20% of their pipelines account for 80% of the business value of freshness. Start there.
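The matrix can be encoded as a simple classifier. A minimal sketch in Python; the 1–5 scoring scale, the thresholds, and the example pipelines are illustrative assumptions, not part of the playbook:

```python
# Illustrative sketch: classify pipelines into the freshness-value matrix.
# The 1-5 scores and the >= 4 thresholds are assumptions you would
# calibrate with stakeholders, not fixed rules.

def quadrant(value: int, complexity: int) -> str:
    """Map freshness value and migration complexity to a recommended action."""
    high_value = value >= 4          # assumed threshold on a 1-5 scale
    high_complexity = complexity >= 4
    if high_value and not high_complexity:
        return "migrate first"
    if high_value and high_complexity:
        return "migrate second"
    if not high_value and not high_complexity:
        return "migrate opportunistically"
    return "do not migrate"

# Hypothetical pipelines scored as (value, complexity):
pipelines = {
    "campaign_spend": (5, 2),   # revenue-linked, simple filter/route
    "inventory_sync": (5, 5),   # revenue-linked, multi-source joins
    "annual_report":  (1, 5),   # no freshness benefit, deeply embedded SQL
}
for name, (v, c) in pipelines.items():
    print(f"{name}: {quadrant(v, c)}")
```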
Readiness Assessment Checklist
Before migrating any pipeline, confirm:
- The source system supports CDC or event streaming (has a transaction log, Kafka topic, or webhook)
- The downstream consumers can tolerate the output format change (streaming vs batch files)
- Monitoring for pipeline lag is in place, not just job health
- A rollback plan exists (keeping the batch pipeline active in parallel)
- Team has basic Kafka and stream processing familiarity
- Data retention policy for the streaming layer is defined
Step 2: Solve the Initial Load Problem First
This is the step most teams skip, and the one that causes the most pain. CDC streaming only captures changes from the moment it starts. Before your streaming pipeline goes live, you need to ensure it has access to the historical state of the source data.
The Three Approaches
Approach 1: Connector-managed snapshot (preferred)
Most production CDC tools handle this automatically. When you connect to a source database, the connector performs a full table scan and emits every existing row as an insert event, then switches to reading the transaction log. Your consumer treats the snapshot records the same as new inserts - no special handling required.
Phase 1: Full snapshot
  connector reads all rows → emits as INSERT events → consumer builds initial state

Phase 2: Incremental CDC
  connector reads binlog/WAL → emits INSERT/UPDATE/DELETE → consumer applies changes
This approach works well for tables up to a few hundred GB. For very large tables (TB scale), the snapshot can take hours and hold a read lock on the source, which may be unacceptable for production databases.
Approach 2: Historical backfill via separate batch job
Run a one-time batch export of the source table to the streaming destination (Kafka, S3, warehouse), then enable CDC streaming to pick up from the export point. Requires coordination between the batch export timestamp and the CDC start position in the transaction log - get this wrong and you will have a gap or duplicate records.
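The coordination rule is: record the log position first, then export. A toy in-memory model makes the ordering concrete; the table, change log, and `write` helper are stand-ins for illustration, not a real connector API:

```python
# Toy model of Approach 2: record the log position BEFORE taking the
# batch export, then replay CDC strictly from that recorded position.
# All structures here are in-memory stand-ins, not a real CDC tool.

table = {1: "a", 2: "b"}   # live source table: id -> value
change_log = []            # append-only transaction log: (position, key, value)

def write(key, value):
    """Apply a write to the source and append it to the transaction log."""
    change_log.append((len(change_log), key, value))
    table[key] = value

# 1. Record the current log position first...
export_position = len(change_log)
# 2. ...then take the batch export (a consistent snapshot of the table).
batch_export = dict(table)

# A write lands while the export is being loaded downstream:
write(3, "c")

# 3. Replay CDC from the recorded position - no gap, no duplicates.
replica = dict(batch_export)
for pos, key, value in change_log:
    if pos >= export_position:
        replica[key] = value

assert replica == table  # replica converges to the live source state
```

Reversing steps 1 and 2 is exactly the bug the text warns about: any write that lands between the export and the position capture is silently lost.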
Approach 3: Dual-write transition period
Keep the batch pipeline writing to the destination while the streaming pipeline warms up on a shadow table. Once the streaming pipeline has caught up to the current state and validated equivalence, cut over. Adds operational complexity but provides the safest migration path for mission-critical pipelines.
Deduplication at the Seam
Regardless of approach, the boundary between the initial load and the incremental stream creates a deduplication challenge. Events emitted during the snapshot may overlap with CDC events at the tail end of the snapshot. Ensure your consumer is idempotent:
-- Idempotent upsert: last-write-wins by event timestamp
MERGE INTO target_table AS t
USING source_event AS s
  ON t.primary_key = s.primary_key
WHEN MATCHED AND s.event_time > t.last_updated THEN
  UPDATE SET ...
WHEN NOT MATCHED THEN
  INSERT ...;
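For sinks without MERGE support, the same last-write-wins rule can live in the consumer. A minimal sketch; the `(pk, event_time, payload)` event shape is an assumption for illustration:

```python
# Minimal last-write-wins upsert state, keyed by primary key.
# Duplicate deliveries at the snapshot/CDC seam become no-ops.

state = {}  # pk -> (event_time, payload)

def apply_event(pk, event_time, payload):
    """Apply the event only if it is newer than what we already hold."""
    current = state.get(pk)
    if current is None or event_time > current[0]:
        state[pk] = (event_time, payload)

# Snapshot emits row 42, then the CDC tail re-delivers the same event:
apply_event(42, 100, {"status": "active"})
apply_event(42, 100, {"status": "active"})   # duplicate at the seam: ignored
apply_event(42, 105, {"status": "churned"})  # genuinely newer: applied

print(state[42])  # (105, {'status': 'churned'})
```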
Step 3: Run Batch and Streaming in Parallel
This is the most important step in the migration, and the one teams are most tempted to shortcut in a rush to decommission the old pipeline. Running both pipelines simultaneously allows you to:
- Validate that streaming output matches batch output
- Identify edge cases that your streaming logic does not handle
- Build confidence before cutting over downstream consumers
- Maintain a fallback if streaming has problems
Parallel Architecture
Source Database
│
├── Batch Pipeline (existing)
│ └── Target Table (batch)
│
└── CDC Streaming Pipeline (new)
└── Target Table (streaming)
Validation Job
├── Reads from: Target Table (batch)
├── Reads from: Target Table (streaming)
└── Emits: Discrepancy report
Run both pipelines for a minimum of two weeks before cutting over. One week catches daily patterns; two weeks catches weekly patterns (different behavior on weekends, end-of-week batch jobs that affect source data).
What to Validate
A parallel validation job should compare:
| Check | How to Measure | Acceptable Threshold |
|---|---|---|
| Row count equality | COUNT(*) per day partition | 0% difference |
| Key completeness | Missing PKs in streaming vs batch | 0 missing |
| Metric totals | SUM(revenue), COUNT(users) | < 0.01% difference |
| NULL rate by column | COUNT(*) FILTER (WHERE col IS NULL) | Match within 0.1% |
| Max/min values | Boundary checks per column | Exact match |
| Late arrival handling | Events arriving > 1 hour late | Accounted for in streaming |
Document and investigate every discrepancy, no matter how small. A 0.001% row count difference sounds trivial but represents real missing or duplicated rows that someone will eventually notice.
Step 4: Validate Data Equivalence
Data equivalence validation is a discipline, not a one-time check. Automate it and run it continuously during the parallel period.
Building a Validation Framework
A simple validation framework compares the two outputs daily:
def validate_pipeline_equivalence(batch_conn, stream_conn, table, date):
    # Identical aggregate query, run against both outputs
    metrics_sql = f"""
        SELECT
            COUNT(*) AS row_count,
            COUNT(DISTINCT customer_id) AS unique_customers,
            SUM(revenue) AS total_revenue,
            MAX(event_time) AS latest_event
        FROM {table}
        WHERE event_date = '{date}'
    """
    batch_metrics = query(batch_conn, metrics_sql)
    stream_metrics = query(stream_conn, metrics_sql)

    discrepancies = []
    for metric in ['row_count', 'unique_customers', 'total_revenue']:
        b, s = batch_metrics[metric], stream_metrics[metric]
        if b != s:
            discrepancies.append({
                'metric': metric,
                'batch': b,
                'stream': s,
                'delta_pct': abs(b - s) / b * 100 if b else None,
            })
    return discrepancies
Run this daily and alert on any discrepancy above your threshold. Fix the root cause before it accumulates.
Common Equivalence Failures
| Failure | Root Cause | Fix |
|---|---|---|
| Missing rows in streaming | CDC connector missed events during setup | Verify connector offset; re-snapshot if needed |
| Extra rows in streaming | Duplicate events from at-least-once delivery | Add deduplication on primary key + event time |
| Wrong aggregates | Timezone mismatch between batch and streaming | Normalize both to UTC |
| Late rows only in batch | Batch included late data; streaming used strict watermarks | Increase allowed lateness in streaming |
| NULL rate mismatch | Batch applied NULL coalescing; streaming did not | Apply same transformation in streaming layer |
Step 5: Manage the Skills Gap
Batch ETL and stream processing require genuinely different mental models. A data engineer who is expert in SQL, dbt, and Airflow may have never worked with Kafka consumer groups, event time, watermarks, or stateful operators. This is not a criticism - it is a structural gap that must be addressed deliberately.
Skills Required for Streaming
| Skill Area | Batch Equivalent | Streaming Concept |
|---|---|---|
| Data movement | SQL INSERT INTO ... SELECT | Kafka consumer groups, offsets |
| Time-based logic | WHERE date = yesterday | Event time, watermarks, allowed lateness |
| Deduplication | SELECT DISTINCT or ROW_NUMBER() | Exactly-once semantics, idempotent sinks |
| Monitoring | Job completion status | Consumer lag, throughput, checkpoint duration |
| Failure recovery | Re-run the job | Checkpoint restore, offset reset |
| Schema management | dbt schema changes | Schema registry, compatibility modes |
Ramp-Up Path
A practical ramp-up for a batch-oriented data engineer:
- Week 1–2: Kafka fundamentals - producers, consumers, topics, partitions, consumer groups. Build a toy pipeline that reads from Kafka and writes to S3.
- Week 3–4: Stream SQL - write Flink SQL or KSQL queries that filter, aggregate, and join event streams. Focus on tumbling windows and simple enrichment.
- Week 5–6: Event time and watermarks - understand why processing time is wrong for analytics, how watermarks work, and how to handle late data.
- Week 7–8: Stateful processing - keyed state, exactly-once semantics, checkpointing. Build a pipeline with per-key aggregation.
- Month 3: Operate a production pipeline - monitoring consumer lag, responding to incidents, performing upgrades without data loss.
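To make the event-time idea from weeks 5–6 concrete before touching Flink, here is a dependency-free sketch of a tumbling event-time window; the one-hour window size and the event tuples are illustrative:

```python
from collections import defaultdict

# Tumbling one-hour windows keyed by EVENT time, not arrival time.
# Pure-Python sketch of the concept - a real job would use Flink or KSQL.
WINDOW_SECONDS = 3600

def window_start(event_time: int) -> int:
    """Floor the event's own timestamp to its window; arrival order is irrelevant."""
    return event_time - (event_time % WINDOW_SECONDS)

counts = defaultdict(int)
# (event_time, user) pairs arriving OUT of order, as real streams do:
events = [(3600, "a"), (7205, "b"), (3700, "c"), (7300, "a")]
for event_time, user in events:
    counts[window_start(event_time)] += 1

print(dict(counts))  # {3600: 2, 7200: 2}
```

Because the window key comes from the event's own timestamp, the late-arriving `(3700, "c")` still lands in the correct hour, which is exactly what processing-time windowing gets wrong.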
Step 6: Cost Comparison and Surprises
Streaming has a different cost structure than batch. Teams often discover this after migration.
Batch vs Streaming Cost Comparison
| Cost Component | Batch | Streaming |
|---|---|---|
| Compute | Intermittent (runs during job) | Continuous (always running) |
| Storage | Large files, efficient compression | Small messages, more metadata overhead |
| Network egress | One large transfer per cycle | Continuous small transfers |
| Infrastructure | Spin up and down with jobs | Persistent cluster required |
| Operational | Less frequent; issues found at run time | More frequent; issues surface in real time |
For many pipelines, streaming is more expensive on a pure infrastructure basis. The justification is business value: the cost of stale data (missed fraud, slow customer response, wrong inventory decisions) typically exceeds the incremental infrastructure cost.
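The shape of the trade-off is easy to put in numbers. A back-of-envelope sketch; every rate and duration below is a made-up assumption to be replaced with your own cloud pricing:

```python
# Back-of-envelope compute comparison: intermittent batch vs always-on
# streaming. All prices and durations are illustrative assumptions.
HOURS_PER_MONTH = 730

batch_hours_per_day = 1.5      # nightly job runtime (assumed)
batch_rate = 4.00              # $/hr, large short-lived batch cluster (assumed)
batch_monthly = batch_hours_per_day * 30 * batch_rate

stream_rate = 0.60             # $/hr, small always-on cluster (assumed)
stream_monthly = HOURS_PER_MONTH * stream_rate

print(f"batch:     ${batch_monthly:,.0f}/mo")    # $180/mo
print(f"streaming: ${stream_monthly:,.0f}/mo")   # $438/mo
```

Even with a much smaller cluster, the always-on pipeline costs more per month; the question the text poses is whether the business value of freshness covers that gap.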
Use managed streaming platforms to control costs. A self-managed Kafka + Flink cluster for a small pipeline can cost more in engineering time than in infrastructure. Managed solutions like Streamkap consolidate the connector, processing, and delivery layers into a single cost center with predictable pricing.
Migration Prioritization Framework
Use this framework to sequence your migration backlog:
Tier 1: Migrate in the Next Quarter
- Pipelines where data freshness directly impacts revenue (pricing, inventory, campaign spend)
- Pipelines with simple logic (filter, rename, route) and one source/one destination
- Pipelines where the downstream consumer already supports streaming input
Tier 2: Migrate in 6–12 Months
- Pipelines with multiple source joins that require stateful enrichment
- Pipelines where downstream consumers need to be updated to handle streaming
- Pipelines with SLA obligations that require careful cutover planning
Tier 3: Do Not Migrate
- Batch-only sources (files deposited by third parties on a schedule)
- Pipelines where freshness does not affect any decision (annual regulatory reports, archived data)
- Pipelines with extremely complex business logic that is deeply embedded in batch SQL and would require a full rewrite
Common Pitfalls
Pitfall 1: Migrating Everything at Once
The most common failure mode. Teams decide to “go streaming” as a platform initiative and attempt to migrate 50 pipelines in parallel. Each migration uncovers issues that require time to resolve. The team is spread thin, nothing ships on time, and confidence in the initiative collapses.
Fix: Migrate one pipeline at a time. Build a repeatable process on the first migration before scaling.
Pitfall 2: Skipping the Parallel Run
Teams run both batch and streaming for a few days, see “roughly the same numbers,” and cut over. Two months later an analyst notices a 2% discrepancy in monthly revenue that traces back to a timezone bug in the streaming pipeline.
Fix: Two-week minimum parallel run with automated daily equivalence checks.
Pitfall 3: Forgetting Downstream Dependencies
Batch pipelines often have downstream consumers that depend on the output being complete (all records for the day available by 6am). Streaming pipelines produce a continuous trickle - “completeness” is no longer defined the same way.
Fix: Audit every downstream consumer before cutover. Decide whether they need a “completeness checkpoint” (a watermark or explicit signal that all events for a time window have arrived) or whether they can be adapted to query incrementally.
Pitfall 4: Using Processing Time Instead of Event Time
A team migrates a daily aggregate pipeline to streaming and is delighted by the low latency - until they notice that records arriving late (due to system outages at the source) are silently excluded from the wrong day’s aggregation.
Fix: Always use event time for business aggregations. Set allowed lateness to at least the maximum expected delay from the source system.
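The interaction between the watermark and allowed lateness can be sketched in a few lines; the bound values and the single-global-watermark model are simplifying assumptions (real engines track watermarks per partition or per key):

```python
# Sketch of watermark + allowed lateness. All numbers are illustrative.
# watermark = max event time seen so far, minus an out-of-orderness bound.
OUT_OF_ORDER_BOUND = 300   # seconds (assumed)
ALLOWED_LATENESS = 3600    # keep windows open 1 hour past the watermark (assumed)

max_event_time = 0

def classify(event_time: int) -> str:
    """Advance the watermark and classify the incoming event."""
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - OUT_OF_ORDER_BOUND
    if event_time >= watermark:
        return "on time"
    if event_time >= watermark - ALLOWED_LATENESS:
        return "late but accepted"   # window reopened, result re-emitted
    return "dropped"                 # lost even with allowed lateness

print(classify(10_000))  # on time (advances the watermark to 9,700)
print(classify(9_600))   # late but accepted
print(classify(5_000))   # dropped
```

With zero allowed lateness, the second event would be silently dropped, which is the failure mode described in this pitfall.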
Pitfall 5: No Monitoring for Lag
Batch pipelines fail visibly - the job does not complete. Streaming pipelines fail silently - the consumer lags behind but continues running. No one notices until the lag is hours or days and the pipeline is effectively batch again.
Fix: Alert on consumer lag, not just job health. A lag threshold of 15 minutes is a reasonable starting point for most pipelines.
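The check itself is simple; a minimal sketch of the alert condition, where the timestamp inputs are assumed to come from your broker and consumer metrics:

```python
# Sketch of lag-based alerting: compare the newest event produced with the
# newest event the consumer has processed. Input names are illustrative.
LAG_THRESHOLD_SECONDS = 15 * 60   # the 15-minute starting point above

def check_lag(latest_produced_ts: float, latest_consumed_ts: float) -> bool:
    """Return True if the pipeline should page someone."""
    lag = latest_produced_ts - latest_consumed_ts
    return lag > LAG_THRESHOLD_SECONDS

# Job reports "healthy" (still running) but is 40 minutes behind -
# effectively batch again, and job-health monitoring alone misses it:
print(check_lag(latest_produced_ts=10_000, latest_consumed_ts=7_600))  # True
```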
Summary
Migrating from batch to streaming is not a single project - it is an ongoing capability shift. The teams that succeed approach it as a sequence of contained migrations, each one building process and confidence before the next.
The five practices that separate successful migrations from failed ones: (1) choosing the right pipelines to migrate first based on business value versus complexity, (2) solving the initial load problem before enabling CDC, (3) running batch and streaming in parallel for at least two weeks with automated equivalence checks, (4) investing in the skills gap deliberately rather than assuming batch engineers will self-educate, and (5) monitoring consumer lag as a first-class production metric.
The reward for doing this carefully is a data platform that reflects the current state of the business - not the state from 12 hours ago.
For a deeper look at the CDC mechanics that power most streaming migrations, see the guide on Change Data Capture fundamentals. For the data preparation work that follows ingestion, see real-time data preparation for analytics.