Migrating from Batch to Streaming: A Practical Playbook
A step-by-step guide for teams moving from batch ETL to streaming pipelines - covering readiness assessment, parallel running, validation, and common pitfalls.
Introduction
Batch ETL is the default state for most data teams. Nightly jobs extract from operational databases, transform in a staging layer, load into a warehouse, and refresh dashboards. For years this was acceptable. Today the business tolerance for stale data is collapsing: marketing wants to know which campaigns are working in the last hour, not last night; operations wants inventory levels to update as warehouse scans happen, not tomorrow morning; finance wants to close the books continuously, not monthly.
The response is a migration from batch to streaming. But migrations done poorly are expensive, destabilizing, and often abandoned halfway through. This playbook is a practical guide for doing it right: assessing what to migrate, running batch and streaming in parallel, validating equivalence, managing the skills transition, and avoiding the traps that stall most teams.
Step 1: Assess Readiness - What to Migrate and When
The first and most important question is not “how do we migrate?” but “which pipelines should we migrate, and in what order?”
The Freshness Value Matrix
Not all pipelines benefit equally from streaming. Evaluate each pipeline across two dimensions:
- Business value of freshness - How much does the business benefit if this data is 5 minutes fresh instead of 12 hours fresh? Revenue impact? Decision quality? Customer experience?
- Migration complexity - How difficult is this pipeline to migrate? Consider: number of source systems, join complexity, downstream consumers, SLA sensitivity.
Map each pipeline to this matrix:
| Quadrant | Business Value | Complexity | Action |
|---|---|---|---|
| High value, low complexity | High | Low | Migrate first - quick wins |
| High value, high complexity | High | High | Migrate second - requires planning |
| Low value, low complexity | Low | Low | Migrate opportunistically |
| Low value, high complexity | Low | High | Do not migrate - not worth the cost |
Most teams discover that 20% of their pipelines account for 80% of the business value of freshness. Start there.
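The matrix can be encoded as a simple classifier. A minimal sketch in Python; the 1–5 scoring scale, the thresholds, and the example pipelines are illustrative assumptions, not part of the playbook:

```python
# Illustrative sketch: classify pipelines into the freshness-value matrix.
# The 1-5 scores and the >= 4 thresholds are assumptions you would
# calibrate with stakeholders, not fixed rules.

def quadrant(value: int, complexity: int) -> str:
    """Map freshness value and migration complexity to a recommended action."""
    high_value = value >= 4          # assumed threshold on a 1-5 scale
    high_complexity = complexity >= 4
    if high_value and not high_complexity:
        return "migrate first"
    if high_value and high_complexity:
        return "migrate second"
    if not high_value and not high_complexity:
        return "migrate opportunistically"
    return "do not migrate"

# Hypothetical pipelines scored as (value, complexity):
pipelines = {
    "campaign_spend": (5, 2),   # revenue-linked, simple filter/route
    "inventory_sync": (5, 5),   # revenue-linked, multi-source joins
    "annual_report":  (1, 5),   # no freshness benefit, deeply embedded SQL
}
for name, (v, c) in pipelines.items():
    print(f"{name}: {quadrant(v, c)}")
```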
Readiness Assessment Checklist
Before migrating any pipeline, confirm:
- The source system supports CDC or event streaming (has a transaction log, Kafka topic, or webhook)
- The downstream consumers can tolerate the output format change (streaming vs batch files)
- Monitoring for pipeline lag is in place, not just job health
- A rollback plan exists (keeping the batch pipeline active in parallel)
- Team has basic Kafka and stream processing familiarity
- Data retention policy for the streaming layer is defined
Step 2: Solve the Initial Load Problem First
This is the step most teams skip, and the one that causes the most pain. CDC streaming only captures changes from the moment it starts. Before your streaming pipeline goes live, you need to ensure it has access to the historical state of the source data.
The Three Approaches
Approach 1: Connector-managed snapshot (preferred)
Most production CDC tools handle this automatically. When you connect to a source database, the connector performs a full table scan and emits every existing row as an insert event, then switches to reading the transaction log. Your consumer treats the snapshot records the same as new inserts - no special handling required.
Phase 1: Full snapshot
  connector reads all rows → emits as INSERT events → consumer builds initial state

Phase 2: Incremental CDC
  connector reads binlog/WAL → emits INSERT/UPDATE/DELETE → consumer applies changes
This approach works well for tables up to a few hundred GB. For very large tables (TB scale), the snapshot can take hours and hold a read lock on the source, which may be unacceptable for production databases.
Approach 2: Historical backfill via separate batch job
Run a one-time batch export of the source table to the streaming destination (Kafka, S3, warehouse), then enable CDC streaming to pick up from the export point. Requires coordination between the batch export timestamp and the CDC start position in the transaction log - get this wrong and you will have a gap or duplicate records.
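The coordination rule is: record the log position first, then export. A toy in-memory model makes the ordering concrete; the table, change log, and `write` helper are stand-ins for illustration, not a real connector API:

```python
# Toy model of Approach 2: record the log position BEFORE taking the
# batch export, then replay CDC strictly from that recorded position.
# All structures here are in-memory stand-ins, not a real CDC tool.

table = {1: "a", 2: "b"}   # live source table: id -> value
change_log = []            # append-only transaction log: (position, key, value)

def write(key, value):
    """Apply a write to the source and append it to the transaction log."""
    change_log.append((len(change_log), key, value))
    table[key] = value

# 1. Record the current log position first...
export_position = len(change_log)
# 2. ...then take the batch export (a consistent snapshot of the table).
batch_export = dict(table)

# A write lands while the export is being loaded downstream:
write(3, "c")

# 3. Replay CDC from the recorded position - no gap, no duplicates.
replica = dict(batch_export)
for pos, key, value in change_log:
    if pos >= export_position:
        replica[key] = value

assert replica == table  # replica converges to the live source state
```

Reversing steps 1 and 2 is exactly the bug the text warns about: any write that lands between the export and the position capture is silently lost.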
Approach 3: Dual-write transition period
Keep the batch pipeline writing to the destination while the streaming pipeline warms up on a shadow table. Once the streaming pipeline has caught up to the current state and validated equivalence, cut over. Adds operational complexity but provides the safest migration path for mission-critical pipelines.
Deduplication at the Seam
Regardless of approach, the boundary between the initial load and the incremental stream creates a deduplication challenge. Events emitted during the snapshot may overlap with CDC events at the tail end of the snapshot. Ensure your consumer is idempotent:
-- Idempotent upsert: last-write-wins by event timestamp
MERGE INTO target_table AS t
USING source_event AS s
  ON t.primary_key = s.primary_key
WHEN MATCHED AND s.event_time > t.last_updated THEN
  UPDATE SET ...
WHEN NOT MATCHED THEN
  INSERT ...;
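For sinks without MERGE support, the same last-write-wins rule can live in the consumer. A minimal sketch; the `(pk, event_time, payload)` event shape is an assumption for illustration:

```python
# Minimal last-write-wins upsert state, keyed by primary key.
# Duplicate deliveries at the snapshot/CDC seam become no-ops.

state = {}  # pk -> (event_time, payload)

def apply_event(pk, event_time, payload):
    """Apply the event only if it is newer than what we already hold."""
    current = state.get(pk)
    if current is None or event_time > current[0]:
        state[pk] = (event_time, payload)

# Snapshot emits row 42, then the CDC tail re-delivers the same event:
apply_event(42, 100, {"status": "active"})
apply_event(42, 100, {"status": "active"})   # duplicate at the seam: ignored
apply_event(42, 105, {"status": "churned"})  # genuinely newer: applied

print(state[42])  # (105, {'status': 'churned'})
```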
Step 3: Run Batch and Streaming in Parallel
This is the most important step in the migration, and the one teams are most tempted to shortcut in a rush to decommission the old pipeline. Running both pipelines simultaneously allows you to:
- Validate that streaming output matches batch output
- Identify edge cases that your streaming logic does not handle
- Build confidence before cutting over downstream consumers
- Maintain a fallback if streaming has problems
Parallel Architecture
Source Database
│
├── Batch Pipeline (existing)
│ └── Target Table (batch)
│
└── CDC Streaming Pipeline (new)
└── Target Table (streaming)
Validation Job
├── Reads from: Target Table (batch)
├── Reads from: Target Table (streaming)
└── Emits: Discrepancy report
Run both pipelines for a minimum of two weeks before cutting over. One week catches daily patterns; two weeks catches weekly patterns (different behavior on weekends, end-of-week batch jobs that affect source data).
What to Validate
A parallel validation job should compare:
| Check | How to Measure | Acceptable Threshold |
|---|---|---|
| Row count equality | COUNT(*) per day partition | 0% difference |
| Key completeness | Missing PKs in streaming vs batch | 0 missing |
| Metric totals | SUM(revenue), COUNT(users) | < 0.01% difference |
| NULL rate by column | COUNT(*) FILTER (WHERE col IS NULL) | Match within 0.1% |
| Max/min values | Boundary checks per column | Exact match |
| Late arrival handling | Events arriving > 1 hour late | Accounted for in streaming |
Document and investigate every discrepancy, no matter how small. A 0.001% row count difference sounds trivial but represents real missing or duplicated rows that someone will eventually notice.
Step 4: Validate Data Equivalence
Data equivalence validation is a discipline, not a one-time check. Automate it and run it continuously during the parallel period.
Building a Validation Framework
A simple validation framework compares the two outputs daily:
def validate_pipeline_equivalence(batch_conn, stream_conn, table, date):
    # Identical aggregate query, run against both outputs
    metrics_sql = f"""
        SELECT
            COUNT(*) AS row_count,
            COUNT(DISTINCT customer_id) AS unique_customers,
            SUM(revenue) AS total_revenue,
            MAX(event_time) AS latest_event
        FROM {table}
        WHERE event_date = '{date}'
    """
    batch_metrics = query(batch_conn, metrics_sql)
    stream_metrics = query(stream_conn, metrics_sql)

    discrepancies = []
    for metric in ['row_count', 'unique_customers', 'total_revenue']:
        b, s = batch_metrics[metric], stream_metrics[metric]
        if b != s:
            discrepancies.append({
                'metric': metric,
                'batch': b,
                'stream': s,
                'delta_pct': abs(b - s) / b * 100 if b else None,
            })
    return discrepancies
Run this daily and alert on any discrepancy above your threshold. Fix the root cause before it accumulates.
Common Equivalence Failures
| Failure | Root Cause | Fix |
|---|---|---|
| Missing rows in streaming | CDC connector missed events during setup | Verify connector offset; re-snapshot if needed |
| Extra rows in streaming | Duplicate events from at-least-once delivery | Add deduplication on primary key + event time |
| Wrong aggregates | Timezone mismatch between batch and streaming | Normalize both to UTC |
| Late rows only in batch | Batch included late data; streaming used strict watermarks | Increase allowed lateness in streaming |
| NULL rate mismatch | Batch applied NULL coalescing; streaming did not | Apply same transformation in streaming layer |
Step 5: Manage the Skills Gap
Batch ETL and stream processing require genuinely different mental models. A data engineer who is expert in SQL, dbt, and Airflow may have never worked with Kafka consumer groups, event time, watermarks, or stateful operators. This is not a criticism - it is a structural gap that must be addressed deliberately.
Skills Required for Streaming
| Skill Area | Batch Equivalent | Streaming Concept |
|---|---|---|
| Data movement | SQL INSERT INTO ... SELECT | Kafka consumer groups, offsets |
| Time-based logic | WHERE date = yesterday | Event time, watermarks, allowed lateness |
| Deduplication | SELECT DISTINCT or ROW_NUMBER() | Exactly-once semantics, idempotent sinks |
| Monitoring | Job completion status | Consumer lag, throughput, checkpoint duration |
| Failure recovery | Re-run the job | Checkpoint restore, offset reset |
| Schema management | dbt schema changes | Schema registry, compatibility modes |
Ramp-Up Path
A practical ramp-up for a batch-oriented data engineer:
- Week 1–2: Kafka fundamentals - producers, consumers, topics, partitions, consumer groups. Build a toy pipeline that reads from Kafka and writes to S3.
- Week 3–4: Stream SQL - write Flink SQL or KSQL queries that filter, aggregate, and join event streams. Focus on tumbling windows and simple enrichment.
- Week 5–6: Event time and watermarks - understand why processing time is wrong for analytics, how watermarks work, and how to handle late data.
- Week 7–8: Stateful processing - keyed state, exactly-once semantics, checkpointing. Build a pipeline with per-key aggregation.
- Month 3: Operate a production pipeline - monitoring consumer lag, responding to incidents, performing upgrades without data loss.
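To make the event-time idea from weeks 5–6 concrete before touching Flink, here is a dependency-free sketch of a tumbling event-time window; the one-hour window size and the event tuples are illustrative:

```python
from collections import defaultdict

# Tumbling one-hour windows keyed by EVENT time, not arrival time.
# Pure-Python sketch of the concept - a real job would use Flink or KSQL.
WINDOW_SECONDS = 3600

def window_start(event_time: int) -> int:
    """Floor the event's own timestamp to its window; arrival order is irrelevant."""
    return event_time - (event_time % WINDOW_SECONDS)

counts = defaultdict(int)
# (event_time, user) pairs arriving OUT of order, as real streams do:
events = [(3600, "a"), (7205, "b"), (3700, "c"), (7300, "a")]
for event_time, user in events:
    counts[window_start(event_time)] += 1

print(dict(counts))  # {3600: 2, 7200: 2}
```

Because the window key comes from the event's own timestamp, the late-arriving `(3700, "c")` still lands in the correct hour, which is exactly what processing-time windowing gets wrong.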
Step 6: Cost Comparison and Surprises
Streaming has a different cost structure than batch. Teams often discover this after migration.
Batch vs Streaming Cost Comparison
| Cost Component | Batch | Streaming |
|---|---|---|
| Compute | Intermittent (runs during job) | Continuous (always running) |
| Storage | Large files, efficient compression | Small messages, more metadata overhead |
| Network egress | One large transfer per cycle | Continuous small transfers |
| Infrastructure | Spin up and down with jobs | Persistent cluster required |
| Operational | Less frequent; issues found at run time | More frequent; issues surface in real time |
For many pipelines, streaming is more expensive on a pure infrastructure basis. The justification is business value: the cost of stale data (missed fraud, slow customer response, wrong inventory decisions) typically exceeds the incremental infrastructure cost.
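The shape of the trade-off is easy to put in numbers. A back-of-envelope sketch; every rate and duration below is a made-up assumption to be replaced with your own cloud pricing:

```python
# Back-of-envelope compute comparison: intermittent batch vs always-on
# streaming. All prices and durations are illustrative assumptions.
HOURS_PER_MONTH = 730

batch_hours_per_day = 1.5      # nightly job runtime (assumed)
batch_rate = 4.00              # $/hr, large short-lived batch cluster (assumed)
batch_monthly = batch_hours_per_day * 30 * batch_rate

stream_rate = 0.60             # $/hr, small always-on cluster (assumed)
stream_monthly = HOURS_PER_MONTH * stream_rate

print(f"batch:     ${batch_monthly:,.0f}/mo")    # $180/mo
print(f"streaming: ${stream_monthly:,.0f}/mo")   # $438/mo
```

Even with a much smaller cluster, the always-on pipeline costs more per month; the question the text poses is whether the business value of freshness covers that gap.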
Use managed streaming platforms to control costs. A self-managed Kafka + Flink cluster for a small pipeline can cost more in engineering time than in infrastructure. Managed solutions like Streamkap consolidate the connector, processing, and delivery layers into a single cost center with predictable pricing.
Migration Prioritization Framework
Use this framework to sequence your migration backlog:
Tier 1: Migrate in the Next Quarter
- Pipelines where data freshness directly impacts revenue (pricing, inventory, campaign spend)
- Pipelines with simple logic (filter, rename, route) and one source/one destination
- Pipelines where the downstream consumer already supports streaming input
Tier 2: Migrate in 6–12 Months
- Pipelines with multiple source joins that require stateful enrichment
- Pipelines where downstream consumers need to be updated to handle streaming
- Pipelines with SLA obligations that require careful cutover planning
Tier 3: Do Not Migrate
- Batch-only sources (files deposited by third parties on a schedule)
- Pipelines where freshness does not affect any decision (annual regulatory reports, archived data)
- Pipelines with extremely complex business logic that is deeply embedded in batch SQL and would require a full rewrite
Common Pitfalls
Pitfall 1: Migrating Everything at Once
The most common failure mode. Teams decide to “go streaming” as a platform initiative and attempt to migrate 50 pipelines in parallel. Each migration uncovers issues that require time to resolve. The team is spread thin, nothing ships on time, and confidence in the initiative collapses.
Fix: Migrate one pipeline at a time. Build a repeatable process on the first migration before scaling.
Pitfall 2: Skipping the Parallel Run
Teams run both batch and streaming for a few days, see “roughly the same numbers,” and cut over. Two months later an analyst notices a 2% discrepancy in monthly revenue that traces back to a timezone bug in the streaming pipeline.
Fix: Two-week minimum parallel run with automated daily equivalence checks.
Pitfall 3: Forgetting Downstream Dependencies
Batch pipelines often have downstream consumers that depend on the output being complete (all records for the day available by 6am). Streaming pipelines produce a continuous trickle - “completeness” is no longer defined the same way.
Fix: Audit every downstream consumer before cutover. Decide whether they need a “completeness checkpoint” (a watermark or explicit signal that all events for a time window have arrived) or whether they can be adapted to query incrementally.
Pitfall 4: Using Processing Time Instead of Event Time
A team migrates a daily aggregate pipeline to streaming and is delighted by the low latency - until they notice that records arriving late (due to system outages at the source) are silently excluded from the wrong day’s aggregation.
Fix: Always use event time for business aggregations. Set allowed lateness to at least the maximum expected delay from the source system.
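The interaction between the watermark and allowed lateness can be sketched in a few lines; the bound values and the single-global-watermark model are simplifying assumptions (real engines track watermarks per partition or per key):

```python
# Sketch of watermark + allowed lateness. All numbers are illustrative.
# watermark = max event time seen so far, minus an out-of-orderness bound.
OUT_OF_ORDER_BOUND = 300   # seconds (assumed)
ALLOWED_LATENESS = 3600    # keep windows open 1 hour past the watermark (assumed)

max_event_time = 0

def classify(event_time: int) -> str:
    """Advance the watermark and classify the incoming event."""
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - OUT_OF_ORDER_BOUND
    if event_time >= watermark:
        return "on time"
    if event_time >= watermark - ALLOWED_LATENESS:
        return "late but accepted"   # window reopened, result re-emitted
    return "dropped"                 # lost even with allowed lateness

print(classify(10_000))  # on time (advances the watermark to 9,700)
print(classify(9_600))   # late but accepted
print(classify(5_000))   # dropped
```

With zero allowed lateness, the second event would be silently dropped, which is the failure mode described in this pitfall.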
Pitfall 5: No Monitoring for Lag
Batch pipelines fail visibly - the job does not complete. Streaming pipelines fail silently - the consumer lags behind but continues running. No one notices until the lag is hours or days and the pipeline is effectively batch again.
Fix: Alert on consumer lag, not just job health. A lag threshold of 15 minutes is a reasonable starting point for most pipelines.
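The check itself is simple; a minimal sketch of the alert condition, where the timestamp inputs are assumed to come from your broker and consumer metrics:

```python
# Sketch of lag-based alerting: compare the newest event produced with the
# newest event the consumer has processed. Input names are illustrative.
LAG_THRESHOLD_SECONDS = 15 * 60   # the 15-minute starting point above

def check_lag(latest_produced_ts: float, latest_consumed_ts: float) -> bool:
    """Return True if the pipeline should page someone."""
    lag = latest_produced_ts - latest_consumed_ts
    return lag > LAG_THRESHOLD_SECONDS

# Job reports "healthy" (still running) but is 40 minutes behind -
# effectively batch again, and job-health monitoring alone misses it:
print(check_lag(latest_produced_ts=10_000, latest_consumed_ts=7_600))  # True
```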
Summary
Migrating from batch to streaming is not a single project - it is an ongoing capability shift. The teams that succeed approach it as a sequence of contained migrations, each one building process and confidence before the next.
The five practices that separate successful migrations from failed ones: (1) choosing the right pipelines to migrate first based on business value versus complexity, (2) solving the initial load problem before enabling CDC, (3) running batch and streaming in parallel for at least two weeks with automated equivalence checks, (4) investing in the skills gap deliberately rather than assuming batch engineers will self-educate, and (5) monitoring consumer lag as a first-class production metric.
The reward for doing this carefully is a data platform that reflects the current state of the business - not the state from 12 hours ago.
For a deeper look at the CDC mechanics that power most streaming migrations, see the guide on Change Data Capture fundamentals. For the data preparation work that follows ingestion, see real-time data preparation for analytics.