
Engineering

February 25, 2026

13 min read

Zero-Downtime Database Migration with Change Data Capture

A practical engineering guide to migrating databases without downtime using Change Data Capture - covering the full process from initial sync through cutover, with validation strategies and common pitfalls.

TL;DR:

  • Traditional database migrations require downtime because writes must be frozen while data is copied; any write made after the dump begins would otherwise be lost. CDC eliminates this by continuously streaming changes from the source to the target for as long as the migration runs.
  • The process has three phases: initial bulk load, CDC-based continuous sync, and cutover - with validation at each stage.
  • Cutover is the most critical moment: plan it as a read-only window of seconds, not minutes.
  • Common failure modes include schema drift, sequence drift, foreign key constraint violations, and character encoding mismatches.

Database migrations are one of the most nerve-wracking operations in software engineering. The traditional approach - take the application offline, run the migration, bring it back up - is increasingly unacceptable for production systems. Users expect availability. Revenue depends on uptime. A maintenance window that once felt routine now carries significant business risk.

Change Data Capture (CDC) makes zero-downtime database migration achievable. This guide walks through the full process: why CDC works for migrations, the three-phase migration playbook, validation strategies, cutover execution, and the pitfalls that catch teams off guard.

Why Traditional Migrations Cause Downtime

In a traditional database migration, the process looks like this:

  1. Put the application in maintenance mode.
  2. Dump the source database.
  3. Import the dump into the target database.
  4. Run schema migrations on the target.
  5. Update application configuration to point to the new database.
  6. Bring the application back up.

The maintenance window is necessary because the source database is live during the dump, and any writes that happen after the dump starts are not captured in the dump file. If you redirect traffic while there are missing writes, your data is inconsistent. So you freeze writes, dump, import, redirect.

The problem is that the dump and import of a large database can take hours. For a 500GB database with complex indexes and foreign keys, the import alone can run for 6–12 hours. That is 6–12 hours of maintenance window - completely unacceptable for most production systems.

CDC solves this by replacing the “dump and restore” window with a continuous sync. Instead of a one-time snapshot, CDC streams every change from the source to the target in near real-time, continuously, until you are ready to cut over.

How CDC-Based Migration Works

CDC-based migration has three phases:

Phase 1: Initial Bulk Load - A full copy of the source database is loaded into the target. This can take hours or days for large databases, but that is fine - the source remains fully operational throughout.

Phase 2: Continuous CDC Sync - While the bulk load is in progress (and after it completes), CDC streams every INSERT, UPDATE, and DELETE from the source database to the target. This keeps the target synchronized with the source, narrowing the gap until it reaches near-zero lag.

Phase 3: Cutover - When the team is confident in the target’s correctness and replication lag is consistently low, the cutover is executed: writes to the source are briefly stopped, lag is confirmed to reach zero, and application traffic is redirected to the target. The maintenance window is measured in seconds.

Phase 1: Initial Bulk Load

The bulk load establishes the baseline state of the target database. There are several approaches:

Database Dump and Restore

For most databases, the native dump utility is the simplest starting point:

# PostgreSQL
pg_dump --no-owner --no-acl --format=custom source_db > dump.pgdump
pg_restore --no-owner --no-acl -d target_db dump.pgdump

# MySQL
mysqldump --single-transaction --set-gtid-purged=OFF source_db > dump.sql
mysql target_db < dump.sql

For large databases, use parallel restore options:

pg_restore --jobs=8 --no-owner -d target_db dump.pgdump

Export/Import via ETL

For cross-engine migrations (e.g., MySQL to PostgreSQL), you cannot use native dump/restore. Use a tool like AWS Database Migration Service (DMS), pgloader, or a custom ETL script that reads from the source and writes to the target with appropriate type mapping.

Handling the Bulk Load / CDC Boundary

This is the most critical technical detail in the entire migration: the bulk load and the CDC stream must be consistent. Specifically, you need to know the exact replication log position at the time the bulk load snapshot was taken, so the CDC stream starts from that position - not from the beginning of the log (which may not exist) and not from “now” (which would miss changes made during the bulk load).

Most CDC tools handle this automatically. Debezium, for example, takes a consistent snapshot of the database and records the LSN (Log Sequence Number in PostgreSQL) or binlog position (in MySQL) at the time of the snapshot. The CDC stream then starts from that recorded position, ensuring that changes made during the snapshot are not missed.
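As an illustration, a Debezium connector registration for PostgreSQL might look like the following sketch; hostnames, credentials, the slot name, and the table list are placeholders. With snapshot.mode set to initial, Debezium takes the consistent snapshot and then streams from the LSN it recorded:

```json
{
  "name": "orders-migration-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "source-db.internal",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "change-me",
    "database.dbname": "source_db",
    "topic.prefix": "migration",
    "plugin.name": "pgoutput",
    "slot.name": "migration_slot",
    "snapshot.mode": "initial",
    "table.include.list": "public.orders,public.order_items,public.customers"
  }
}
```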

If you are using DMS or another tool for the bulk load, consult the documentation for how it handles the initial snapshot position. This is non-negotiable - getting this wrong means you start CDC from an inconsistent position, and silent data corruption follows.

Phase 2: Continuous CDC Sync

With the bulk load complete and CDC running from the correct log position, the target database begins receiving a continuous stream of changes. At this point, your migration is in a steady state:

  • Source database: fully operational, receiving application writes
  • CDC pipeline: streaming every change from source to target
  • Target database: catching up and then maintaining near-zero lag

During this phase, monitor replication lag continuously. Lag is the delay between a write committing on the source and the equivalent write landing on the target. Healthy lag for a CDC migration is typically under 1 second for light-to-moderate workloads. Sustained high lag (minutes or more) indicates a bottleneck - often in the CDC pipeline throughput, the target database’s write capacity, or schema differences causing apply errors.
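On PostgreSQL, one way to observe this lag is to check how far the CDC replication slot’s confirmed position trails the current WAL position. The slot name here (migration_slot) is a placeholder for whatever slot your connector created:

```sql
-- Bytes of WAL the CDC consumer has not yet confirmed
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_name = 'migration_slot';
```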

What to Do During This Phase

This is the right time to:

Run validation: work through the levels in the Validation Strategies section below.

Test the target under load: Before you cut over, connect your staging or test environments to the target database and run your application’s test suite against it. This surfaces issues with query performance, index differences, and type coercions that will not show up in row count validation.

Perform a trial cutover: Execute the full cutover procedure against a non-production environment. This builds operational confidence and reveals procedural gaps before the real thing.

Optimize the target: Add indexes, adjust configuration parameters, and tune the target database to match or exceed source performance. This is also the right time to make schema improvements - adding new indexes or restructuring tables that were constraints on the source.

Validation Strategies

Do not trust that the migration is correct without validating it. Validation has three levels:

Level 1: Row Count Validation

The minimum bar. Compare row counts for every table between source and target:

-- Run on source
SELECT 'orders' AS table_name, COUNT(*) AS row_count FROM orders
UNION ALL
SELECT 'order_items', COUNT(*) FROM order_items
UNION ALL
SELECT 'customers', COUNT(*) FROM customers;

Run the same query on the target and compare. Row count differences indicate missing or duplicate rows. Note that during CDC sync, small differences are expected (due to lag) - check that counts converge over time rather than diverge.
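The convergence check is easy to script. The sketch below uses illustrative numbers; in practice the two count dictionaries would come from running the query above against each database:

```python
def count_deltas(source_counts, target_counts):
    """Per-table difference between source and target row counts.

    Positive delta = target is behind; negative = target has extra rows.
    """
    return {
        table: source_counts[table] - target_counts.get(table, 0)
        for table in source_counts
    }

def is_converging(earlier_deltas, later_deltas):
    """True if every table's absolute delta shrank or held steady."""
    return all(
        abs(later_deltas[t]) <= abs(earlier_deltas[t])
        for t in earlier_deltas
    )

# Two snapshots taken a minute apart (illustrative numbers)
t0 = count_deltas({"orders": 1000, "customers": 500},
                  {"orders": 940, "customers": 498})
t1 = count_deltas({"orders": 1010, "customers": 505},
                  {"orders": 1008, "customers": 505})
print(is_converging(t0, t1))  # lag is shrinking, so True
```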

Level 2: Checksum Validation

Row counts confirm quantity. Checksums confirm content. For each table, compute a checksum over a representative sample of rows:

-- PostgreSQL: checksum a sample of rows
SELECT md5(string_agg(md5(sample::text), '' ORDER BY id))
FROM (SELECT * FROM orders ORDER BY id LIMIT 10000) AS sample;

Compare the result on source and target. Differences indicate rows that exist on both sides but have different content - often caused by type coercion issues, timezone differences, or encoding mismatches.

Level 3: Application-Level Validation

Run your application’s test suite against the target. Read paths (SELECT queries) should return identical results. Write paths should behave correctly. This catches issues that row-level validation misses: a missing index changing a query plan, functions that behave differently across database versions, and trigger semantics that differ between engines.

For critical tables (financial data, user records), consider running dual-read validation: send each read to both source and target and compare the responses. Any discrepancy triggers an alert. This is the gold standard for high-stakes migrations but requires application-level integration.
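A minimal sketch of the dual-read comparison. The two read functions stand in for real database lookups; here they are plain dictionaries for illustration:

```python
def find_mismatches(keys, read_source, read_target):
    """Read each key from both databases and collect any differing results.

    read_source / read_target are hypothetical callables that fetch a row
    by primary key from each side.
    """
    mismatches = []
    for key in keys:
        src, tgt = read_source(key), read_target(key)
        if src != tgt:
            mismatches.append((key, src, tgt))
    return mismatches

source = {1: "alice", 2: "bob", 3: "carol"}
target = {1: "alice", 2: "bob", 3: "CAROL"}  # one drifted row
print(find_mismatches([1, 2, 3], source.get, target.get))
# [(3, 'carol', 'CAROL')]
```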

Cutover Execution

Cutover is the most operationally sensitive moment in the migration. It should be planned precisely and rehearsed.

Pre-Cutover Checklist

Before initiating cutover, confirm:

  • Replication lag is consistently below 1 second (not just occasionally low)
  • Row count validation passed for all critical tables
  • Application test suite passes against the target
  • The target database connection string is ready in all deployment configs
  • Rollback plan is documented and team is briefed
  • The migration team is present and on a communication channel
  • A monitoring dashboard is open showing both source and target metrics

The Cutover Procedure

Step 1: Stop writes to the source

This can be done by:

  • Putting the application in read-only mode (preferred - application continues serving reads)
  • Shutting down the application entirely
  • Revoking write access to the source database at the network or database level

The read-only approach keeps users in the application. The maintenance window is for writes only.

Step 2: Wait for replication lag to reach zero

Monitor your CDC pipeline’s lag metric. Wait until it reports zero (or the target row counts match the source exactly). This step should take seconds to tens of seconds if lag was already low. If it takes minutes, your pre-cutover lag was higher than you thought - investigate before proceeding.
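The drain wait is worth scripting so it cannot be rushed under pressure. This sketch assumes a get_lag_seconds callable wrapping your pipeline's lag metric; the clock and sleep functions are injectable so the logic is testable:

```python
import time

def wait_for_drain(get_lag_seconds, timeout=60.0, poll_interval=1.0,
                   now=time.monotonic, sleep=time.sleep):
    """Poll a lag metric until it reaches zero or the timeout expires.

    get_lag_seconds is a hypothetical callable wrapping the CDC
    pipeline's lag metric. Returns True if lag drained in time.
    """
    deadline = now() + timeout
    while now() < deadline:
        if get_lag_seconds() <= 0:
            return True
        sleep(poll_interval)
    return False

# Simulated pipeline whose lag drains over three polls
readings = iter([2.5, 0.8, 0.0])
print(wait_for_drain(lambda: next(readings), timeout=10,
                     poll_interval=0, sleep=lambda s: None))  # True
```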

Step 3: Validate

Run a final row count check on your critical tables. This takes seconds with a prepared script and confirms that no rows were lost between stopping writes and the CDC pipeline draining.

Step 4: Redirect traffic

Update the database connection string in your application configuration and restart application instances (or apply the configuration change through your deployment system). DNS-based routing, load balancer connection string management, or application environment variable updates are common mechanisms.

Step 5: Verify

Immediately after redirecting traffic:

  • Check application error rates
  • Check that writes to the target are succeeding
  • Verify that reads are returning expected data
  • Run a smoke test against your most critical user flows

Step 6: Keep the source available for rollback

Do not shut down the source database immediately. Keep it available (read-only or in standby) for at least 24 hours. If a critical issue surfaces after cutover that requires rollback, you want the source to be there.

Common Pitfalls

Sequence / Auto-Increment Drift

In databases with auto-increment primary keys or sequences, the target sequence may fall behind the source. If the source has inserted rows up to ID 1,000,000 but the target sequence is at 500,000, new inserts on the target will conflict with replicated rows. Before cutover, advance all sequences on the target to a value higher than the source’s current maximum:

-- PostgreSQL: advance the sequence
SELECT setval('orders_id_seq', (SELECT MAX(id) FROM orders) + 10000);
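MySQL has no setval. The equivalent is to compute the new value first (e.g. SELECT MAX(id) + 10000 FROM orders;) and then reset the auto-increment counter; the table and value here follow the example above:

```sql
-- MySQL: advance the auto-increment counter past the source maximum
ALTER TABLE orders AUTO_INCREMENT = 1010000;
```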

Foreign Key Constraint Failures

CDC streams changes in the order they committed on the source. This usually means parent rows arrive before child rows. But if you have circular foreign keys, or if your bulk load loaded tables in the wrong order, you may see constraint violations on the target.

During the migration, consider deferring foreign key enforcement on the target (in PostgreSQL, SET session_replication_role = replica; disables triggers for the session, including the internal triggers that enforce foreign keys). Re-enable normal enforcement before cutover and validate the integrity of your target.
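Once enforcement is back on, a quick orphan check confirms referential integrity. This assumes the article's example schema, with order_items.order_id referencing orders.id:

```sql
-- Restore normal trigger behavior for this session
SET session_replication_role = DEFAULT;

-- Child rows whose parent order is missing (should return zero rows)
SELECT oi.id
FROM order_items oi
LEFT JOIN orders o ON o.id = oi.order_id
WHERE o.id IS NULL;
```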

Character Encoding and Collation Mismatches

If the source is latin1 and the target is utf8mb4, you may lose data for characters outside the latin1 range, or see encoding errors during the migration. If the source collation is utf8_general_ci (case-insensitive) and the target is utf8_bin (binary/case-sensitive), queries that worked on the source may return different results on the target.

Identify encoding and collation differences before the migration and resolve them explicitly.
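On MySQL, one way to surface per-table differences is to query information_schema on both sides and diff the results (source_db here is a placeholder schema name):

```sql
-- MySQL: list each table's character set and collation
SELECT t.table_name,
       ccsa.character_set_name,
       t.table_collation
FROM information_schema.tables t
JOIN information_schema.collation_character_set_applicability ccsa
  ON ccsa.collation_name = t.table_collation
WHERE t.table_schema = 'source_db';
```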

Schema Drift During the Migration

If your development team is running schema migrations against the source database while the CDC migration is in progress, you need to apply equivalent migrations to the target proactively. A schema migration on the source (e.g., ALTER TABLE orders ADD COLUMN discount_pct NUMERIC) will cause CDC events with the new column to arrive at the target, which does not have the column yet, causing apply errors and lag.

Establish a protocol during the migration window: schema changes to the source must be coordinated and applied to the target within the same deployment.

Using Streamkap for Database Migration

Running a CDC-based migration manually requires configuring a CDC tool (Debezium, DMS, or another connector), managing the replication slot on the source database, monitoring lag, and handling restarts gracefully. For teams that want a managed migration experience, Streamkap provides a platform that handles the CDC pipeline end-to-end.

You define the source database, the target, and the tables to migrate. Streamkap handles the initial snapshot position, streams changes continuously, provides a lag monitoring dashboard, and surfaces schema drift and apply errors in the UI rather than in log files. The cutover process remains in your hands - you control when and how traffic moves - but the complexity of the sync infrastructure is managed for you.

This is particularly useful for teams running multiple simultaneous migrations (e.g., migrating several microservices databases in parallel) or for organizations that need audit trails of what data moved when.

Summary

Zero-downtime database migration with CDC is the standard approach for production systems that cannot tolerate hours-long maintenance windows. The process is straightforward in principle - bulk load, continuous sync, cutover - but requires careful attention to the consistency boundary between the bulk load and the CDC stream, thorough validation at multiple levels, and a well-rehearsed cutover procedure.

The payoff is significant: a cutover window measured in seconds rather than hours, full rollback capability, and the ability to run validation in production conditions before committing to the new database. For any migration involving a database that serves real user traffic, CDC-based migration is the approach that preserves both data integrity and service availability.