Debezium Snapshot Strategies: Incremental, Parallel, and Hybrid Approaches for Production
How to match snapshot strategy to table size in Debezium-based pipelines: mode selection, WAL slot management, and production validation without downtime.
You trigger an initial snapshot on a 300-million-row orders table. The connector estimates 11 hours. Halfway through, PostgreSQL invalidates the replication slot because retained WAL exceeded max_slot_wal_keep_size. The snapshot fails. It can’t resume. You spend another 11 hours on the restart and hit the same ceiling.
This is the most common snapshot failure pattern in CDC deployments, and it’s entirely avoidable. The failure isn’t the connector — it’s the mismatch between snapshot mode and table profile. Picking the wrong mode for a large table doesn’t just slow things down; it creates a failure sequence where each retry costs as much as the first attempt.
This guide covers the three snapshot modes available in Debezium-based connectors, when to combine them into a hybrid strategy, how to tune WAL retention so slots survive long snapshots, and how to verify that what arrived at the destination actually matches what the source contained.
Why Snapshot Strategy Is an Architectural Decision
Most teams treat snapshot mode as a one-time setup choice. Pick a mode, start the connector, wait for the initial load, move on. At small scale, this works fine. At production scale — tables in the hundreds of millions of rows, dozens of tables, databases under continuous write load — the initial snapshot becomes its own operational project with its own risk surface.
Three failure modes appear at scale, and each one is driven by a different cause.
WAL bloat is the most common. A snapshot that runs for many hours can hold back the replication slot, which retains WAL from that slot’s restart_lsn forward.1 If the source database generates more WAL than the disk or max_slot_wal_keep_size allows before the snapshot finishes, PostgreSQL either invalidates the slot or fills the disk. Neither outcome is graceful.
Forced re-snapshot is the second. Blocking snapshots aren’t resumable. A failure halfway through means starting from row one. For a six-hour snapshot, that’s six hours added to the clock on every failure, compounding if the underlying condition isn’t fixed first.
Silent truncation is the third and least visible. A snapshot that reports success without row-count verification looks healthy until a downstream analytical query returns results that don’t match the source. The connector had no way to know rows were missing; the destination had no way to complain.
Matching snapshot mode to table characteristics is what prevents all three. The mode decision belongs at the same level as deciding how to set wal_level or how many replication slots to provision.
Snapshot Modes: When to Use Each
The three modes in Streamkap’s Debezium-based connectors differ in whether streaming continues during the snapshot, whether failure is recoverable, and whether the table needs a primary key.2
Incremental (Full and Filtered)
Full and filtered snapshots use watermarked chunked reads. Before reading each chunk, the connector writes a low watermark signal. After reading the chunk, it writes a high watermark. Streaming events that arrive between the two watermarks are deduplicated against the snapshot chunk. Streaming continues throughout the operation. If the snapshot fails, it resumes from the last committed chunk.
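The watermark deduplication described above can be sketched in a few lines. This is a simplified, in-memory illustration of the idea and not the connector's implementation; the function name, the event tuple shape, and the `id` key column are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def merge_chunk_with_window(
    chunk: List[Row],
    window_events: List[Tuple[str, Row]],
    key: str = "id",
) -> List[Row]:
    """Return the snapshot rows that are still safe to emit.

    `chunk` is the set of rows read between the low and high watermark.
    `window_events` are streaming events (op, row) seen in the same window.
    Any key touched by a streaming event is dropped from the chunk, because
    the log event reflects a newer state than the snapshot read.
    """
    buffered = {row[key]: row for row in chunk}
    for _op, row in window_events:
        # Create, update, or delete: the streamed change supersedes the read.
        buffered.pop(row[key], None)
    return list(buffered.values())


# Hypothetical chunk of three rows, two of which changed during the window.
chunk = [
    {"id": 1, "status": "new"},
    {"id": 2, "status": "new"},
    {"id": 3, "status": "new"},
]
window = [("u", {"id": 2, "status": "paid"}), ("d", {"id": 3})]

remaining = merge_chunk_with_window(chunk, window)
print(remaining)
```

Only row 1 is emitted as a snapshot read here. Rows 2 and 3 reach the destination through their streaming events instead, which is why the table needs a key the connector can match on.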
Full snapshots capture every row in the selected tables. Filtered snapshots capture a subset based on a WHERE clause — the right choice when you only need a specific date range back-filled, or when you’re recovering from a partial outage that affected only a bounded window of data. Both modes process tables sequentially: one table completes before the next begins.
The trade-off is throughput. Chunked reads run alongside the source’s normal write traffic, and the connector optimizes chunk size automatically, but source performance is still the ceiling. For a 300M-row table, an incremental snapshot may take 12–18 hours depending on row width, source load, and network latency. That duration is acceptable when streaming continuity matters and you want failures to stay small and resumable.
One behavior to understand before you go to production: rows updated or deleted while an incremental snapshot runs may appear out of order in the destination. You might see a read event followed by an update, or a read followed by a delete. The connector resolves these by matching snapshot chunks against the streaming log and deduplicating, but only when the same row appears in both tracks.3 Your destination should handle idempotent writes regardless. Upsert semantics at the destination remove the ambiguity entirely.
A surrogate key option is worth knowing. If a table has no natural primary key, you can specify an alternative column — a timestamp or auto-increment field — to drive chunking. This lets tables without primary keys use incremental mode rather than requiring a blocking snapshot.
Blocking
Blocking snapshots read all rows in a single transaction. Streaming pauses for the duration and resumes automatically when the snapshot completes. Multiple tables may run in parallel depending on connector configuration, which cuts total time when you have many large tables to snapshot concurrently.
Use blocking mode when streaming can tolerate a pause, when the table has no primary key and you don’t want to configure a surrogate key, or when point-in-time consistency across several related tables matters more than uninterrupted streaming. It’s also the faster option for large tables when you need the initial load finished quickly and your WAL retention can cover the pause window.4
Two constraints matter in production. First, blocking snapshots aren’t resumable. A mid-snapshot failure means a complete restart. On a table that takes four hours to snapshot, plan for that risk explicitly — schedule the snapshot during a low-traffic window or provision enough WAL retention to handle the worst case. Second, blocking snapshots hold a database lock for the duration. On high-traffic tables, that lock can create concurrency pressure that’s visible to application queries. Streamkap’s docs flag this explicitly: use blocking mode with care on tables under heavy concurrent write load.
There’s also a brief duplication window at snapshot completion. The transition back to streaming can emit a small number of duplicate events as the system synchronizes. Destination-side upserts or deduplication handle this cleanly.
Hybrid Strategies
No single mode fits every table in a typical schema. The practical approach is per-table mode selection, treating snapshot strategy as a configuration decision for each table rather than a connector-wide setting.
For a fresh connector with mixed table sizes, you might snapshot small tables (under 1M rows) with blocking mode to finish quickly, while running larger tables with Full (incremental) mode so streaming continues and failures stay contained. This isn’t a built-in feature — it’s a scheduling pattern: trigger the blocking snapshots first to completion, then kick off the incremental ones.
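As a rough illustration of that scheduling pattern, the sketch below assigns a mode per table and orders the blocking snapshots first. The `TableProfile` type, the function names, and the sample tables are hypothetical; the 1M-row threshold comes from the paragraph above and should be tuned to your own environment.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

SMALL_TABLE_ROWS = 1_000_000  # threshold from the text; an assumption to tune


@dataclass
class TableProfile:
    name: str
    row_count: int
    has_primary_key: bool
    surrogate_key: Optional[str] = None  # column to chunk on when there is no PK


def choose_mode(table: TableProfile) -> str:
    """Pick a snapshot mode from the table's size and key situation."""
    if table.row_count < SMALL_TABLE_ROWS:
        return "blocking"
    if not table.has_primary_key and table.surrogate_key is None:
        return "blocking"  # nothing to drive chunked reads
    return "incremental"


def plan(tables: List[TableProfile]) -> List[Tuple[str, str]]:
    """Return (table, mode) pairs with blocking snapshots scheduled first."""
    modes = [(t.name, choose_mode(t)) for t in tables]
    # sorted() is stable, so tables keep their relative order within each mode.
    return sorted(modes, key=lambda pair: pair[1] != "blocking")


tables = [
    TableProfile("orders", 300_000_000, has_primary_key=True),
    TableProfile("countries", 250, has_primary_key=True),
    TableProfile("audit_log", 50_000_000, has_primary_key=False),
    TableProfile("events", 80_000_000, has_primary_key=False, surrogate_key="created_at"),
]
schedule = plan(tables)
print(schedule)
```

The output is only a plan. Triggering each snapshot still happens through whatever mechanism your connector provides.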
A second hybrid pattern handles partial recovery. If a connector falls behind and only a subset of tables needs catching up, a filtered snapshot with a closed date range brings just that window back without re-snapshotting the full table. Use closed ranges: created_at >= '2026-06-01' AND created_at < '2026-06-08' rather than open-ended filters. Bounded scope keeps the operation predictable and the WAL footprint calculable.
For very large tables with primary keys, filtered time-range snapshots also let you split the work manually. Trigger a January filter, then a February filter, then a March filter, each running as an incremental snapshot over a manageable range of rows. Each range is resumable independently.
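One way to generate those monthly filters is sketched below. The helper name is hypothetical, and the `created_at` column is taken from the earlier example. Each filter is closed at the lower bound and open at the upper bound, so adjacent ranges neither overlap nor leave gaps.

```python
from datetime import date
from typing import List


def month_filters(column: str, start: date, end: date) -> List[str]:
    """Build WHERE filters that cover [start, end) one calendar month at a time."""
    filters = []
    lower = start
    while lower < end:
        # First day of the month after `lower`.
        if lower.month == 12:
            next_month = date(lower.year + 1, 1, 1)
        else:
            next_month = date(lower.year, lower.month + 1, 1)
        upper = min(next_month, end)
        filters.append(
            f"{column} >= '{lower.isoformat()}' AND {column} < '{upper.isoformat()}'"
        )
        lower = upper
    return filters


filters = month_filters("created_at", date(2026, 1, 1), date(2026, 4, 1))
for f in filters:
    print(f)
```

Run the resulting filters one after another, and start the next range only after the previous one has completed and its row counts have been checked.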
WAL Retention and Slot Management During Snapshots
The replication slot is the most fragile point during a long snapshot. It retains WAL from its restart_lsn forward, preventing PostgreSQL from recycling log segments that the connector hasn’t yet confirmed. Whenever streaming is paused or falls behind during a snapshot, the slot position stops advancing until the connector catches up. For a 14-hour snapshot on a busy source, that can mean up to 14 hours of accumulated WAL.5
Two PostgreSQL settings determine whether the slot survives. wal_keep_size sets a minimum WAL retention regardless of slot demand. max_slot_wal_keep_size (available from PostgreSQL 13) sets a ceiling: when a slot’s retained WAL exceeds this value, PostgreSQL removes the excess WAL and invalidates the slot (its wal_status in pg_replication_slots becomes lost) to protect disk space, without warning the connector.6
Before starting a long snapshot, calculate the exposure: measure how much WAL the source generates per hour at peak load, multiply by the expected snapshot duration, and add a 50% buffer. Confirm that max_slot_wal_keep_size is larger than that figure, or set it to -1 (unlimited) if disk space permits. Streamkap recommends a minimum 3-day WAL retention and 5 or more days if long snapshots are routine.7
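The exposure calculation is simple arithmetic, sketched here with made-up numbers (4 GB of WAL per hour at peak, a 14-hour snapshot). The function names are hypothetical. The only PostgreSQL-specific detail is that -1 means no limit.

```python
def required_wal_headroom_gb(
    peak_wal_gb_per_hour: float,
    snapshot_hours: float,
    buffer: float = 0.5,
) -> float:
    """Peak WAL rate times expected duration, plus a safety buffer (50% by default)."""
    return peak_wal_gb_per_hour * snapshot_hours * (1 + buffer)


def slot_limit_is_sufficient(max_slot_wal_keep_size_gb: float, needed_gb: float) -> bool:
    """True when the configured ceiling exceeds the estimate; -1 means unlimited."""
    return max_slot_wal_keep_size_gb == -1 or max_slot_wal_keep_size_gb > needed_gb


needed = required_wal_headroom_gb(peak_wal_gb_per_hour=4.0, snapshot_hours=14.0)
print(needed)                                # 84.0
print(slot_limit_is_sufficient(64, needed))  # False
print(slot_limit_is_sufficient(-1, needed))  # True
```

With an unlimited setting the check always passes, so the real constraint becomes free disk on the WAL volume. Compare the estimate against that as well.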
Monitor the slot’s backlog directly during the snapshot:
-- WAL retained for the slot: distance from restart_lsn to the current write position
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_name = 'streamkap_slot';
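Set a warning alert at 60% of your max_slot_wal_keep_size threshold and a page at 80%. The page leaves a short window to act: you can raise the setting, reduce the snapshot chunk rate so streaming gets more of the connector’s throughput, or pause the snapshot (if your connector supports it) so that streaming can catch up and the slot can advance. Pausing the connector itself does not help, because a paused connector stops confirming log positions and the slot retains even more WAL.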
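A minimal sketch of the two thresholds, assuming you already collect the slot's retained WAL in bytes (for example from the query above) and know the configured limit in bytes. The function name and return labels are hypothetical.

```python
def alert_level(
    retained_bytes: int,
    limit_bytes: int,
    warn_ratio: float = 0.60,
    page_ratio: float = 0.80,
) -> str:
    """Map retained WAL against the slot limit to 'ok', 'warn', or 'page'."""
    if limit_bytes <= 0:
        # max_slot_wal_keep_size = -1 (unlimited) has no ceiling to compare against.
        raise ValueError("limit_bytes must be positive; alert on free disk instead")
    ratio = retained_bytes / limit_bytes
    if ratio >= page_ratio:
        return "page"
    if ratio >= warn_ratio:
        return "warn"
    return "ok"


GIB = 1024 ** 3
print(alert_level(40 * GIB, 100 * GIB))  # ok
print(alert_level(65 * GIB, 100 * GIB))  # warn
print(alert_level(85 * GIB, 100 * GIB))  # page
```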
For low-traffic databases, heartbeats add a second layer of protection. The slot advances only when the connector confirms a log position, and it can only confirm positions for changes it receives. On a database with few changes to the captured tables, the slot position can stagnate even without a long snapshot, while other activity on the server keeps generating WAL. Heartbeat writes to a dedicated table give the connector regular changes to confirm, which keeps the slot position advancing and prevents accumulation during quiet periods.
Production Validation: Correctness and Catch-Up Lag
A snapshot that reports completion isn’t necessarily correct. Verifying correctness is a separate step that most teams skip until a downstream pipeline breaks.
Start with row count comparison. After the snapshot completes, count rows per table at the source and compare against the destination. A discrepancy above 0.1% warrants investigation before the pipeline goes live. Silent truncations, connector restarts that dropped a chunk, and destination write failures that weren’t surfaced in connector metrics all show up here — before they appear as wrong numbers in a report.
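The comparison itself can be as small as the sketch below. It assumes you have already collected per-table counts from both sides into dictionaries; how you collect them, and the table names used here, are hypothetical. The 0.1% tolerance is the one suggested above.

```python
from typing import Dict, List, Tuple


def count_mismatches(
    source: Dict[str, int],
    destination: Dict[str, int],
    tolerance: float = 0.001,  # 0.1%
) -> List[Tuple[str, int, int]]:
    """Return (table, source_count, destination_count) for tables outside tolerance."""
    flagged = []
    for table, src in source.items():
        dst = destination.get(table, 0)  # a missing table counts as zero rows
        if src == 0:
            drift = 0.0 if dst == 0 else 1.0
        else:
            drift = abs(src - dst) / src
        if drift > tolerance:
            flagged.append((table, src, dst))
    return flagged


source_counts = {"orders": 300_000_000, "customers": 1_000_000}
destination_counts = {"orders": 299_000_000, "customers": 999_500}

flagged = count_mismatches(source_counts, destination_counts)
print(flagged)
```

On a table that is still receiving writes, take both counts as close together in time as you can, or count up to a fixed cutoff on a timestamp column. Otherwise normal write traffic shows up as drift.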
For filtered snapshots, also verify the boundaries. If you snapshotted created_at >= '2026-06-01' AND created_at < '2026-06-08', confirm the destination contains rows from June 1 through June 7 and nothing outside that window from the snapshot pass. A filter misconfiguration that grabbed an extra week of data, or missed a boundary day, won’t cause a connector error. It produces wrong results silently.
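A boundary check along those lines might look like the following sketch. It assumes you can query the destination for the distinct calendar days present among rows written by the snapshot pass; that query and the function name are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Set, Tuple


def check_window(
    observed_days: Set[date],
    lower: date,
    upper: date,
) -> Tuple[List[date], List[date]]:
    """Compare observed days against the closed-open window [lower, upper).

    Returns (missing_days, unexpected_days).
    """
    expected = {lower + timedelta(days=i) for i in range((upper - lower).days)}
    missing = sorted(expected - observed_days)
    unexpected = sorted(observed_days - expected)
    return missing, unexpected


# Hypothetical result: June 1 to June 6 arrived, plus a stray May 31.
observed = {date(2026, 6, d) for d in range(1, 7)} | {date(2026, 5, 31)}
missing, unexpected = check_window(observed, date(2026, 6, 1), date(2026, 6, 8))
print(missing)     # [datetime.date(2026, 6, 7)]
print(unexpected)  # [datetime.date(2026, 5, 31)]
```

A missing day is a prompt to look at the source, not proof of a bug, since the source may simply have no rows for that day.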
After the snapshot finishes and streaming resumes, measure catch-up lag. The connector must stream all changes that accumulated during the snapshot window. For a 10-hour incremental snapshot on a busy table, that backlog can be substantial. Track lag until it returns to the pre-snapshot baseline. Lag that plateaus instead of declining points to a connector throughput problem separate from the snapshot itself — a different root cause requiring a different fix.
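Telling a declining lag from a plateau can be automated with a simple window check. The sketch below is one possible heuristic and not a Streamkap feature: the function name, the 5% minimum drop, and the sample values (lag in seconds) are all assumptions.

```python
from typing import Sequence


def lag_is_plateaued(
    samples: Sequence[float],
    baseline: float,
    min_drop: float = 0.05,
) -> bool:
    """True when lag is above baseline and fell by less than `min_drop`
    (as a fraction of the first sample) across the sampling window."""
    if len(samples) < 2:
        return False  # not enough data to judge a trend
    first, last = samples[0], samples[-1]
    if last <= baseline:
        return False  # already back to normal
    return (first - last) < first * min_drop


print(lag_is_plateaued([3600, 3550, 3590, 3580], baseline=5))  # True
print(lag_is_plateaued([3600, 2400, 1200, 300], baseline=5))   # False
print(lag_is_plateaued([4, 3, 4, 3], baseline=5))              # False
```

Pick a window long enough to smooth out bursts. A few minutes of flat lag during a write spike is not a plateau.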
For stall detection, Streamkap exposes SnapshotRunning and SnapshotCompleted metrics via the API.8 Set an alert if SnapshotRunning for a given table exceeds twice the expected duration. A stalled incremental snapshot isn’t the same as a slow one. A slow snapshot continues making progress; a stalled one is stuck at a chunk boundary and needs a connector restart or an offset reset to recover. Distinguishing the two early saves hours of investigation.
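The slow-versus-stalled distinction can be expressed as a small classifier. This sketch assumes you can sample some progress counter (rows or chunks completed) at two points in time. That counter is hypothetical: the text above names only the SnapshotRunning and SnapshotCompleted metrics, so check what your deployment actually exposes.

```python
def classify_snapshot(
    elapsed_seconds: float,
    expected_seconds: float,
    progress_now: int,
    progress_previous: int,
) -> str:
    """Return 'on_track', 'slow', or 'stalled'.

    The snapshot is only examined once it has run for more than twice the
    expected duration. Past that point, any forward progress since the
    previous sample means 'slow'; none means 'stalled'.
    """
    if elapsed_seconds <= 2 * expected_seconds:
        return "on_track"
    if progress_now > progress_previous:
        return "slow"
    return "stalled"


HOUR = 3600
print(classify_snapshot(10 * HOUR, 12 * HOUR, 150_000_000, 140_000_000))  # on_track
print(classify_snapshot(30 * HOUR, 12 * HOUR, 250_000_000, 249_000_000))  # slow
print(classify_snapshot(30 * HOUR, 12 * HOUR, 250_000_000, 250_000_000))  # stalled
```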
Where to next?
- Snapshot modes and triggering — Streamkap docs — Filtered, Full, and Blocking options with configuration details
- PostgreSQL CDC connector — CDC from PostgreSQL with slot health monitoring built in
Footnotes