Debezium Initial Snapshot: Strategies to Speed It Up
Debezium's initial snapshot can take hours or days on large databases. Learn about snapshot modes, performance bottlenecks, and practical strategies to get through the snapshot phase faster.
When you first start a Debezium connector, it does not jump straight to reading the transaction log. It reads the entire contents of every table you have configured for capture. This is the initial snapshot, and on large databases, it is the single biggest obstacle to getting CDC running in production.
A 50-million-row orders table does not snapshot in seconds. It can take hours. If the connector crashes mid-snapshot, you might have to start over. If the database is under heavy production load, the snapshot competes for I/O and memory with your application queries.
This guide covers what is happening during the snapshot, why it is slow, and what you can do about it.
Why Snapshots Exist
A database transaction log (the WAL in PostgreSQL, the binlog in MySQL) is not a complete record of your data. It is a record of recent changes. The log gets recycled as segments age out or checkpoints advance. If you started reading the transaction log right now, you would get INSERT, UPDATE, and DELETE events for new activity, but you would have no idea what already existed in the database before you started listening.
The snapshot fills that gap. Think of it like this: you want to record a meeting, but you arrived ten minutes late. The snapshot is the summary someone hands you of what was already discussed. The transaction log is the live recording going forward. Without that summary, you are missing context.
Debezium reads every row from every captured table via standard SELECT queries, publishes each row as a “read” event (with op: "r") to Kafka, and then switches to streaming mode to capture ongoing changes.
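A snapshot row arrives in the same change-event envelope as streamed changes, just with `op` set to `"r"`. A trimmed sketch of what lands in Kafka (the field names follow the standard Debezium envelope, but the values and the `source` block here are illustrative and heavily abbreviated; exact contents vary by connector and converter):

```json
{
  "payload": {
    "before": null,
    "after": { "id": 1001, "status": "shipped", "total": 49.99 },
    "source": { "snapshot": "true", "table": "orders" },
    "op": "r",
    "ts_ms": 1700000000000
  }
}
```

Because `before` is always null and `op` is `"r"`, downstream consumers can distinguish backfill rows from live changes if they need to.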
Snapshot Modes Explained
Debezium supports several snapshot modes that control whether a snapshot happens and how much data it includes. The exact mode names vary slightly between Debezium 1.x and 2.x, and between database connectors. Here are the most common ones:
| Mode | Behavior | When to Use |
|---|---|---|
| `initial` (default) | Full snapshot of all tables, then switch to streaming | First deployment when you need all existing data |
| `initial_only` | Full snapshot, then stop (no streaming) | One-time data migration, backfills |
| `schema_only` | Capture table schemas but skip row data, then stream | You only care about new changes from this point forward |
| `when_needed` | Snapshot if no offset is found, otherwise resume streaming | Connector restarts, recovering from lost offsets |
| `never` | No snapshot at all, start streaming immediately | Offset already exists and you are certain it is valid |
The initial mode is what most people start with. It gives you a complete picture: all existing rows plus all future changes. But it is also the slowest to start because it has to read everything.
If you have already loaded historical data through a pg_dump, a Spark job, or any other bulk method, schema_only lets you skip the snapshot entirely and start capturing changes from the current log position. This cuts startup time from hours to seconds.
The never mode is dangerous if you are not careful. If the stored offset is invalid or points to a log position that has already been recycled, the connector will miss data with no warning. Use it only when you are certain the offset exists and is current.
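The mode is set with the snapshot.mode connector property. For example, if you have already bulk-loaded historical data and only want schemas plus future changes:

```json
{
  "snapshot.mode": "schema_only"
}
```

As noted above, mode names shift between versions; some newer releases call this mode no_data, so confirm the exact name against your connector version's documentation.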
What Makes Snapshots Slow
The snapshot phase is essentially a full table scan of every table you are capturing. Several factors determine how long that takes.
Table Locking
Debezium needs a consistent view of the database at a single point in time. How it achieves that depends on the database.
On MySQL, the default behavior uses FLUSH TABLES WITH READ LOCK, which briefly locks the entire database so Debezium can record a consistent binlog position. The lock is released after Debezium captures the position and the table schemas, but on busy databases even a few seconds of global lock can cause connection timeouts and application errors. Setting snapshot.locking.mode=minimal keeps the global lock only for that initial schema-reading step, releasing it before any row data is read.
On PostgreSQL, Debezium uses the database’s MVCC (Multi-Version Concurrency Control) and an exported snapshot, which avoids global locks entirely. PostgreSQL handles this more gracefully, but the SELECT queries still consume significant I/O and memory.
Single-Threaded Reads
By default, Debezium snapshots tables one at a time, sequentially. If you have 200 tables, it will snapshot table 1, then table 2, then table 3, and so on. This is safe but slow. A database with 50 tables where each takes 5 minutes would need over 4 hours just to get through the snapshot phase.
Wide Rows and Large Columns
Row width has a direct effect on snapshot speed. A table with 5 integer columns per row moves over the wire much faster than a table with TEXT, JSONB, or BLOB columns that can hold megabytes per row. Debezium reads the full row by default, including every column, even if your downstream consumer only needs three of them.
Network Bandwidth
The snapshot data travels from the database to Kafka Connect over the network, gets serialized (often as Avro or JSON), and then gets written to Kafka. If Kafka Connect and the database are in different availability zones, the network hop adds latency per batch. For large snapshots, this adds up to hours of extra time.
Database Load
The snapshot’s SELECT queries compete with your production workload for disk I/O, buffer pool, and CPU. If your database is already running at 70% I/O capacity, the snapshot will slow down and so will your application queries.
Tuning Snapshot Performance
There are several connector configuration properties that directly affect snapshot speed.
snapshot.fetch.size
This controls how many rows Debezium fetches per round trip to the database. The default varies by connector (often 2,000 or 10,240 for PostgreSQL). Increasing it reduces the number of network round trips but uses more memory on the Kafka Connect worker.
```json
{
  "snapshot.fetch.size": 10240
}
```
For large tables with narrow rows, try increasing to 20,000 or 50,000. For tables with wide rows (large TEXT or JSONB columns), you might need to decrease it to avoid out-of-memory errors on the Kafka Connect JVM.
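The memory tradeoff is easy to sketch: one fetch batch holds roughly fetch size times average row width bytes per snapshot thread. A back-of-envelope helper (the row-width figures are illustrative assumptions, not Debezium internals):

```python
def fetch_buffer_mb(fetch_size: int, avg_row_bytes: int) -> float:
    """Rough memory held by one in-flight fetch batch, in MiB."""
    return fetch_size * avg_row_bytes / (1024 * 1024)

# Narrow rows (~200 bytes each) tolerate a large fetch size...
print(round(fetch_buffer_mb(50_000, 200), 1))      # 9.5 (MiB)

# ...while wide JSONB rows (~100 KB each) blow up the same-sized batch.
print(round(fetch_buffer_mb(10_240, 100_000), 1))  # 976.6 (MiB)
```

The second figure is close to a gigabyte for a single batch, which is exactly how a "safe-looking" default fetch size produces OutOfMemoryError on wide tables.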
snapshot.max.threads
This allows Debezium to snapshot multiple tables in parallel. The default is 1 (sequential). Setting it higher lets the connector read from several tables at the same time.
```json
{
  "snapshot.max.threads": 4
}
```
A value of 4 means up to 4 tables are being read simultaneously. Be careful with this setting. Each thread holds a database connection and generates I/O. If your database is already under load, 4 parallel snapshot threads could push it over the edge. Start with 2 and increase if the database handles it well.
Note that snapshot.max.threads was introduced in Debezium 1.x and the behavior has evolved across versions. Check your specific version’s documentation, as some versions renamed or restructured this setting.
snapshot.select.statement.overrides
This is one of the most effective but underused settings. It lets you customize the SELECT query used during the snapshot for specific tables. You can filter rows, exclude columns, or add a WHERE clause.
```json
{
  "snapshot.select.statement.overrides": "public.orders",
  "snapshot.select.statement.overrides.public.orders": "SELECT id, customer_id, status, total, created_at FROM public.orders WHERE created_at > '2024-01-01'"
}
```
This is useful when:
- You have a large table but only need recent rows (filter by date)
- The table has large columns you do not need (select specific columns)
- You want to exclude soft-deleted rows (add `WHERE deleted = false`)
The tradeoff: your snapshot will not be a complete copy of the table. If your downstream system expects all historical data, this is not the right approach.
max.batch.size and max.queue.size
These control the internal buffer between Debezium’s snapshot reader and the Kafka producer. If the snapshot is reading faster than Kafka can write, increasing the queue size prevents the reader from blocking.
```json
{
  "max.batch.size": 4096,
  "max.queue.size": 16384
}
```
The default max.queue.size is 8,192 events. If you see snapshot throughput plateau while the database still has spare capacity, increasing the queue can help. Make sure your Kafka Connect worker has enough heap memory to hold the larger queue.
Incremental Snapshots
The traditional (blocking) snapshot has a fundamental problem: it is all or nothing. It runs a series of SELECT queries, publishes the results, and if anything goes wrong, you restart from the beginning. Debezium introduced incremental snapshots (in version 1.6) to solve this.
How Incremental Snapshots Work
Instead of reading an entire table in one long query, incremental snapshots break the table into chunks based on the primary key. Debezium reads one chunk, publishes those events, then moves on to the next chunk. Between chunks, it processes any streaming events from the transaction log.
The process looks like this:
- Read rows where `id BETWEEN 1 AND 1000`
- Publish those 1,000 rows as snapshot events
- Process any pending transaction log events (real-time changes)
- Read rows where `id BETWEEN 1001 AND 2000`
- Repeat until the table is fully read
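The chunking loop can be sketched in a few lines. This is a toy simulation, not Debezium's implementation; in particular it omits the watermark-based deduplication Debezium uses to reconcile a chunk with concurrent changes to the same rows:

```python
from collections import deque

def incremental_snapshot(table, stream, chunk_size):
    """Yield ('r', row) snapshot events chunk by chunk, draining any
    queued ('c'/'u'/'d', ...) stream events between chunks."""
    keys = sorted(table)                   # primary keys, ascending
    for start in range(0, len(keys), chunk_size):
        for pk in keys[start:start + chunk_size]:
            yield ("r", table[pk])         # snapshot "read" event
        while stream:                      # interleave real-time changes
            yield stream.popleft()

# Toy table of 5 rows plus one pending streamed update.
table = {i: {"id": i} for i in range(1, 6)}
stream = deque([("u", {"id": 2, "status": "shipped"})])
events = list(incremental_snapshot(table, stream, chunk_size=2))
# The update surfaces after the first chunk, before the table is done.
```

The point the sketch makes is the interleaving: streamed changes are published between chunks instead of waiting behind one giant table scan.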
This approach has several advantages:
- No long-running locks. Each chunk query is short-lived.
- Resumable. If the connector restarts, it picks up from the last completed chunk, not from the beginning.
- Interleaved with streaming. You get real-time changes even while the snapshot is in progress.
- On-demand. You can trigger a snapshot of a specific table at any time, not just at connector startup.
Setting Up the Signal Table
Incremental snapshots require a signal table in your source database. This is a table that Debezium watches for commands.
Create the signal table:
```sql
CREATE TABLE debezium_signal (
  id VARCHAR(42) PRIMARY KEY,
  type VARCHAR(32) NOT NULL,
  data VARCHAR(2048) NULL
);
```
Configure the connector to use it:
```json
{
  "signal.data.collection": "public.debezium_signal",
  "signal.enabled.channels": "source"
}
```
Make sure the signal table is included in the connector’s table filter. If you are using table.include.list, add public.debezium_signal to the list.
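Putting those pieces together, the relevant fragment of a connector config might look like this (the business table names are illustrative):

```json
{
  "table.include.list": "public.orders,public.customers,public.debezium_signal",
  "signal.data.collection": "public.debezium_signal",
  "signal.enabled.channels": "source"
}
```

Forgetting to list the signal table in table.include.list is a common reason signals appear to be silently ignored.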
Triggering an Ad-Hoc Snapshot
Once the signal table is in place, you can trigger a snapshot of any table at any time by inserting a row:
```sql
INSERT INTO debezium_signal (id, type, data)
VALUES (
  'snapshot-orders-001',
  'execute-snapshot',
  '{"data-collections": ["public.orders"], "type": "incremental"}'
);
```
Debezium picks up this signal and starts an incremental snapshot of the public.orders table. You can snapshot multiple tables at once by adding them to the data-collections array:
```sql
INSERT INTO debezium_signal (id, type, data)
VALUES (
  'snapshot-batch-001',
  'execute-snapshot',
  '{"data-collections": ["public.orders", "public.customers", "public.products"], "type": "incremental"}'
);
```
This is particularly useful when:
- A new table is added to the capture list and needs its initial data loaded
- You suspect data drift between the source and destination and want to re-snapshot a specific table
- A previous snapshot was interrupted and you want to snapshot just the tables that were incomplete
Chunk Size
The default chunk size for incremental snapshots is 1,024 rows. You can adjust it:
```json
{
  "incremental.snapshot.chunk.size": 2048
}
```
Larger chunks mean fewer round trips but longer individual queries. For tables with narrow rows, increase the chunk size. For tables with wide rows or heavy concurrent writes, keep it smaller to avoid contention.
Monitoring Snapshot Progress
A snapshot that takes 12 hours is manageable if you know it is progressing. A snapshot that might be stuck is a much bigger problem. Debezium exposes several metrics that help you track where things stand.
JMX Metrics
Debezium publishes snapshot metrics via JMX under the following object name pattern:
```
debezium.{connector-type}:type=connector-metrics,server={server-name},task={task-id},context=snapshot
```
For a PostgreSQL connector named my-pg-connector:
```
debezium.postgres:type=connector-metrics,server=my-pg-connector,task=0,context=snapshot
```
Key metrics to watch:
| Metric | What It Tells You |
|---|---|
| `SnapshotRunning` | Boolean. Is the snapshot currently in progress? |
| `SnapshotCompleted` | Boolean. Has the snapshot finished? |
| `TotalNumberOfEventsSeen` | Total rows read so far across all tables |
| `NumberOfEventsFiltered` | Rows skipped by filters |
| `RemainingTableCount` | Tables left to snapshot |
| `RowsScanned` | Map of table name to rows scanned per table |
| `SnapshotDurationInSeconds` | Elapsed time since the snapshot started |
Estimating Time Remaining
If you know the total row count and the current throughput, you can estimate remaining time:
```sql
-- Get row counts for all tables being captured
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
WHERE relname IN ('orders', 'customers', 'products', 'events')
ORDER BY n_live_tup DESC;
```
Compare n_live_tup totals against TotalNumberOfEventsSeen from the JMX metrics. If Debezium has processed 10 million of 50 million rows in 2 hours, you are looking at roughly 8 more hours at the current rate.
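That extrapolation is simple enough to script against whatever you use to scrape the JMX metrics. The helper below just assumes throughput stays constant, which real snapshots only approximate:

```python
def snapshot_eta_hours(rows_done: int, rows_total: int, elapsed_hours: float) -> float:
    """Estimate remaining snapshot hours by extrapolating current throughput."""
    if rows_done <= 0:
        raise ValueError("no progress yet; cannot extrapolate")
    rate = rows_done / elapsed_hours          # rows per hour so far
    return (rows_total - rows_done) / rate

# The example above: 10M of 50M rows in 2 hours -> 8 more hours.
print(snapshot_eta_hours(10_000_000, 50_000_000, 2))  # 8.0
```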
Kafka Connect REST API
You can check connector status through the Kafka Connect REST API:
```shell
# Check connector status
curl -s http://localhost:8083/connectors/my-pg-connector/status | jq .

# Check task status (task 0)
curl -s http://localhost:8083/connectors/my-pg-connector/tasks/0/status | jq .
```
The response includes the connector state (RUNNING, PAUSED, FAILED) and any error messages. During a snapshot, the connector shows as RUNNING even though it has not started streaming yet; the REST API does not distinguish the two phases. Check the JMX snapshot metrics or the connector logs to tell whether you are still in the snapshot phase.
Logging
Debezium logs snapshot progress at the INFO level. Look for messages like:
```
Snapshotting contents of 12 tables
Exporting data from table 'public.orders' (3 of 12 tables)
Finished exporting 2,450,000 records for table 'public.orders'; total duration '00:12:34.567'
```
If you are running in a containerized environment, make sure Debezium’s log level is set to at least INFO for the snapshot namespace:
```properties
log4j.logger.io.debezium.connector.postgresql.snapshot=INFO
log4j.logger.io.debezium.relational.RelationalSnapshotChangeEventSource=INFO
```
Planning Your Snapshot Strategy
The right snapshot approach depends on your data volume and your tolerance for downtime.
For small databases (under 10GB), the default initial mode with a few tuning parameters is usually fine. Set snapshot.fetch.size to 10,000 or higher, make sure your Kafka Connect worker has enough heap, and let it run.
For medium databases (10-100GB), start tuning seriously. Increase snapshot.max.threads to 2-4, filter out unnecessary columns with snapshot.select.statement.overrides, and schedule the snapshot during off-peak hours to minimize impact on your production database.
For large databases (100GB+), incremental snapshots are the way to go. They are resumable, they do not block streaming, and they can be triggered per-table. Set up the signal table, configure a reasonable chunk size, and monitor progress through JMX metrics.
Whatever your database size, test the snapshot in a staging environment first. Measure how long it takes, watch the impact on database metrics (CPU, I/O, replication lag), and verify that the data arrives correctly in your target system before running it against production.