Debezium Initial Snapshot: Strategies to Speed It Up
Debezium's initial snapshot can take hours or days on large databases. Learn about snapshot modes, performance bottlenecks, and practical strategies to get through the snapshot phase faster.
When you first start a Debezium connector, it does not jump straight to reading the transaction log. It reads the entire contents of every table you have configured for capture. This is the initial snapshot, and on large databases, it is the single biggest obstacle to getting CDC running in production.
A 50-million-row orders table does not snapshot in seconds. It can take hours. If the connector crashes mid-snapshot, you might have to start over. If the database is under heavy production load, the snapshot competes for I/O and memory with your application queries.
This guide covers what is happening during the snapshot, why it is slow, and what you can do about it.
Why Snapshots Exist
A database transaction log (the WAL in PostgreSQL, the binlog in MySQL) is not a complete record of your data. It is a record of recent changes. The log gets recycled as segments age out or checkpoints advance. If you started reading the transaction log right now, you would get INSERT, UPDATE, and DELETE events for new activity, but you would have no idea what already existed in the database before you started listening.
The snapshot fills that gap. Think of it like this: you want to record a meeting, but you arrived ten minutes late. The snapshot is the summary someone hands you of what was already discussed. The transaction log is the live recording going forward. Without that summary, you are missing context.
Debezium reads every row from every captured table via standard SELECT queries, publishes each row as a “read” event (with op: "r") to Kafka, and then switches to streaming mode to capture ongoing changes.
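A snapshot row arrives in the same change-event envelope as streamed changes, just with `op` set to `"r"`. A trimmed sketch of what lands in Kafka (the field names follow the standard Debezium envelope, but the values and the `source` block here are illustrative and heavily abbreviated; exact contents vary by connector and converter):

```json
{
  "payload": {
    "before": null,
    "after": { "id": 1001, "status": "shipped", "total": 49.99 },
    "source": { "snapshot": "true", "table": "orders" },
    "op": "r",
    "ts_ms": 1700000000000
  }
}
```

Because `before` is always null and `op` is `"r"`, downstream consumers can distinguish backfill rows from live changes if they need to.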
Snapshot Modes Explained
Debezium supports several snapshot modes that control whether a snapshot happens and how much data it includes. The exact mode names vary slightly between Debezium 1.x and 2.x, and between database connectors. Here are the most common ones:
| Mode | Behavior | When to Use |
|---|---|---|
| `initial` (default) | Full snapshot of all tables, then switch to streaming | First deployment when you need all existing data |
| `initial_only` | Full snapshot, then stop (no streaming) | One-time data migration, backfills |
| `schema_only` | Capture table schemas but skip row data, then stream | You only care about new changes from this point forward |
| `when_needed` | Snapshot if no offset is found, otherwise resume streaming | Connector restarts, recovering from lost offsets |
| `never` | No snapshot at all, start streaming immediately | Offset already exists and you are certain it is valid |
The initial mode is what most people start with. It gives you a complete picture: all existing rows plus all future changes. But it is also the slowest to start because it has to read everything.
If you have already loaded historical data through a pg_dump, a Spark job, or any other bulk method, schema_only lets you skip the snapshot entirely and start capturing changes from the current log position. This cuts startup time from hours to seconds.
The never mode is dangerous if you are not careful. If the stored offset is invalid or points to a log position that has already been recycled, the connector will miss data with no warning. Use it only when you are certain the offset exists and is current.
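The mode is set with the snapshot.mode connector property. For example, if you have already bulk-loaded historical data and only want schemas plus future changes:

```json
{
  "snapshot.mode": "schema_only"
}
```

As noted above, mode names shift between versions; some newer releases call this mode no_data, so confirm the exact name against your connector version's documentation.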
What Makes Snapshots Slow
The snapshot phase is essentially a full table scan of every table you are capturing. Several factors determine how long that takes.
Table Locking
Debezium needs a consistent view of the database at a single point in time. How it achieves that depends on the database.
On MySQL, the default behavior uses FLUSH TABLES WITH READ LOCK, which briefly locks the entire database so Debezium can record a consistent binlog position. The lock is released after Debezium captures the position and the table schemas, but on busy databases even a few seconds of global lock can cause connection timeouts and application errors. Setting snapshot.locking.mode=minimal keeps the global lock only for that initial schema-reading step, releasing it before any row data is read.
On PostgreSQL, Debezium uses the database’s MVCC (Multi-Version Concurrency Control) and an exported snapshot, which avoids global locks entirely. PostgreSQL handles this more gracefully, but the SELECT queries still consume significant I/O and memory.
Single-Threaded Reads
By default, Debezium snapshots tables one at a time, sequentially. If you have 200 tables, it will snapshot table 1, then table 2, then table 3, and so on. This is safe but slow. A database with 50 tables where each takes 5 minutes would need over 4 hours just to get through the snapshot phase.
Wide Rows and Large Columns
Row width has a direct effect on snapshot speed. A table with 5 integer columns per row moves over the wire much faster than a table with TEXT, JSONB, or BLOB columns that can hold megabytes per row. Debezium reads the full row by default, including every column, even if your downstream consumer only needs three of them.
Network Bandwidth
The snapshot data travels from the database to Kafka Connect over the network, gets serialized (often as Avro or JSON), and then gets written to Kafka. If Kafka Connect and the database are in different availability zones, the network hop adds latency per batch. For large snapshots, this adds up to hours of extra time.
Database Load
The snapshot’s SELECT queries compete with your production workload for disk I/O, buffer pool, and CPU. If your database is already running at 70% I/O capacity, the snapshot will slow down and so will your application queries.
Tuning Snapshot Performance
There are several connector configuration properties that directly affect snapshot speed.
snapshot.fetch.size
This controls how many rows Debezium fetches per round trip to the database. The default varies by connector (often 2,000 or 10,240 for PostgreSQL). Increasing it reduces the number of network round trips but uses more memory on the Kafka Connect worker.
```json
{
  "snapshot.fetch.size": 10240
}
```
For large tables with narrow rows, try increasing to 20,000 or 50,000. For tables with wide rows (large TEXT or JSONB columns), you might need to decrease it to avoid out-of-memory errors on the Kafka Connect JVM.
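The memory tradeoff is easy to sketch: one fetch batch holds roughly fetch size times average row width bytes per snapshot thread. A back-of-envelope helper (the row-width figures are illustrative assumptions, not Debezium internals):

```python
def fetch_buffer_mb(fetch_size: int, avg_row_bytes: int) -> float:
    """Rough memory held by one in-flight fetch batch, in MiB."""
    return fetch_size * avg_row_bytes / (1024 * 1024)

# Narrow rows (~200 bytes each) tolerate a large fetch size...
print(round(fetch_buffer_mb(50_000, 200), 1))      # 9.5 (MiB)

# ...while wide JSONB rows (~100 KB each) blow up the same-sized batch.
print(round(fetch_buffer_mb(10_240, 100_000), 1))  # 976.6 (MiB)
```

The second figure is close to a gigabyte for a single batch, which is exactly how a "safe-looking" default fetch size produces OutOfMemoryError on wide tables.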
snapshot.max.threads
This allows Debezium to snapshot multiple tables in parallel. The default is 1 (sequential). Setting it higher lets the connector read from several tables at the same time.
```json
{
  "snapshot.max.threads": 4
}
```
A value of 4 means up to 4 tables are being read simultaneously. Be careful with this setting. Each thread holds a database connection and generates I/O. If your database is already under load, 4 parallel snapshot threads could push it over the edge. Start with 2 and increase if the database handles it well.
Note that snapshot.max.threads was introduced in Debezium 1.x and the behavior has evolved across versions. Check your specific version’s documentation, as some versions renamed or restructured this setting.
snapshot.select.statement.overrides
This is one of the most effective but underused settings. It lets you customize the SELECT query used during the snapshot for specific tables. You can filter rows, exclude columns, or add a WHERE clause.
```json
{
  "snapshot.select.statement.overrides": "public.orders",
  "snapshot.select.statement.overrides.public.orders": "SELECT id, customer_id, status, total, created_at FROM public.orders WHERE created_at > '2024-01-01'"
}
```
This is useful when:
- You have a large table but only need recent rows (filter by date)
- The table has large columns you do not need (select specific columns)
- You want to exclude soft-deleted rows (add `WHERE deleted = false`)
The tradeoff: your snapshot will not be a complete copy of the table. If your downstream system expects all historical data, this is not the right approach.
max.batch.size and max.queue.size
These control the internal buffer between Debezium’s snapshot reader and the Kafka producer. If the snapshot is reading faster than Kafka can write, increasing the queue size prevents the reader from blocking.
```json
{
  "max.batch.size": 4096,
  "max.queue.size": 16384
}
```
The default max.queue.size is 8,192 events. If you see snapshot throughput plateau while the database still has spare capacity, increasing the queue can help. Make sure your Kafka Connect worker has enough heap memory to hold the larger queue.
Incremental Snapshots
The traditional (blocking) snapshot has a fundamental problem: it is all or nothing. It runs a series of SELECT queries, publishes the results, and if anything goes wrong, you restart from the beginning. Debezium introduced incremental snapshots (in version 1.6) to solve this.
How Incremental Snapshots Work
Instead of reading an entire table in one long query, incremental snapshots break the table into chunks based on the primary key. Debezium reads one chunk, publishes those events, then moves on to the next chunk. Between chunks, it processes any streaming events from the transaction log.
The process looks like this:
- Read rows where `id BETWEEN 1 AND 1000`
- Publish those 1,000 rows as snapshot events
- Process any pending transaction log events (real-time changes)
- Read rows where `id BETWEEN 1001 AND 2000`
- Repeat until the table is fully read
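The chunking loop can be sketched in a few lines. This is a toy simulation, not Debezium's implementation; in particular it omits the watermark-based deduplication Debezium uses to reconcile a chunk with concurrent changes to the same rows:

```python
from collections import deque

def incremental_snapshot(table, stream, chunk_size):
    """Yield ('r', row) snapshot events chunk by chunk, draining any
    queued ('c'/'u'/'d', ...) stream events between chunks."""
    keys = sorted(table)                   # primary keys, ascending
    for start in range(0, len(keys), chunk_size):
        for pk in keys[start:start + chunk_size]:
            yield ("r", table[pk])         # snapshot "read" event
        while stream:                      # interleave real-time changes
            yield stream.popleft()

# Toy table of 5 rows plus one pending streamed update.
table = {i: {"id": i} for i in range(1, 6)}
stream = deque([("u", {"id": 2, "status": "shipped"})])
events = list(incremental_snapshot(table, stream, chunk_size=2))
# The update surfaces after the first chunk, before the table is done.
```

The point the sketch makes is the interleaving: streamed changes are published between chunks instead of waiting behind one giant table scan.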
This approach has several advantages:
- No long-running locks. Each chunk query is short-lived.
- Resumable. If the connector restarts, it picks up from the last completed chunk, not from the beginning.
- Interleaved with streaming. You get real-time changes even while the snapshot is in progress.
- On-demand. You can trigger a snapshot of a specific table at any time, not just at connector startup.
Setting Up the Signal Table
Incremental snapshots require a signal table in your source database. This is a table that Debezium watches for commands.
Create the signal table:
```sql
CREATE TABLE debezium_signal (
  id VARCHAR(42) PRIMARY KEY,
  type VARCHAR(32) NOT NULL,
  data VARCHAR(2048) NULL
);
```
Configure the connector to use it:
```json
{
  "signal.data.collection": "public.debezium_signal",
  "signal.enabled.channels": "source"
}
```
Make sure the signal table is included in the connector’s table filter. If you are using table.include.list, add public.debezium_signal to the list.
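Putting those pieces together, the relevant fragment of a connector config might look like this (the business table names are illustrative):

```json
{
  "table.include.list": "public.orders,public.customers,public.debezium_signal",
  "signal.data.collection": "public.debezium_signal",
  "signal.enabled.channels": "source"
}
```

Forgetting to list the signal table in table.include.list is a common reason signals appear to be silently ignored.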
Triggering an Ad-Hoc Snapshot
Once the signal table is in place, you can trigger a snapshot of any table at any time by inserting a row:
```sql
INSERT INTO debezium_signal (id, type, data)
VALUES (
  'snapshot-orders-001',
  'execute-snapshot',
  '{"data-collections": ["public.orders"], "type": "incremental"}'
);
```
Debezium picks up this signal and starts an incremental snapshot of the public.orders table. You can snapshot multiple tables at once by adding them to the data-collections array:
```sql
INSERT INTO debezium_signal (id, type, data)
VALUES (
  'snapshot-batch-001',
  'execute-snapshot',
  '{"data-collections": ["public.orders", "public.customers", "public.products"], "type": "incremental"}'
);
```
This is particularly useful when:
- A new table is added to the capture list and needs its initial data loaded
- You suspect data drift between the source and destination and want to re-snapshot a specific table
- A previous snapshot was interrupted and you want to snapshot just the tables that were incomplete
Chunk Size
The default chunk size for incremental snapshots is 1,024 rows. You can adjust it:
```json
{
  "incremental.snapshot.chunk.size": 2048
}
```
Larger chunks mean fewer round trips but longer individual queries. For tables with narrow rows, increase the chunk size. For tables with wide rows or heavy concurrent writes, keep it smaller to avoid contention.
Monitoring Snapshot Progress
A snapshot that takes 12 hours is manageable if you know it is progressing. A snapshot that might be stuck is a much bigger problem. Debezium exposes several metrics that help you track where things stand.
JMX Metrics
Debezium publishes snapshot metrics via JMX under the following object name pattern:
```
debezium.{connector-type}:type=connector-metrics,server={server-name},task={task-id},context=snapshot
```
For a PostgreSQL connector named my-pg-connector:
```
debezium.postgres:type=connector-metrics,server=my-pg-connector,task=0,context=snapshot
```
Key metrics to watch:
| Metric | What It Tells You |
|---|---|
| `SnapshotRunning` | Boolean. Is the snapshot currently in progress? |
| `SnapshotCompleted` | Boolean. Has the snapshot finished? |
| `TotalNumberOfEventsSeen` | Total rows read so far across all tables |
| `NumberOfEventsFiltered` | Rows skipped by filters |
| `RemainingTableCount` | Tables left to snapshot |
| `RowsScanned` | Map of table name to rows scanned per table |
| `SnapshotDurationInSeconds` | Elapsed time since the snapshot started |
Estimating Time Remaining
If you know the total row count and the current throughput, you can estimate remaining time:
```sql
-- Get row counts for all tables being captured
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
WHERE relname IN ('orders', 'customers', 'products', 'events')
ORDER BY n_live_tup DESC;
```
Compare n_live_tup totals against TotalNumberOfEventsSeen from the JMX metrics. If Debezium has processed 10 million of 50 million rows in 2 hours, you are looking at roughly 8 more hours at the current rate.
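That extrapolation is simple enough to script against whatever you use to scrape the JMX metrics. The helper below just assumes throughput stays constant, which real snapshots only approximate:

```python
def snapshot_eta_hours(rows_done: int, rows_total: int, elapsed_hours: float) -> float:
    """Estimate remaining snapshot hours by extrapolating current throughput."""
    if rows_done <= 0:
        raise ValueError("no progress yet; cannot extrapolate")
    rate = rows_done / elapsed_hours          # rows per hour so far
    return (rows_total - rows_done) / rate

# The example above: 10M of 50M rows in 2 hours -> 8 more hours.
print(snapshot_eta_hours(10_000_000, 50_000_000, 2))  # 8.0
```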
Kafka Connect REST API
You can check connector status through the Kafka Connect REST API:
```shell
# Check connector status
curl -s http://localhost:8083/connectors/my-pg-connector/status | jq .

# Check task status (task 0)
curl -s http://localhost:8083/connectors/my-pg-connector/tasks/0/status | jq .
```
The response includes the connector state (RUNNING, PAUSED, FAILED) and any error messages. During a snapshot, the connector shows as RUNNING even though it has not started streaming yet; the REST API does not distinguish the two phases. Check the JMX snapshot metrics or the connector logs to tell whether you are still in the snapshot phase.
Logging
Debezium logs snapshot progress at the INFO level. Look for messages like:
```
Snapshotting contents of 12 tables
Exporting data from table 'public.orders' (3 of 12 tables)
Finished exporting 2,450,000 records for table 'public.orders'; total duration '00:12:34.567'
```
If you are running in a containerized environment, make sure Debezium’s log level is set to at least INFO for the snapshot namespace:
```properties
log4j.logger.io.debezium.connector.postgresql.snapshot=INFO
log4j.logger.io.debezium.relational.RelationalSnapshotChangeEventSource=INFO
```
Planning Your Snapshot Strategy
The right snapshot approach depends on your data volume and your tolerance for downtime.
For small databases (under 10GB), the default initial mode with a few tuning parameters is usually fine. Set snapshot.fetch.size to 10,000 or higher, make sure your Kafka Connect worker has enough heap, and let it run.
For medium databases (10-100GB), start tuning seriously. Increase snapshot.max.threads to 2-4, filter out unnecessary columns with snapshot.select.statement.overrides, and schedule the snapshot during off-peak hours to minimize impact on your production database.
For large databases (100GB+), incremental snapshots are the way to go. They are resumable, they do not block streaming, and they can be triggered per-table. Set up the signal table, configure a reasonable chunk size, and monitor progress through JMX metrics.
Whatever your database size, test the snapshot in a staging environment first. Measure how long it takes, watch the impact on database metrics (CPU, I/O, replication lag), and verify that the data arrives correctly in your target system before running it against production.