CDC

February 26, 2026

Debezium PostgreSQL Replication Slot Issues: Causes and Fixes

Replication slot bloat is the most dangerous failure mode in Debezium CDC pipelines. Learn why slots grow, how to monitor them, and what to do when WAL files fill your disk.

TL;DR: Debezium uses PostgreSQL logical replication slots to track CDC position. When Debezium stops consuming (connector crash, restart, network issue), the slot keeps WAL segments from being recycled. This causes WAL to grow until it fills the disk, potentially crashing your database. Monitor slot lag with pg_replication_slots, set max_slot_wal_keep_size as a safety net, and have a runbook for emergency slot cleanup.

If you run Debezium against PostgreSQL, you are running logical replication slots. If you have not yet had a replication slot incident, you either have excellent monitoring or you have not been running long enough. Slot bloat is the single most common way that CDC pipelines take down production databases, and it catches teams off guard because the failure happens on the database side, not the CDC side.

This guide covers how replication slots work in the context of Debezium, what goes wrong, how to detect problems before they become outages, and what to do when you are already in trouble.

How Debezium Uses Replication Slots

PostgreSQL’s logical decoding system allows external consumers to read changes from the write-ahead log (WAL) in a structured format. Instead of raw binary WAL records, logical decoding translates changes into a representation that includes table names, column values, and operation types (INSERT, UPDATE, DELETE).

Debezium creates one logical replication slot per connector. When you deploy a Debezium PostgreSQL connector, it issues something equivalent to:

SELECT pg_create_logical_replication_slot('debezium', 'pgoutput');

The slot name typically matches the slot.name configuration property in your connector config. The slot serves two purposes:

  1. Position tracking. The slot records the LSN (Log Sequence Number) that the consumer has confirmed reading up to. This is the confirmed_flush_lsn in the pg_replication_slots view.
  2. WAL retention guarantee. PostgreSQL will not delete any WAL segment that contains data at or after the slot’s restart_lsn. This guarantees the consumer will never miss changes, even after a disconnect.

That retention guarantee is exactly what makes slots useful for CDC. It is also what makes them dangerous.

Under normal operation, Debezium reads changes from the slot, converts them to Kafka records, writes them to Kafka, and then advances the slot’s confirmed position. PostgreSQL sees the position advance and frees the corresponding WAL segments. The cycle is continuous: WAL is generated, consumed, and reclaimed.

The problem starts when that cycle breaks.

The Failure Mode: Unbounded WAL Growth

When Debezium stops consuming from a replication slot, the slot’s position stops advancing. PostgreSQL continues generating WAL from all write operations across the entire database (not just the tables Debezium is tracking). Every INSERT, UPDATE, DELETE, and even autovacuum activity produces WAL. But none of it can be reclaimed, because the slot says “I haven’t read past this point yet.”

The WAL growth rate depends entirely on your database’s write traffic. A lightly loaded database might generate a few megabytes per hour. A busy OLTP system doing thousands of transactions per second can produce 20 to 50 GB of WAL per hour. On high-throughput systems, you can go from “everything is fine” to “disk is 95% full” in under two hours.
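The arithmetic behind that lead time is worth making explicit. A rough sketch (the rates and disk sizes here are illustrative, not measured from any particular system):

```python
def hours_until_full(free_disk_gb: float, wal_rate_gb_per_hour: float) -> float:
    """Rough lead time before retained WAL fills the remaining disk.

    Assumes the slot has stopped advancing, so every byte of WAL
    generated from now on is retained until someone intervenes.
    """
    if wal_rate_gb_per_hour <= 0:
        raise ValueError("WAL rate must be positive")
    return free_disk_gb / wal_rate_gb_per_hour

# A busy OLTP system: 100 GB free, 50 GB/h of WAL -> 2 hours to outage.
print(hours_until_full(100, 50))  # 2.0
# A quieter system: 100 GB free, 2 GB/h -> about two days of slack.
print(hours_until_full(100, 2))   # 50.0
```

Knowing your own WAL generation rate (sample pg_current_wal_lsn() at intervals) turns this from a guess into a concrete incident budget.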

When the disk fills up, PostgreSQL stops accepting writes. This is not a CDC problem at that point. It is a full database outage. Every application connected to that database starts failing. The irony is painful: a CDC connector that was supposed to be a passive reader of the database has now taken down the database itself.

The failure sequence looks like this:

  1. Debezium connector crashes or loses connectivity
  2. The replication slot stops advancing
  3. WAL accumulates on disk
  4. Disk usage climbs (often unmonitored until it is too late)
  5. PostgreSQL runs out of disk space
  6. Database rejects all write operations
  7. Application-level outages begin

Each step happens silently. PostgreSQL does not raise alarms about an inactive replication slot by default. You need to build that monitoring yourself.

Common Scenarios That Cause Slot Bloat

Replication slot bloat is always caused by the same root issue: the consumer stopped reading, but the slot stayed open. The reasons the consumer stops vary.

Connector crash without auto-restart. Kafka Connect does not automatically restart failed connector tasks by default. If a task enters the FAILED state, it sits there until someone manually restarts it via the REST API. Without monitoring on task status, the failure can go unnoticed for hours. Meanwhile, WAL piles up.
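This failure mode is straightforward to automate around: a watchdog only needs the Connect REST status payload to spot FAILED tasks. A minimal Python sketch of the decision logic (the payload shape follows the Connect REST API `GET /connectors/<name>/status` response; the HTTP polling and restart calls are left out):

```python
def failed_tasks(status: dict) -> list:
    """Given a Kafka Connect connector status payload, return the ids
    of tasks in FAILED state so a watchdog can restart them via
    POST /connectors/<name>/tasks/<id>/restart.
    """
    return [t["id"] for t in status.get("tasks", []) if t["state"] == "FAILED"]

# Payload shape per the Connect REST API status response.
status = {
    "name": "my-pg-connector",
    "connector": {"state": "RUNNING"},
    "tasks": [
        {"id": 0, "state": "FAILED"},
        {"id": 1, "state": "RUNNING"},
    ],
}
print(failed_tasks(status))  # [0]
```

Run something like this on a schedule and you close the gap between "task failed" and "someone noticed" from hours to minutes.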

Kafka Connect worker OOM. Debezium connectors run inside Kafka Connect JVM workers. If a worker runs out of heap memory (common during large snapshots or when processing wide tables), the entire worker process dies. All connectors on that worker stop consuming. If the worker does not come back quickly, every replication slot associated with those connectors begins accumulating WAL.

Long-running transactions. This one is subtle. PostgreSQL’s logical decoding cannot skip past an open transaction. If any transaction on the database stays open for a long time (a migration, a poorly written batch job, an abandoned BEGIN in a psql session), logical decoding stalls at the LSN where that transaction started. Debezium may still be connected and “active,” but the slot’s restart_lsn cannot advance past the open transaction. WAL accumulates even though the connector appears healthy.

You can identify this with:

SELECT pid, state, xact_start, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY duration DESC;

Schema changes causing connector failure. Adding or dropping columns, changing data types, or renaming tables can cause Debezium to throw deserialization errors or enter a failed state. If the connector cannot handle the schema change, it stops consuming. The slot keeps accumulating.

Network issues between Kafka Connect and PostgreSQL. If the network connection between the Kafka Connect worker and PostgreSQL drops, the replication connection breaks. Debezium will attempt to reconnect, but depending on the connect.timeout.ms and retry configuration, it may give up and transition to FAILED. Again, the slot remains, and WAL grows.

Debezium version upgrades. Upgrading the Debezium connector plugin on Kafka Connect typically requires restarting workers. If the upgrade introduces a compatibility issue or the new version fails to start, the gap between “old connector stopped” and “new connector running” is a window for WAL accumulation. On a busy database, even 30 minutes of downtime can generate significant WAL.

Monitoring Replication Slots

The pg_replication_slots system view is your primary tool. Every production PostgreSQL database running CDC should have alerts based on this view.

Basic Slot Status

SELECT
  slot_name,
  plugin,
  slot_type,
  active,
  active_pid,
  restart_lsn,
  confirmed_flush_lsn
FROM pg_replication_slots;

Key columns:

  • slot_name: The name Debezium assigned to the slot (matches slot.name in connector config)
  • active: Whether a consumer is currently connected. false is an immediate red flag for CDC slots.
  • active_pid: The PID of the connected process. NULL if inactive.
  • restart_lsn: The oldest WAL position the slot needs. This is where PostgreSQL starts retaining WAL.
  • confirmed_flush_lsn: The position the consumer has confirmed reading up to.

Calculating Slot Lag

The gap between the current WAL position and the slot’s position tells you how much WAL is being retained:

SELECT
  slot_name,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS flush_lag
FROM pg_replication_slots
WHERE slot_type = 'logical';

The retained_wal column shows how much WAL PostgreSQL is holding because of this slot. If this number is growing and the slot is inactive, you have a problem that is getting worse by the minute.
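pg_wal_lsn_diff does this computation server-side, but if your monitoring agent only collects raw LSN strings, you can compute the same byte difference yourself. An LSN like 16/B374D848 is a 64-bit position written as two hex halves; a small sketch:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a pg_lsn string like '16/B374D848' to an absolute
    byte position: <high 32 bits>/<low 32 bits>, both hexadecimal.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lsn_diff(current: str, restart: str) -> int:
    """Same result as pg_wal_lsn_diff(current, restart): retained bytes."""
    return lsn_to_bytes(current) - lsn_to_bytes(restart)

# Example: a slot stuck exactly 1 GiB behind the current WAL position.
lag = lsn_diff("16/F374D848", "16/B374D848")
print(lag)  # 1073741824
```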

WAL Directory Size

You can also check WAL disk usage directly:

SELECT pg_size_pretty(sum(size)) AS total_wal_size
FROM pg_ls_waldir();

Or from the command line:

du -sh /var/lib/postgresql/data/pg_wal/

Monitoring Queries for Alerting

Set up alerts on these conditions:

Any inactive CDC slot:

SELECT slot_name, restart_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical'
  AND active = false;

Alert if this returns any rows.

Slot lag exceeding a threshold:

SELECT slot_name,
  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical'
  AND pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1073741824; -- 1 GB

Adjust the threshold based on your disk capacity and write throughput. For a database generating 10 GB/hour of WAL with 100 GB of free disk, a 1 GB threshold gives you roughly 10 hours of warning.

Slot lag growth rate. Run the lag query at intervals and compare. If slot lag is increasing between checks, the consumer is either disconnected or falling behind. Both situations need attention.
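The trend computation itself is trivial once you store timestamped samples of the lag query. A sketch, assuming samples are (unix_timestamp, lag_bytes) pairs collected by whatever scheduler you already run:

```python
def lag_trend(samples: list) -> float:
    """Average lag growth in bytes/second across a series of
    (unix_timestamp, lag_bytes) samples from the slot-lag query.

    Positive -> the consumer is disconnected or falling behind.
    Negative -> the connector is draining its backlog.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, lag0), (t1, lag1) = samples[0], samples[-1]
    return (lag1 - lag0) / (t1 - t0)

# Lag grew from 1 GiB to 3 GiB in one hour: roughly 0.6 MB/s and climbing.
rate = lag_trend([(0, 1_073_741_824), (3600, 3_221_225_472)])
print(rate > 0)  # True: page someone
```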

Most monitoring tools (Datadog, Prometheus with postgres_exporter, CloudWatch for RDS) can run these queries on a schedule and trigger alerts. If you are running CDC in production and do not have slot lag alerting, stop reading this article and set it up now.

Emergency Recovery: When the Disk Is Full

If you are here because your disk is already full or nearly full, follow these steps in order.

Step 1: Identify the Problem Slot

SELECT
  slot_name,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

The slot retaining the most WAL is your problem. Confirm it belongs to Debezium (the plugin column will show pgoutput or decoderbufs).

Step 2: Try to Restart the Connector

If the Debezium connector simply crashed and can be restarted, that is the best outcome. It will resume consuming from its last position, and the slot lag will drain without data loss.

# Check connector status
curl -s http://kafka-connect:8083/connectors/my-pg-connector/status | jq .

# Restart the connector
curl -X POST http://kafka-connect:8083/connectors/my-pg-connector/restart

# Restart a specific failed task
curl -X POST http://kafka-connect:8083/connectors/my-pg-connector/tasks/0/restart

Monitor the slot lag after restart. If retained_wal starts decreasing, the connector is catching up and you are in the clear.
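Whether the connector catches up at all depends on a simple race: it must consume faster than the database generates new WAL. A back-of-the-envelope ETA (hypothetical rates; measure your own from the lag trend):

```python
def catchup_eta_hours(retained_gb: float,
                      consume_gb_per_hour: float,
                      generate_gb_per_hour: float) -> float:
    """Hours until the slot lag drains to zero after a connector restart.

    The backlog only shrinks if the connector's consume rate exceeds the
    database's WAL generation rate; otherwise the lag grows without bound.
    """
    net = consume_gb_per_hour - generate_gb_per_hour
    if net <= 0:
        return float("inf")  # never catches up at these rates
    return retained_gb / net

# 40 GB retained, consuming 30 GB/h while the DB writes 10 GB/h -> 2 hours.
print(catchup_eta_hours(40, 30, 10))  # 2.0
```

If the ETA comes out infinite or longer than your remaining disk runway, skip ahead to dropping the slot.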

Step 3: Drop the Slot (Last Resort)

If the connector cannot be restarted (Kafka Connect is down, there is a configuration issue, the connector needs debugging), and disk space is critically low, drop the slot:

-- Check that the slot is inactive first
SELECT slot_name, active FROM pg_replication_slots WHERE slot_name = 'debezium';

-- Drop the slot (only if active = false)
SELECT pg_drop_replication_slot('debezium');

If the slot shows active = true but the connector is not actually consuming (zombie connection), you may need to terminate the backend process first:

-- Find and terminate the zombie connection
SELECT pg_terminate_backend(active_pid)
FROM pg_replication_slots
WHERE slot_name = 'debezium' AND active = true;

-- Then drop the slot
SELECT pg_drop_replication_slot('debezium');

Dropping the slot immediately releases all retained WAL. PostgreSQL will reclaim the disk space as it checkpoints.

Step 4: Plan the Re-Snapshot

After dropping the slot, Debezium has lost its position in the change stream. When you redeploy the connector, it will need to perform a full initial snapshot of all tracked tables. For large databases, this can take hours or even days.

Plan for this:

  • Schedule the re-snapshot during low-traffic periods
  • Consider using snapshot.mode=initial with snapshot.select.statement.overrides, which replaces the SELECT Debezium runs against a table during the snapshot, to limit the rows captured from very large tables
  • Monitor disk space during the snapshot phase, since the new slot retains all WAL generated while the snapshot runs

Prevention Strategies

Set max_slot_wal_keep_size (PostgreSQL 13+)

This is the single most important safety net. It sets an upper bound on how much WAL a replication slot can retain:

# postgresql.conf
max_slot_wal_keep_size = 50GB

When a slot’s retained WAL exceeds this limit, PostgreSQL invalidates the slot and allows WAL to be reclaimed. The downside is that the slot loses its position and the connector will need a re-snapshot. But a forced re-snapshot is far better than a full database outage.
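Picking the value is a disk-budget question. One rough heuristic, purely illustrative arithmetic (tune the headroom against your own growth rates and checkpoint behavior):

```python
def suggested_max_slot_wal_keep_gb(free_disk_gb: float,
                                   headroom_gb: float = 20.0) -> float:
    """A rough starting point for max_slot_wal_keep_size: let slots
    retain most of the free disk, but reserve headroom so the volume
    never actually fills even if a slot hits the limit.
    """
    if free_disk_gb <= headroom_gb:
        raise ValueError("not enough free disk for a meaningful limit")
    return free_disk_gb - headroom_gb

# 100 GB free, 20 GB headroom -> set max_slot_wal_keep_size = 80GB.
print(suggested_max_slot_wal_keep_gb(100))  # 80.0
```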

After setting this parameter, you can verify it is working:

SELECT slot_name, wal_status
FROM pg_replication_slots;

If wal_status shows lost, the slot was invalidated due to exceeding the WAL limit. The connector will need to be recreated with a fresh snapshot.

On AWS RDS and Aurora, set this through a parameter group. On Google Cloud SQL, it is available as a database flag. Azure Database for PostgreSQL supports it through server parameters.

Configure Kafka Connect Auto-Restart

Kafka Connect will not restart a FAILED task on its own, but its error-handling framework (the errors.retry.* properties, part of Apache Kafka since 2.0) lets a task retry retriable operations instead of failing immediately:

{
  "errors.retry.timeout": "300000",
  "errors.retry.delay.max.ms": "60000"
}

To apply this to a running connector, update its configuration through the Kafka Connect REST API:

# Enable auto-restart with a delay
curl -X PUT http://kafka-connect:8083/connectors/my-pg-connector/config \
  -H "Content-Type: application/json" \
  -d '{
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "errors.retry.timeout": "-1",
    "errors.retry.delay.max.ms": "60000",
    ...
  }'

Setting errors.retry.timeout to -1 means retry indefinitely. The task keeps retrying the failing operation rather than transitioning to FAILED while WAL accumulates. Note that this covers only retriable errors; hard failures, such as an incompatible schema change, will still fail the task and require monitoring and manual intervention.

Use Heartbeat Queries

Debezium supports a heartbeat mechanism that periodically emits messages to a dedicated heartbeat topic and, when heartbeat.action.query is set, writes a small record to a database table. This keeps the replication connection active and lets Debezium confirm the slot position even when no data changes are occurring on tracked tables:

{
  "heartbeat.interval.ms": "30000",
  "heartbeat.action.query": "INSERT INTO debezium_heartbeat (id, ts) VALUES (1, now()) ON CONFLICT (id) DO UPDATE SET ts = now()"
}

Heartbeats are particularly important when your tracked tables have low write activity but other tables are generating heavy WAL. Without heartbeats, the slot position only advances when Debezium processes changes for tracked tables. Untracked table WAL still accumulates in the gap.

Create the heartbeat table:

CREATE TABLE IF NOT EXISTS debezium_heartbeat (
  id INTEGER PRIMARY KEY,
  ts TIMESTAMP NOT NULL DEFAULT now()
);

Monitor Slot Lag in Your Alerting System

Integrate the monitoring queries from the previous section into your existing alerting stack. Specific recommendations:

Prometheus + postgres_exporter: Recent exporter versions collect pg_replication_slots by default. Alert when the exported slot-lag metric exceeds your threshold; the exact metric name (for example, pg_replication_slots_pg_wal_lsn_diff) depends on the exporter version and any custom queries you have configured.

Datadog: Use a custom PostgreSQL check or the built-in replication metrics. Set monitors on postgresql.replication_slot.lag_bytes.

CloudWatch (RDS): The OldestReplicationSlotLag metric is available for RDS PostgreSQL instances. Set an alarm on this metric.

Regardless of the tool, the alert thresholds should be:

  • Warning: Slot lag > 1 GB or slot inactive for > 5 minutes
  • Critical: Slot lag > 10 GB or slot inactive for > 30 minutes

Adjust these based on your database’s write throughput and available disk space. The goal is enough lead time to react before the disk fills.
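The two-tier thresholds above reduce to a few comparisons; encoding them in the alerting layer keeps on-call pages consistent. A sketch using the example thresholds (substitute your own tuned values):

```python
def slot_alert_severity(lag_bytes: int, inactive_seconds: int) -> str:
    """Map slot lag and inactivity duration to an alert level using the
    example thresholds above: warning at 1 GB / 5 min, critical at
    10 GB / 30 min. Tune both for your disk capacity and write rate.
    """
    GB = 1024 ** 3
    if lag_bytes > 10 * GB or inactive_seconds > 30 * 60:
        return "critical"
    if lag_bytes > 1 * GB or inactive_seconds > 5 * 60:
        return "warning"
    return "ok"

print(slot_alert_severity(lag_bytes=0, inactive_seconds=0))            # ok
print(slot_alert_severity(lag_bytes=2 * 1024**3, inactive_seconds=0))  # warning
print(slot_alert_severity(lag_bytes=0, inactive_seconds=45 * 60))      # critical
```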

Test Failure and Recovery

Run chaos experiments on your CDC pipeline. Stop the Debezium connector intentionally and observe:

  • How quickly does your alerting detect the inactive slot?
  • How fast does WAL accumulate on your specific workload?
  • How long does it take to restart the connector and drain the lag?
  • What happens if the connector cannot restart and you need to drop the slot?

Knowing these numbers before an actual incident turns a 2am panic into a well-practiced runbook execution.

Building a Slot Management Runbook

Every team running Debezium against PostgreSQL should have a documented runbook that covers:

  1. Monitoring checks: Where to see slot status and lag (dashboards, queries, alerts)
  2. First response: How to restart a failed connector (API calls, credentials, access)
  3. Escalation: When to drop the slot and who can authorize it
  4. Recovery: How to redeploy the connector and manage the re-snapshot
  5. Post-incident: How to verify data consistency after a slot drop and re-snapshot

The queries and commands in this guide can serve as the technical foundation. Adapt them to your environment, test them regularly, and make sure more than one person on your team knows how to execute them. Replication slot incidents are not a question of “if.” They are a question of “when” and “how fast can you respond.”