14 min read
PostgreSQL WAL Slot Management: Prevention and Recovery at Production Scale
Master logical replication slot lifecycle, WAL retention tuning, slot abandonment recovery, and failover mechanics to prevent silent CDC interruptions and data loss in production Postgres environments.
A replication slot goes inactive at 3:47 AM on a Tuesday. The connector that held it restarted after a transient network partition and couldn’t reconnect. PostgreSQL starts accumulating WAL from the slot’s last position. Six hours later, max_slot_wal_keep_size kicks in and PostgreSQL drops the slot to free disk space. The connector comes back at 8 AM, looks for its slot, finds nothing, and exits with an error.
By 11 AM, a data analyst files a ticket: three metric dimensions in the revenue dashboard haven’t updated since midnight.
This sequence is avoidable. The WAL disk pressure is predictable. The slot drop is configurable. The detection window is measurable. What’s missing is usually the operational layer: the right configuration parameters, the right monitoring queries, and a practiced recovery path before you need it at 11 AM.
Why Slots Matter: The Cost of Slot Loss
Logical replication slots exist to solve a specific problem. PostgreSQL continuously writes changes to WAL segments, then cleans them up aggressively. Each segment is 16MB, and the database reclaims them as soon as it thinks no consumer needs them. Without a slot, a CDC connector that goes offline for an hour can come back to find the WAL it needs has already been vacuumed away. There’s no way to replay those changes. The pipeline has a gap with no alarm to show for it.
A logical slot pins WAL at the slot’s restart_lsn. PostgreSQL won’t reclaim segments behind that position as long as the slot exists. The connector can go offline, restart, even be recreated on a new host, and it’ll resume from exactly where it left off - as long as the slot is still there.
That guarantee only holds while the slot exists and WAL accumulation stays within disk limits. When either breaks, CDC data is lost without any notification to the downstream systems. The connector doesn’t receive an alert that it missed changes. Downstream systems don’t know they stopped receiving updates. The failure surfaces later, in a report or a dashboard, when someone finally compares what’s in the warehouse against what’s on the source database.
Three operational realities compound this. First, the connector is often one of many services on a deployment schedule, meaning restarts during low-traffic windows are routine. Second, low-traffic periods are exactly when slot advancement stalls and WAL builds up fastest relative to consumption. Third, teams configure max_slot_wal_keep_size without a full picture of their write amplification rate - a limit that sounds generous (say, 10GB) can fill in under six hours during a batch write window.
The result: a team discovers they’ve been running a CDC pipeline with a systemic failure mode for months, and has no clear answer to “how much data did we miss?”
Replication Slot Lifecycle and Configuration
PostgreSQL supports two slot types. Physical replication slots track a WAL position for physical standbys, ensuring the primary doesn’t reclaim WAL the standby hasn’t applied. Logical replication slots do the same for logical decoders - tools like Streamkap’s PostgreSQL connector that transform WAL into structured row-level change events.
Logical slots carry more state than physical ones. Each slot has a plugin (the logical decoding plugin, typically pgoutput), a database (slots are database-scoped, not server-scoped), a restart_lsn (how far back WAL must be retained), and a confirmed_flush_lsn (the last LSN the consumer acknowledged). The gap between restart_lsn and pg_current_wal_lsn() is your effective lag window - and that number tells you how much WAL is currently pinned to disk by that slot.
Slot states
A slot is either active or inactive. It’s active when a consumer holds it open; you can see the consumer’s PID in pg_replication_slots.active_pid. It’s inactive when no consumer is connected. An inactive slot isn’t a problem by itself - the connector may be between reconnects or on a scheduled run. An inactive slot with a large restart_lsn gap is the problem. It’s quietly pinning WAL while nothing is consuming it.
Slots don’t expire on their own in PostgreSQL. An inactive slot sits indefinitely, pinning WAL, until something explicit happens: the consumer reconnects and advances it, an operator drops it manually, or PostgreSQL drops it automatically when WAL accumulation hits max_slot_wal_keep_size.
Parameters that control slot lifetime
Two configuration parameters directly affect how much runway you have when a slot goes inactive.
max_slot_wal_keep_size (introduced in PostgreSQL 13) caps how much WAL a single slot can retain before PostgreSQL drops it. The PostgreSQL runtime configuration reference describes it as the maximum WAL size replication slots are allowed to retain. The default is -1, which means no limit. No-limit is safer in the sense that your slot won’t be dropped automatically, but it means a stalled connector can fill your disk entirely.
Setting this to a nonzero value creates a real tradeoff. The slot gets cleaned up before it takes the database down, but a connector offline for longer than that WAL budget will lose its position. For a database generating 5GB of WAL per hour, a 20GB limit gives roughly four hours before the slot is dropped. The right value depends on your write amplification rate and your acceptable connector downtime window. Setting it too low creates frequent recovery events; setting it too high (or leaving it unlimited) creates disk risk.
wal_keep_size sets a floor for how much WAL the primary retains regardless of slot state. It doesn’t interact directly with max_slot_wal_keep_size, but both affect total WAL disk consumption. Streamkap’s WAL monitoring guide recommends a minimum of three days of WAL retention, with five days as a target when you need headroom for troubleshooting.
Setting wal_level = logical is a prerequisite for logical replication and requires a server restart. Confirm that max_wal_senders is at least twice your expected slot count - each active logical slot uses one WAL sender process. These settings should be in place before creating any logical slots, not added reactively.
Heartbeats and slot advancement
An inactive slot is unavoidable when the connector isn’t running. But some low-traffic databases have a subtler problem: the connector is running, but writes on monitored tables are so infrequent that the connector’s WAL position barely advances. PostgreSQL’s WAL cursor doesn’t move until the connector reads and acknowledges events, so a quiet database can have a technically-active slot that’s falling further behind on confirmed_flush_lsn.
Heartbeats solve this. Streamkap’s heartbeat configuration describes two protection layers. The connector emits internal heartbeat messages to keep its offset fresh. A database-side heartbeat writes periodic updates to a dedicated table to generate WAL events that the connector must process. For PostgreSQL 14 and later, there’s a cleaner alternative: logical-message heartbeats via pg_logical_emit_message(), which advances the slot without requiring any new table or scheduled job on the source database. This option is worth using when you don’t own the source database or want to avoid schema changes.
The recommended heartbeat interval is one minute for most deployments. That’s short enough to keep slots advancing through quiet periods, but low enough overhead that it doesn’t show up in query load.
Detection: Monitoring Slot Health and WAL Growth
The earlier you detect a slot drifting toward failure, the cheaper the recovery. Catching a slot that’s 2GB behind is a minor operational note. Catching it at 18GB, with max_slot_wal_keep_size set to 20GB, is a critical incident with a two-hour recovery window.
The core monitoring query
The pg_replication_slots view is the starting point. Query it regularly and compute pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) for each slot - this gives you the number of WAL bytes pinned by each slot. Sort descending to surface your most-at-risk slots first.
A useful alert threshold: warn when any slot’s retained WAL crosses 50% of your max_slot_wal_keep_size value, and page at 80%. That gives you two meaningful response windows before the automatic drop. If you’ve left max_slot_wal_keep_size at -1, anchor your alert to absolute disk consumption instead - warn when WAL utilization exceeds 60% of available disk, page at 80%.
For inactive slots specifically, also look at the active column. An inactive slot with a large WAL gap is a different problem from an active slot with the same gap. The inactive slot isn’t catching up; the active one will catch up as soon as write volume allows. Treat them as separate alert categories.
Metric integrations
Streamkap’s WAL monitoring guide covers the specific metrics to watch in common observability tools.
In Datadog, pg_stat_replication.flush_lag tracks replication lag per connected consumer, and postgresql.wal_bytes tracks total WAL disk utilization. Slot-level metrics, like WAL retained per slot, require a custom check querying pg_replication_slots directly.
In Grafana, the pg_stat_wal view’s wal_bytes column gives WAL generation rate. For slot health, query pg_replication_slots and expose active, restart_lsn, and confirmed_flush_lsn as panel metrics.
For Prometheus, the postgres_exporter (from the Prometheus community) exposes pg_replication_slots_wal_status and pg_replication_slots_active by default. These are often the cleanest path to slot alerting in a Kubernetes environment.
For cloud-managed databases where you can’t install exporters, write a lightweight SQL probe that runs on a schedule and logs slot state to a monitoring table. Alert on that table’s output rather than on the Postgres metrics layer directly.
Two alerts that cover the critical path
The first alert: WAL retained by inactive slots, in bytes. This catches the “connector offline” scenario before it becomes a disk issue. Set it to fire when any inactive slot’s retained WAL exceeds your chosen threshold.
The second alert: slot count by state. Alert if the number of active slots drops below the expected count for more than ten minutes. This catches connectors that have stopped without surfacing an obvious error - which is more common than it sounds after upgrades or infrastructure changes.
Avoid alerting on raw WAL disk utilization alone. Active, healthy slots accumulate WAL at the same rate as stalled ones. Without slot-level context, the metric doesn’t tell you whether action is needed.
Recovery from Abandoned or Dropped Slots
Recovery looks different depending on whether you catch the problem while the slot still exists or after PostgreSQL has already dropped it.
Diagnosing the current state
Start with pg_replication_slots. Look at active, restart_lsn, and confirmed_flush_lsn for each slot. An abandoned slot has active = false and a large gap between restart_lsn and the current WAL position. A dropped slot simply won’t appear in the view - if your connector’s expected slot name is missing, the slot is gone.
Cross-reference with pg_stat_replication. This view shows active WAL streaming connections. If the expected slot has no corresponding row here, the consumer is disconnected. The combination of “slot exists, consumer absent” is the abandoned-slot case. The combination of “slot missing” is the dropped-slot case.
The confirmed_flush_lsn column is also useful for understanding scope. It tells you the last WAL position the connector acknowledged. That’s the boundary between “we definitely have this data” and “this may be missing.” On recovery, you need to re-establish coverage from that LSN forward.
Recovering an abandoned slot
If the slot still exists and the connector was just offline, reconnecting is often enough. The connector resumes from its saved confirmed_flush_lsn, the slot advances, and WAL is released. Verify the slot goes active within a minute or two of reconnection by checking pg_replication_slots.active.
If WAL accumulated to near the max_slot_wal_keep_size limit and the slot is still present, don’t wait. Reconnect immediately or drop and recreate rather than risk the automatic drop. A slot sitting at 90% of the WAL limit is already in emergency territory - the window before PostgreSQL drops it automatically may be under an hour.
Recovering from a dropped slot
When the slot is gone, the connector has no valid WAL position to resume from. Create a new logical replication slot using pg_create_logical_replication_slot() with the appropriate plugin (pgoutput for most setups). Then determine which tables have a data gap - the gap spans from the connector’s last acknowledged confirmed_flush_lsn to the current WAL position.
The snapshot trade-off matters here. A full blocking snapshot pauses streaming for the affected tables and reads all rows in a single pass. It’s faster in elapsed time but not resumable. A network interruption or source-side failure halfway through means starting over from row one. For a table that takes six hours to snapshot, that’s a significant risk.
An incremental snapshot reads tables in chunks, allows streaming to continue concurrently, and resumes from the last committed chunk on failure. The total elapsed time is longer - potentially much longer for high-write tables, since chunk reads compete with live traffic - but each failure costs only the current chunk rather than the entire table. For tables under 50 million rows with moderate write rates, a blocking snapshot is usually the faster path. For larger tables or high-write tables, incremental is safer even if it takes longer.
After completing the re-snapshot, verify row counts on key tables against the source directly. A connector that reports success isn’t always right. A brief divergence check - even a simple count comparison on the five highest-traffic tables - is worth the two minutes it takes before you close the incident.
Cleaning up stale slots
One failure mode teams consistently overlook: a slot created during testing or setup that was never cleaned up. An idle test slot with active = false pins WAL just as effectively as a production slot. Audit your slot list regularly by querying pg_replication_slots for inactive slots and comparing against your known active connectors. Any inactive slot that’s been sitting for more than an hour without a known owner should be investigated before being dropped.
Dropping a slot is a one-liner against pg_drop_replication_slot(), but it’s irreversible. Confirm the slot owner before proceeding. The PostgreSQL documentation on pg_replication_slots explains what each column represents and how to interpret slot state.
Slot Management in Failover and High-Availability Setups
Replication slot behaviour in HA configurations surprises most teams the first time they encounter it. The documentation is clear, but the implications don’t surface until the first failover.
Slots don’t replicate
In PostgreSQL 15 and earlier, logical replication slots exist only on the primary. Physical streaming replication copies data pages and WAL between primary and standby, but it doesn’t copy slot state. When a standby is promoted - planned or otherwise - it starts with an empty slot list. The PostgreSQL warm standby documentation confirms this behaviour directly.
Every failover is therefore also a slot recovery event. The connector connects to the new primary, finds its slot missing, and fails. You then need to recreate the slot and re-snapshot affected tables, exactly as in the dropped-slot scenario above. For a system under continuous CDC load, an unplanned failover can mean hours of downtime if the re-snapshot operation isn’t scripted and practiced.
Plan for this explicitly. Keep your snapshot playbook documented alongside your HA runbook. After every failover, recreating slots and initiating re-snapshots should be one of the first post-promotion steps, not something figured out under pressure.
What PostgreSQL 16 changes
PostgreSQL 16 introduced the ability for physical standbys to hold logical replication slots in sync with the primary. With this configured, the standby tracks the primary’s slot state and keeps it current. On promotion, the slot is available immediately and the connector can resume without a re-snapshot.
The catch: not all managed services support it. Amazon RDS PostgreSQL, Google Cloud SQL, and Azure Database for PostgreSQL each have their own support matrix, and it varies by version and configuration. Check your specific service’s documentation before relying on this for a production HA setup. Discovering that your managed service doesn’t support logical standbys at promotion time is the wrong moment.
On self-managed PostgreSQL 16+ clusters, the configuration requires setting standby_slot_names on the primary and granting appropriate replication permissions. Test with a planned failover before you need this in production. The feature works, but “it should work” and “we’ve tested the failover” are not the same thing.
Checkpoint strategy and WAL pressure in HA
In an HA cluster, checkpoint frequency affects how quickly WAL is flushed to standbys and how much WAL must be retained if a standby lags. The interaction with replication slots creates compounding pressure: a lagging standby forces WAL retention, which increases disk utilization on the primary, which can trigger max_slot_wal_keep_size drops on your CDC slot even though the standby is the root problem.
Monitor standby lag via pg_stat_replication.flush_lag alongside slot lag. If both are growing simultaneously, the standby is almost certainly the cause. Addressing standby lag (network tuning, reducing write amplification, adjusting checkpoint_completion_target) often resolves the slot pressure downstream without touching any CDC configuration.
For Patroni-managed clusters, the maximum_lag_on_failover setting controls whether Patroni promotes a standby that’s significantly behind. Setting this conservatively relative to your WAL retention window means Patroni waits for the standby to catch up rather than promoting one that will immediately trigger a slot recovery operation. The precise value depends on your write rate and retention settings, but a starting point is to cap it at 20-30% of your WAL disk budget.
Where to next?
- Streamkap’s PostgreSQL connector - slot lifecycle management, heartbeats, and snapshot recovery built in
- Monitoring the PostgreSQL WAL Log - the complete WAL monitoring reference, including Datadog, Grafana, and Prometheus integration
- Heartbeat Configuration - setting up two-layer heartbeats to keep slots advancing on low-traffic databases
- Start a free 30-day trial - production-ready CDC on PostgreSQL with slot management handled for you
Related resources
What Is Snowflake Marketplace Capacity Drawdown Explained
What is Snowflake Marketplace Capacity Drawdown? This guide explains how it works, its benefits, and how to manage costs to maximize your Snowflake investment.
MySQL Binary Log CDC Deep Dive: Formats, GTIDs, and Replication
Technical deep dive into MySQL binary log internals for CDC — covering row-based logging, GTID replication, event structure, and purge policies.
PostgreSQL Logical Replication Internals: WAL, Slots, and Decoding
Deep dive into PostgreSQL logical replication internals covering WAL segments, logical decoding, replication slots, publications, and slot management.