
Engineering

February 25, 2026

11 min read

PostgreSQL to Snowflake in Real Time: A Step-by-Step Guide

Learn how to stream data from PostgreSQL to Snowflake in real time using CDC. Compare approaches, understand architecture patterns, and build a sub-second latency pipeline.

TL;DR:
  • Real-time PostgreSQL-to-Snowflake pipelines use CDC (Change Data Capture) to read the PostgreSQL WAL and deliver changes to Snowflake with sub-second latency.
  • The key architectural choice is the delivery method: direct CDC-to-Snowflake, or CDC-to-Kafka-to-Snowflake for decoupling and replay.
  • Managed CDC platforms like Streamkap eliminate the operational overhead of running Debezium, Kafka Connect, and Snowpipe yourself.

Moving data from PostgreSQL to Snowflake is one of the most common data engineering tasks, and for good reason. PostgreSQL is a dominant OLTP database powering application backends, while Snowflake excels at analytical workloads. The question is no longer whether to replicate data between the two, but how fast you can do it. Batch ETL that runs every hour or overnight leaves your analytics stale and your dashboards out of sync with reality. Real-time CDC pipelines close that gap, giving analysts and operational systems access to data that is seconds old instead of hours old.

This guide walks through the architecture, configuration, and operational considerations for building a production-grade real-time pipeline from PostgreSQL to Snowflake using Change Data Capture.

Architecture Options

There are three primary approaches for moving data from PostgreSQL to Snowflake. Each involves different trade-offs around latency, complexity, and cost.

Direct CDC to Snowflake

A CDC connector reads the PostgreSQL WAL and writes change events directly to Snowflake using Snowpipe Streaming or staged file loads. This is the simplest architecture with the fewest moving parts.

CDC via Kafka to Snowflake

A Debezium-based connector captures changes into a Kafka topic. A separate Kafka Connect sink connector (or Snowpipe) then loads data from Kafka into Snowflake. This adds a durable message bus between source and destination, enabling replay and fan-out to multiple consumers.

Batch ETL

Traditional tools query PostgreSQL on a schedule (every 5 minutes, hourly, or daily) and load results into Snowflake. This is the easiest approach to set up, but it has the highest latency and puts periodic query load on the source database.

Criteria                  | Direct CDC           | CDC via Kafka         | Batch ETL
Latency                   | 5-15 seconds         | 5-30 seconds          | Minutes to hours
Source impact             | Minimal (WAL read)   | Minimal (WAL read)    | High (query load)
Captures all changes      | Yes                  | Yes                   | No (misses intermediate states)
Operational complexity    | Low-Medium           | High                  | Low
Replay capability         | No                   | Yes (Kafka retention) | Re-run query
Fan-out to multiple sinks | No                   | Yes                   | Separate pipelines

For most teams, direct CDC offers the best balance of simplicity and performance. The Kafka-based approach is justified when you need to route the same change stream to multiple destinations or require durable replay.

PostgreSQL Setup for CDC

Before any CDC tool can capture changes, PostgreSQL must be configured for logical replication. This involves three steps: WAL configuration, replication slot creation, and publication setup.

Enable Logical Replication

Set the WAL level to logical in your PostgreSQL configuration. This instructs PostgreSQL to include enough information in the WAL for row-level change decoding.

-- Check current WAL level
SHOW wal_level;

-- In postgresql.conf (requires restart)
wal_level = logical
max_replication_slots = 4    -- at least 1 per CDC connector
max_wal_senders = 4          -- at least 1 per CDC connector

On managed PostgreSQL services (Amazon RDS, Cloud SQL, Azure Database), you typically set this through a parameter group or configuration flag rather than editing postgresql.conf directly.

Create a Replication Slot and Publication

A replication slot ensures PostgreSQL retains WAL segments until the CDC consumer has processed them. A publication defines which tables are included in the change stream.

-- Create a dedicated replication user
CREATE ROLE cdc_user WITH REPLICATION LOGIN PASSWORD 'secure_password';

-- Grant read access to your tables
GRANT USAGE ON SCHEMA public TO cdc_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_user;

-- Create a publication for the tables you want to stream
CREATE PUBLICATION snowflake_pub FOR TABLE orders, customers, products;

-- Or publish all tables
CREATE PUBLICATION snowflake_pub FOR ALL TABLES;

A critical operational note: monitor replication slot lag. If the CDC consumer goes down and the slot is not consumed, PostgreSQL will retain WAL segments indefinitely, potentially filling the disk. Set max_slot_wal_keep_size (PostgreSQL 13+) to cap WAL retention.

Snowflake Ingestion Methods

On the Snowflake side, there are several ways to load streaming data. The right choice depends on your latency requirements and architecture.

Snowpipe (File-Based)

Snowpipe monitors an external stage (S3, GCS, Azure Blob) and automatically loads new files into a target table. The CDC tool writes micro-batch files to the stage, and Snowpipe picks them up. Typical latency is 30-90 seconds from file landing to queryable data.

Snowpipe Streaming

Snowpipe Streaming uses the Snowflake Ingest SDK to write rows directly to Snowflake without staging files. This eliminates the file-write-and-notify cycle, reducing ingestion latency to under 10 seconds. It is the preferred method for low-latency CDC pipelines.

COPY INTO

The COPY INTO command loads data from staged files on demand. It is a batch operation, not continuous, but can be scheduled at short intervals for near-real-time behavior.

Snowflake Connector for Kafka

If you use the Kafka-based architecture, Snowflake provides an official Kafka Connect sink connector. It reads from Kafka topics and uses either Snowpipe or Snowpipe Streaming under the hood.

-- Example: create target table in Snowflake
CREATE TABLE analytics.public.orders (
    id NUMBER,
    customer_id NUMBER,
    total_amount NUMBER(12, 2),
    status VARCHAR,
    created_at TIMESTAMP_NTZ,
    updated_at TIMESTAMP_NTZ,
    _cdc_operation VARCHAR,       -- INSERT, UPDATE, DELETE
    _cdc_timestamp TIMESTAMP_NTZ  -- when the change was captured
);

Building the Pipeline Step by Step

Regardless of the specific tools you choose, the pipeline follows a three-stage pattern: capture, transform, and load.

Stage 1: Capture

The CDC connector attaches to the PostgreSQL replication slot, decodes WAL entries, and emits structured change events. Each event contains the operation type (INSERT, UPDATE, DELETE), the full row after the change (and optionally the row before), and metadata like the transaction ID and commit timestamp.

Stage 2: Transform

Raw CDC events often require transformation before loading. Common transformations include flattening nested JSON columns, renaming fields to match Snowflake naming conventions, filtering out irrelevant tables or columns, and adding metadata columns like the CDC operation type and capture timestamp.
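
These transformations can be sketched in a few lines of Python. The example below flattens a Debezium-style change envelope (the `op`, `before`, `after`, and `ts_ms` field names follow Debezium's conventions; your connector's event shape may differ) and appends the metadata columns used in the target table above:

```python
from datetime import datetime, timezone

def transform_event(event: dict) -> dict:
    """Flatten a Debezium-style change event into a Snowflake-ready row.

    Assumes the common Debezium envelope: "op" is c/u/d, "after" holds the
    row image (None for deletes, where "before" is used instead), and
    "ts_ms" is the capture timestamp in epoch milliseconds.
    """
    op_map = {"c": "INSERT", "u": "UPDATE", "d": "DELETE"}
    row = dict(event["after"] if event["after"] is not None else event["before"])
    row["_cdc_operation"] = op_map[event["op"]]
    row["_cdc_timestamp"] = datetime.fromtimestamp(
        event["ts_ms"] / 1000, tz=timezone.utc
    ).isoformat()
    return row
```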

Stage 3: Load

Transformed events are delivered to Snowflake through one of the ingestion methods described above. The loader must handle idempotency (avoiding duplicate inserts if a batch is retried) and ordering (ensuring updates are applied in the correct sequence).
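
As an illustration of the idempotency and ordering concerns, here is a minimal Python sketch that collapses a batch to one event per primary key before applying it (the `lsn` field name is an assumption, standing in for whatever log position your connector exposes):

```python
def dedupe_batch(events: list, key: str = "id") -> list:
    """Collapse a batch to one event per primary key, keeping the event
    with the highest log position ("lsn" here, an assumed field name).

    Applying the collapsed batch is idempotent: retrying it produces the
    same final row state, and per-key ordering holds because only the
    latest change for each key survives.
    """
    latest = {}
    for event in events:
        pk = event[key]
        if pk not in latest or event["lsn"] > latest[pk]["lsn"]:
            latest[pk] = event
    # Return in ascending log order so cross-key ordering stays stable too.
    return sorted(latest.values(), key=lambda e: e["lsn"])
```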

Schema Mapping Considerations

PostgreSQL and Snowflake have different type systems, and careful mapping prevents data loss and query errors.

PostgreSQL Type | Snowflake Type    | Notes
INTEGER, BIGINT | NUMBER            | Direct mapping
NUMERIC(p,s)    | NUMBER(p,s)       | Preserve precision and scale
VARCHAR(n)      | VARCHAR(n)        | Direct mapping
TEXT            | VARCHAR(16777216) | Snowflake max VARCHAR
BOOLEAN         | BOOLEAN           | Direct mapping
TIMESTAMP       | TIMESTAMP_NTZ     | No timezone
TIMESTAMPTZ     | TIMESTAMP_TZ      | With timezone
JSONB / JSON    | VARIANT           | Enables Snowflake semi-structured queries
ARRAY           | ARRAY or VARIANT  | Depends on element types
UUID            | VARCHAR(36)       | Stored as string
BYTEA           | BINARY            | Binary data
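
The mapping table can be encoded in a small helper. This Python sketch handles the parametrized types and falls back to VARIANT for anything unmapped (the fallback is a design choice for this sketch, not a Snowflake requirement):

```python
import re

# Encodes the mapping table above; parametrized types keep their arguments.
_TYPE_MAP = {
    "integer": "NUMBER", "bigint": "NUMBER",
    "text": "VARCHAR(16777216)",
    "boolean": "BOOLEAN",
    "timestamp": "TIMESTAMP_NTZ", "timestamptz": "TIMESTAMP_TZ",
    "jsonb": "VARIANT", "json": "VARIANT",
    "uuid": "VARCHAR(36)",
    "bytea": "BINARY",
}

def pg_to_snowflake(pg_type: str) -> str:
    """Map a PostgreSQL column type to its Snowflake equivalent."""
    t = pg_type.strip().lower()
    m = re.fullmatch(r"numeric\((\d+),\s*(\d+)\)", t)
    if m:
        return f"NUMBER({m.group(1)}, {m.group(2)})"  # preserve precision/scale
    m = re.fullmatch(r"varchar\((\d+)\)", t)
    if m:
        return f"VARCHAR({m.group(1)})"
    return _TYPE_MAP.get(t, "VARIANT")  # assumed fallback for unknown types
```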

Schema Evolution

Production tables change. Columns get added, types get altered, and tables get renamed. Your pipeline must handle schema evolution gracefully. At a minimum, it should automatically add new columns to the Snowflake target table when they appear in PostgreSQL. Managed platforms like Streamkap handle schema evolution automatically, propagating ALTER TABLE ADD COLUMN changes from PostgreSQL to Snowflake without manual intervention or pipeline restarts.
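
A minimal sketch of the column-diff logic behind automatic evolution, assuming you can fetch column lists from both systems (the function and argument names here are illustrative):

```python
def evolution_statements(source_cols: dict, target_cols: set, table: str) -> list:
    """Generate ALTER TABLE statements for columns that exist in the
    PostgreSQL source but not yet in the Snowflake target.

    source_cols maps column name -> Snowflake type (post type-mapping);
    target_cols is the set of column names currently in the target table.
    """
    return [
        f"ALTER TABLE {table} ADD COLUMN {name} {sf_type};"
        for name, sf_type in source_cols.items()
        if name not in target_cols
    ]
```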

Handling Deletes and Updates

CDC captures every operation, but how you materialize deletes and updates in Snowflake requires deliberate design.

Soft Deletes

Instead of physically deleting rows in Snowflake, append a _deleted flag and a _deleted_at timestamp. This preserves historical data and simplifies auditing.

-- Soft delete approach: mark deleted rows
MERGE INTO analytics.orders AS target
USING staging.orders_cdc AS source
ON target.id = source.id
WHEN MATCHED AND source._cdc_operation = 'DELETE' THEN
    UPDATE SET target._deleted = TRUE, target._deleted_at = CURRENT_TIMESTAMP()
WHEN MATCHED AND source._cdc_operation = 'UPDATE' THEN
    UPDATE SET target.total_amount = source.total_amount,
               target.status = source.status,
               target.updated_at = source.updated_at
WHEN NOT MATCHED AND source._cdc_operation != 'DELETE' THEN
    INSERT (id, customer_id, total_amount, status, created_at, updated_at)
    VALUES (source.id, source.customer_id, source.total_amount, source.status,
            source.created_at, source.updated_at);

SCD Type 2

For dimension tables where you need to track historical attribute values, implement Slowly Changing Dimension Type 2 by maintaining valid_from, valid_to, and is_current columns. Each update closes the existing record and inserts a new one rather than overwriting.
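
The close-and-insert pattern can be expressed as pure row logic. This hypothetical Python helper shows a single update closing the current record and appending a new one (column names match the convention above; the timestamp handling is simplified):

```python
def apply_scd2(history: list, change: dict, now: str) -> list:
    """Apply one change event to an SCD Type 2 history list.

    Each history row carries valid_from, valid_to, and is_current. An
    update closes the current row (valid_to = now, is_current = False)
    and appends a new current row, rather than overwriting in place.
    """
    key = change["id"]
    updated = []
    for row in history:
        if row["id"] == key and row["is_current"]:
            # Close out the existing current version of this record.
            updated.append(dict(row, valid_to=now, is_current=False))
        else:
            updated.append(row)
    # Insert the new current version.
    updated.append(dict(change, valid_from=now, valid_to=None, is_current=True))
    return updated
```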

Self-Managed vs Managed Platforms

Running a CDC pipeline in production is an ongoing operational commitment. Here is what the two paths look like in practice.

Self-Managed (Debezium + Kafka + Snowpipe)

You deploy and maintain Debezium (via Kafka Connect), a Kafka cluster, and either the Snowflake Kafka connector or a custom Snowpipe integration. This gives you full control but requires:

  • Kafka cluster management (brokers, ZooKeeper/KRaft, topic configuration)
  • Debezium connector configuration and monitoring
  • Schema Registry for Avro/JSON schema management
  • Snowflake connector tuning and credential rotation
  • Alerting on replication lag, consumer lag, and connector failures
  • Handling connector restarts, rebalances, and offset management

Managed (Streamkap)

A managed CDC platform abstracts the infrastructure layer. With Streamkap, you configure the PostgreSQL source and Snowflake destination through a UI or API. The platform handles WAL reading, change capture, schema mapping, exactly-once delivery, and automatic schema evolution. The operational surface shrinks from managing a distributed system to monitoring a dashboard.

Dimension             | Self-Managed                         | Managed (Streamkap)
Setup time            | Days to weeks                        | Minutes
Ongoing maintenance   | High (patching, scaling, monitoring) | Minimal
Latency               | 5-30 seconds (tunable)               | Sub-second capture, seconds to Snowflake
Schema evolution      | Manual or custom scripts             | Automatic
Exactly-once delivery | Complex to guarantee                 | Built-in
Cost model            | Infrastructure + engineering time    | Usage-based

Performance Tuning

Once the pipeline is running, several parameters affect throughput and latency.

Batch Size and Flush Interval

Most CDC connectors batch events before writing to the destination. A smaller batch size reduces latency but increases the number of write operations (and Snowflake credits consumed). A larger batch improves throughput efficiency but adds latency.

# Typical tuning parameters (connector-specific)
batch.size=1000              # events per batch
flush.interval.ms=5000       # maximum time before flushing

Start with a 5-second flush interval and 1,000-event batch size, then adjust based on observed latency and cost.
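
The trade-off can be captured in a small flush policy. This Python sketch (class and parameter names are illustrative) flushes whenever either threshold is hit, mirroring the batch.size and flush.interval.ms settings above:

```python
import time

class BatchFlusher:
    """Buffer events and flush when either the batch-size or the
    flush-interval threshold is reached."""

    def __init__(self, batch_size=1000, flush_interval_s=5.0, clock=time.monotonic):
        self.batch_size = batch_size
        self.flush_interval_s = flush_interval_s
        self.clock = clock  # injectable for testing
        self.buffer = []
        self.last_flush = clock()

    def add(self, event):
        """Buffer one event; return the flushed batch if a threshold was hit."""
        self.buffer.append(event)
        if (len(self.buffer) >= self.batch_size
                or self.clock() - self.last_flush >= self.flush_interval_s):
            return self.flush()
        return None

    def flush(self):
        batch, self.buffer = self.buffer, []
        self.last_flush = self.clock()
        return batch
```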

Parallelism

Partition your change stream by table or by primary key range to enable parallel loading. Snowpipe Streaming supports multiple channels writing to the same table concurrently, which can dramatically improve throughput for high-volume tables.
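
A common way to partition by primary key is to hash into a fixed channel count, so that all changes for a given row land in one channel and per-key ordering is preserved while channels load concurrently. A minimal sketch, assuming integer channel IDs:

```python
import hashlib

def channel_for(primary_key, num_channels: int) -> int:
    """Route a change event to one of num_channels parallel loader
    channels. Hashing the primary key keeps every change for a given
    row in the same channel, preserving per-key ordering."""
    digest = hashlib.md5(str(primary_key).encode()).hexdigest()
    return int(digest, 16) % num_channels
```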

Snowflake Warehouse Sizing

If you use MERGE statements or COPY INTO for loading, the virtual warehouse size directly affects load performance. For continuous CDC loads, an X-Small warehouse is typically sufficient for moderate volumes (up to a few thousand events per second). Scale up for initial backfill or catch-up scenarios.

-- Create a dedicated warehouse for CDC ingestion
CREATE WAREHOUSE cdc_loader
    WAREHOUSE_SIZE = 'X-SMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE;

Monitoring

A real-time pipeline requires real-time monitoring. Track these metrics to detect issues before they affect downstream consumers.

Source Side (PostgreSQL)

  • Replication slot lag: The difference between the current WAL position and the slot’s confirmed flush position. Growing lag means the consumer is falling behind.
  • WAL disk usage: Monitor pg_replication_slots and WAL directory size. Unbounded growth indicates a stalled consumer.
-- Check replication slot lag
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots;

Pipeline Metrics

  • Events per second: Throughput at each stage (capture, transform, load).
  • End-to-end latency: Time from PostgreSQL commit to data queryable in Snowflake. Measure this by inserting a sentinel row with a timestamp and querying it in Snowflake.
  • Error rate: Failed deliveries, schema mismatches, or serialization errors.

Destination Side (Snowflake)

  • Snowpipe load history: Query INFORMATION_SCHEMA.LOAD_HISTORY or SNOWPIPE_STREAMING_FILE_MIGRATION_HISTORY to track load latency and row counts.
  • Row count drift: Periodically compare source and destination row counts to detect silent data loss.
-- Check recent Snowpipe load history
SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'ORDERS',
    START_TIME => DATEADD(HOUR, -1, CURRENT_TIMESTAMP())
))
ORDER BY LAST_LOAD_TIME DESC;

Set alerts on replication slot lag exceeding 100 MB, end-to-end latency exceeding your SLA threshold, and any non-zero error rates. These three signals catch the vast majority of pipeline failures before they cascade into stale data or missing records.
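
Those alert rules can be expressed as a simple check. This Python sketch uses the 100 MB lag threshold from above plus an assumed 30-second SLA default (substitute your own):

```python
def alert_signals(slot_lag_bytes: int, latency_s: float, error_count: int,
                  lag_limit_bytes: int = 100 * 1024 * 1024,
                  latency_sla_s: float = 30.0) -> list:
    """Evaluate the three pipeline health signals described above and
    return the names of any that should fire. The 30-second SLA default
    is an assumption; set it to your own latency target."""
    alerts = []
    if slot_lag_bytes > lag_limit_bytes:
        alerts.append("replication_slot_lag")
    if latency_s > latency_sla_s:
        alerts.append("end_to_end_latency")
    if error_count > 0:
        alerts.append("error_rate")
    return alerts
```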

Final Thoughts

Building a real-time PostgreSQL to Snowflake pipeline is a well-understood problem with mature tooling. The core steps are consistent regardless of implementation: enable logical replication in PostgreSQL, capture changes via CDC, transform events to match your Snowflake schema, and load using Snowpipe Streaming or a Kafka-based sink. The architectural decision that matters most is whether you manage the infrastructure yourself or use a managed platform like Streamkap to eliminate the operational overhead. Either way, the result is fresh, reliable data in Snowflake, measured in seconds, not hours.