PostgreSQL to Snowflake in Real Time: A Step-by-Step Guide
Learn how to stream data from PostgreSQL to Snowflake in real time using CDC. Compare approaches, understand architecture patterns, and build a sub-second latency pipeline.
Moving data from PostgreSQL to Snowflake is one of the most common data engineering tasks, and for good reason. PostgreSQL is a dominant OLTP database powering application backends, while Snowflake excels at analytical workloads. The question is no longer whether to replicate data between the two, but how fast you can do it. Batch ETL that runs every hour or overnight leaves your analytics stale and your dashboards out of sync with reality. Real-time CDC pipelines close that gap, giving analysts and operational systems access to data that is seconds old instead of hours old.
This guide walks through the architecture, configuration, and operational considerations for building a production-grade real-time pipeline from PostgreSQL to Snowflake using Change Data Capture.
Architecture Options
There are three primary approaches for moving data from PostgreSQL to Snowflake. Each involves different trade-offs around latency, complexity, and cost.
Direct CDC to Snowflake
A CDC connector reads the PostgreSQL WAL and writes change events directly to Snowflake using Snowpipe Streaming or staged file loads. This is the simplest architecture with the fewest moving parts.
CDC via Kafka to Snowflake
A Debezium-based connector captures changes into a Kafka topic. A separate Kafka Connect sink connector (or Snowpipe) then loads data from Kafka into Snowflake. This adds a durable message bus between source and destination, enabling replay and fan-out to multiple consumers.
Batch ETL
Traditional tools query PostgreSQL on a schedule (every 5 minutes, hourly, or daily) and load results into Snowflake. This is the simplest to set up but has the highest latency and puts periodic read load on the source database.
| Criteria | Direct CDC | CDC via Kafka | Batch ETL |
|---|---|---|---|
| Latency | 5-15 seconds | 5-30 seconds | Minutes to hours |
| Source impact | Minimal (WAL read) | Minimal (WAL read) | High (query load) |
| Captures all changes | Yes | Yes | No (misses intermediate states) |
| Operational complexity | Low-Medium | High | Low |
| Replay capability | No | Yes (Kafka retention) | Re-run query |
| Fan-out to multiple sinks | No | Yes | Separate pipelines |
For most teams, direct CDC offers the best balance of simplicity and performance. The Kafka-based approach is justified when you need to route the same change stream to multiple destinations or require durable replay.
PostgreSQL Setup for CDC
Before any CDC tool can capture changes, PostgreSQL must be configured for logical replication. This involves three steps: WAL configuration, replication slot creation, and publication setup.
Enable Logical Replication
Set the WAL level to logical in your PostgreSQL configuration. This instructs PostgreSQL to include enough information in the WAL for row-level change decoding.
-- Check current WAL level
SHOW wal_level;
# In postgresql.conf (requires restart); note that postgresql.conf uses # for comments
wal_level = logical
max_replication_slots = 4    # at least one per CDC connector
max_wal_senders = 4          # at least one per CDC connector
On managed PostgreSQL services (Amazon RDS, Cloud SQL, Azure Database), you typically set this through a parameter group or configuration flag rather than editing postgresql.conf directly.
Create a Replication Slot and Publication
A replication slot ensures PostgreSQL retains WAL segments until the CDC consumer has processed them. A publication defines which tables are included in the change stream.
-- Create a dedicated replication user
CREATE ROLE cdc_user WITH REPLICATION LOGIN PASSWORD 'secure_password';
-- Grant read access to your tables
GRANT USAGE ON SCHEMA public TO cdc_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_user;
-- Create a publication for the tables you want to stream
CREATE PUBLICATION snowflake_pub FOR TABLE orders, customers, products;
-- Or publish all tables (requires superuser; an alternative to the publication above, not an addition)
CREATE PUBLICATION snowflake_pub FOR ALL TABLES;
A critical operational note: monitor replication slot lag. If the CDC consumer goes down and the slot is not consumed, PostgreSQL will retain WAL segments indefinitely, potentially filling the disk. Set max_slot_wal_keep_size (PostgreSQL 13+) to cap WAL retention.
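Both the cap and the slot health check can be expressed directly in SQL. A minimal sketch (the 10 GB value is illustrative; the wal_status and safe_wal_size columns also require PostgreSQL 13+):

```sql
-- Cap how much WAL a lagging slot can force PostgreSQL to retain (PostgreSQL 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();

-- Check slot health: a wal_status of 'lost' means the slot was invalidated
-- and the CDC consumer will need a fresh snapshot
SELECT slot_name, active, wal_status, safe_wal_size
FROM pg_replication_slots;
```

The trade-off is explicit: with the cap set, a stalled consumer can no longer fill the disk, but if it stays down long enough to exceed the cap, the slot is invalidated and replication must be re-initialized.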
Snowflake Ingestion Methods
On the Snowflake side, there are several ways to load streaming data. The right choice depends on your latency requirements and architecture.
Snowpipe (File-Based)
Snowpipe monitors an external stage (S3, GCS, Azure Blob) and automatically loads new files into a target table. The CDC tool writes micro-batch files to the stage, and Snowpipe picks them up. Typical latency is 30-90 seconds from file landing to queryable data.
Snowpipe Streaming
Snowpipe Streaming uses the Snowflake Ingest SDK to write rows directly to Snowflake without staging files. This eliminates the file-write-and-notify cycle, reducing ingestion latency to under 10 seconds. It is the preferred method for low-latency CDC pipelines.
COPY INTO
The COPY INTO command loads data from staged files on demand. It is a batch operation, not continuous, but can be scheduled at short intervals for near-real-time behavior.
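One way to run COPY INTO on a short interval is a Snowflake task. A sketch, assuming a hypothetical external stage named cdc_stage holding JSON change files, and the cdc_loader warehouse created later in this guide:

```sql
-- Hypothetical one-minute schedule: load any newly staged CDC files
CREATE TASK load_orders_cdc
  WAREHOUSE = cdc_loader
  SCHEDULE = '1 MINUTE'
AS
  COPY INTO analytics.public.orders
  FROM @cdc_stage/orders/
  FILE_FORMAT = (TYPE = 'JSON');

-- Tasks are created suspended; resume to start the schedule
ALTER TASK load_orders_cdc RESUME;
```

COPY INTO skips files it has already loaded (within its load-history window), so re-running the task on the same stage path is safe.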
Snowflake Connector for Kafka
If you use the Kafka-based architecture, Snowflake provides an official Kafka Connect sink connector. It reads from Kafka topics and uses either Snowpipe or Snowpipe Streaming under the hood.
-- Example: create target table in Snowflake
CREATE TABLE analytics.public.orders (
    id NUMBER,
    customer_id NUMBER,
    total_amount NUMBER(12, 2),
    status VARCHAR,
    created_at TIMESTAMP_NTZ,
    updated_at TIMESTAMP_NTZ,
    _cdc_operation VARCHAR,       -- INSERT, UPDATE, DELETE
    _cdc_timestamp TIMESTAMP_NTZ  -- when the change was captured
);
Building the Pipeline Step by Step
Regardless of the specific tools you choose, the pipeline follows a three-stage pattern: capture, transform, and load.
Stage 1: Capture
The CDC connector attaches to the PostgreSQL replication slot, decodes WAL entries, and emits structured change events. Each event contains the operation type (INSERT, UPDATE, DELETE), the full row after the change (and optionally the row before), and metadata like the transaction ID and commit timestamp.
Stage 2: Transform
Raw CDC events often require transformation before loading. Common transformations include flattening nested JSON columns, renaming fields to match Snowflake naming conventions, filtering out irrelevant tables or columns, and adding metadata columns like the CDC operation type and capture timestamp.
Stage 3: Load
Transformed events are delivered to Snowflake through one of the ingestion methods described above. The loader must handle idempotency (avoiding duplicate inserts if a batch is retried) and ordering (ensuring updates are applied in the correct sequence).
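A common way to get both properties in Snowflake is to deduplicate each staged batch before applying it, keeping only the newest event per primary key. A sketch against the staging table used in the MERGE example later in this guide, ordering by the capture timestamp (a transaction LSN works equally well if your connector emits one):

```sql
-- Collapse a possibly-retried batch to the latest change event per key;
-- feeding this into a MERGE makes the load idempotent
SELECT *
FROM staging.orders_cdc
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY id
    ORDER BY _cdc_timestamp DESC
) = 1;
```

Because each key contributes exactly one row reflecting its latest state, re-applying the same batch after a retry produces the same result in the target table.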
Schema Mapping Considerations
PostgreSQL and Snowflake have different type systems, and careful mapping prevents data loss and query errors.
| PostgreSQL Type | Snowflake Type | Notes |
|---|---|---|
| INTEGER, BIGINT | NUMBER | Direct mapping |
| NUMERIC(p,s) | NUMBER(p,s) | Preserve precision and scale |
| VARCHAR(n) | VARCHAR(n) | Direct mapping |
| TEXT | VARCHAR(16777216) | Snowflake maximum VARCHAR length |
| BOOLEAN | BOOLEAN | Direct mapping |
| TIMESTAMP | TIMESTAMP_NTZ | No time zone |
| TIMESTAMPTZ | TIMESTAMP_TZ | With time zone |
| JSONB / JSON | VARIANT | Enables Snowflake semi-structured queries |
| ARRAY | ARRAY or VARIANT | Depends on element types |
| UUID | VARCHAR(36) | Stored as string |
| BYTEA | BINARY | Binary data |
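The JSONB-to-VARIANT mapping pays off at query time, since Snowflake's path syntax reads nested fields directly. A sketch, assuming a hypothetical orders_raw table with a payload VARIANT column:

```sql
-- Extract nested fields from a JSONB column landed as VARIANT
-- (table and field names are illustrative)
SELECT
    payload:shipping_address:city::VARCHAR AS city,
    payload:items[0]:sku::VARCHAR          AS first_sku
FROM analytics.public.orders_raw;
```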
Schema Evolution
Production tables change. Columns get added, types get altered, and tables get renamed. Your pipeline must handle schema evolution gracefully. At a minimum, it should automatically add new columns to the Snowflake target table when they appear in PostgreSQL. Managed platforms like Streamkap handle schema evolution automatically, propagating ALTER TABLE ADD COLUMN changes from PostgreSQL to Snowflake without manual intervention or pipeline restarts.
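As an illustration of what "handling it" means in a self-managed pipeline (the discount_code column is hypothetical), each source-side DDL change must be mirrored on the target before events carrying the new field are loaded:

```sql
-- PostgreSQL: a new column appears on the source table
ALTER TABLE orders ADD COLUMN discount_code VARCHAR(32);

-- Snowflake: the target must be altered to match, or the loader
-- will drop or reject the new field
ALTER TABLE analytics.public.orders ADD COLUMN discount_code VARCHAR(32);
```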
Handling Deletes and Updates
CDC captures every operation, but how you materialize deletes and updates in Snowflake requires deliberate design.
Soft Deletes
Instead of physically deleting rows in Snowflake, append a _deleted flag and a _deleted_at timestamp. This preserves historical data and simplifies auditing.
-- Soft delete approach: mark deleted rows
MERGE INTO analytics.orders AS target
USING staging.orders_cdc AS source
ON target.id = source.id
WHEN MATCHED AND source._cdc_operation = 'DELETE' THEN
    UPDATE SET target._deleted = TRUE, target._deleted_at = CURRENT_TIMESTAMP()
WHEN MATCHED AND source._cdc_operation = 'UPDATE' THEN
    UPDATE SET target.total_amount = source.total_amount,
               target.status = source.status,
               target.updated_at = source.updated_at
WHEN NOT MATCHED AND source._cdc_operation != 'DELETE' THEN
    INSERT (id, customer_id, total_amount, status, created_at, updated_at)
    VALUES (source.id, source.customer_id, source.total_amount, source.status,
            source.created_at, source.updated_at);
SCD Type 2
For dimension tables where you need to track historical attribute values, implement Slowly Changing Dimension Type 2 by maintaining valid_from, valid_to, and is_current columns. Each update closes the existing record and inserts a new one rather than overwriting.
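A minimal two-statement sketch of that pattern, assuming a hypothetical dim_customers dimension and a customers_cdc staging table of changed rows:

```sql
-- Close out the current version of each customer that changed
UPDATE dim_customers
SET valid_to = s._cdc_timestamp,
    is_current = FALSE
FROM customers_cdc s
WHERE dim_customers.id = s.id
  AND dim_customers.is_current = TRUE;

-- Insert the new version as the open, current record
INSERT INTO dim_customers (id, name, tier, valid_from, valid_to, is_current)
SELECT id, name, tier, _cdc_timestamp, NULL, TRUE
FROM customers_cdc;
```

In production the two statements should run in a single transaction so a failure between them cannot leave a customer with no current row.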
Self-Managed vs Managed Platforms
Running a CDC pipeline in production is an ongoing operational commitment. Here is what the two paths look like in practice.
Self-Managed (Debezium + Kafka + Snowpipe)
You deploy and maintain Debezium (via Kafka Connect), a Kafka cluster, and either the Snowflake Kafka connector or a custom Snowpipe integration. This gives you full control but requires:
- Kafka cluster management (brokers, ZooKeeper/KRaft, topic configuration)
- Debezium connector configuration and monitoring
- Schema Registry for Avro/JSON schema management
- Snowflake connector tuning and credential rotation
- Alerting on replication lag, consumer lag, and connector failures
- Handling connector restarts, rebalances, and offset management
Managed (Streamkap)
A managed CDC platform abstracts the infrastructure layer. With Streamkap, you configure the PostgreSQL source and Snowflake destination through a UI or API. The platform handles WAL reading, change capture, schema mapping, exactly-once delivery, and automatic schema evolution. The operational surface shrinks from managing a distributed system to monitoring a dashboard.
| Dimension | Self-Managed | Managed (Streamkap) |
|---|---|---|
| Setup time | Days to weeks | Minutes |
| Ongoing maintenance | High (patching, scaling, monitoring) | Minimal |
| Latency | 5-30 seconds (tunable) | Sub-second capture, seconds to Snowflake |
| Schema evolution | Manual or custom scripts | Automatic |
| Exactly-once delivery | Complex to guarantee | Built-in |
| Cost model | Infrastructure + engineering time | Usage-based |
Performance Tuning
Once the pipeline is running, several parameters affect throughput and latency.
Batch Size and Flush Interval
Most CDC connectors batch events before writing to the destination. A smaller batch size reduces latency but increases the number of write operations (and Snowflake credits consumed). A larger batch improves throughput efficiency but adds latency.
# Typical tuning parameters (connector-specific)
batch.size=1000 # events per batch
flush.interval.ms=5000 # maximum time before flushing
Start with a 5-second flush interval and 1,000-event batch size, then adjust based on observed latency and cost.
Parallelism
Partition your change stream by table or by primary key range to enable parallel loading. Snowpipe Streaming supports multiple channels writing to the same table concurrently, which can dramatically improve throughput for high-volume tables.
Snowflake Warehouse Sizing
If you use MERGE statements or COPY INTO for loading, the virtual warehouse size directly affects load performance. For continuous CDC loads, an X-Small warehouse is typically sufficient for moderate volumes (up to a few thousand events per second). Scale up for initial backfill or catch-up scenarios.
-- Create a dedicated warehouse for CDC ingestion
CREATE WAREHOUSE cdc_loader
WAREHOUSE_SIZE = 'X-SMALL'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
INITIALLY_SUSPENDED = TRUE;
Monitoring
A real-time pipeline requires real-time monitoring. Track these metrics to detect issues before they affect downstream consumers.
Source Side (PostgreSQL)
- Replication slot lag: The difference between the current WAL position and the slot’s confirmed flush position. Growing lag means the consumer is falling behind.
- WAL disk usage: Monitor pg_replication_slots and the WAL directory size. Unbounded growth indicates a stalled consumer.
-- Check replication slot lag
SELECT slot_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots;
Pipeline Metrics
- Events per second: Throughput at each stage (capture, transform, load).
- End-to-end latency: Time from PostgreSQL commit to data queryable in Snowflake. Measure this by inserting a sentinel row with a timestamp and querying it in Snowflake.
- Error rate: Failed deliveries, schema mismatches, or serialization errors.
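The sentinel-row measurement can be sketched in two statements, assuming a hypothetical heartbeat table that is included in the publication and replicated through the pipeline:

```sql
-- PostgreSQL: refresh the sentinel with the commit-side timestamp
INSERT INTO heartbeat (id, emitted_at)
VALUES (1, now())
ON CONFLICT (id) DO UPDATE SET emitted_at = EXCLUDED.emitted_at;

-- Snowflake: once the update arrives, the gap is the end-to-end latency
SELECT DATEDIFF('second', emitted_at, CURRENT_TIMESTAMP()) AS latency_seconds
FROM analytics.public.heartbeat
WHERE id = 1;
```

Running the source-side statement on a fixed schedule turns this into a continuous latency probe you can alert on.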
Destination Side (Snowflake)
- Snowpipe load history: Query INFORMATION_SCHEMA.LOAD_HISTORY or SNOWPIPE_STREAMING_FILE_MIGRATION_HISTORY to track load latency and row counts.
- Row count drift: Periodically compare source and destination row counts to detect silent data loss.
-- Check recent Snowpipe load history
SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
TABLE_NAME => 'ORDERS',
START_TIME => DATEADD(HOUR, -1, CURRENT_TIMESTAMP())
))
ORDER BY LAST_LOAD_TIME DESC;
Set alerts on replication slot lag exceeding 100 MB, end-to-end latency exceeding your SLA threshold, and any non-zero error rates. These three signals catch the vast majority of pipeline failures before they cascade into stale data or missing records.
Final Thoughts
Building a real-time PostgreSQL to Snowflake pipeline is a well-understood problem with mature tooling. The core steps are consistent regardless of implementation: enable logical replication in PostgreSQL, capture changes via CDC, transform events to match your Snowflake schema, and load using Snowpipe Streaming or a Kafka-based sink. The architectural decision that matters most is whether you manage the infrastructure yourself or use a managed platform like Streamkap to eliminate the operational overhead. Either way, the result is fresh, reliable data in Snowflake, measured in seconds, not hours.