
Engineering

February 25, 2026

10 min read

PostgreSQL to BigQuery with CDC: Real-Time Analytics Pipeline

Build a real-time data pipeline from PostgreSQL to BigQuery using Change Data Capture. Learn architecture patterns, schema mapping, and best practices for sub-minute analytics.

TL;DR:

  • CDC captures every INSERT, UPDATE, and DELETE from PostgreSQL's WAL and streams them to BigQuery, enabling analytics on data that is seconds old instead of hours.
  • BigQuery's Storage Write API is the preferred ingestion method for CDC streams because it supports exactly-once semantics and avoids the cost of staging files in GCS.
  • Managed CDC platforms like Streamkap handle the entire pipeline from PostgreSQL WAL to BigQuery table, including schema evolution and type mapping.

Organizations running PostgreSQL as their transactional database frequently need fresh data in BigQuery for analytics, dashboarding, and machine learning. Traditional batch ETL jobs that run every few hours leave analysts working with stale data and force engineering teams to maintain fragile extraction scripts. Change Data Capture (CDC) eliminates both problems by tailing PostgreSQL’s write-ahead log (WAL) and streaming every row-level change to BigQuery within seconds. The result is an analytics warehouse that reflects production data in near real time, without putting any read load on the source database.

This guide walks through the architecture patterns, configuration steps, and operational best practices for building a production-grade PostgreSQL-to-BigQuery CDC pipeline.

Architecture Patterns

There are three primary ways to move CDC data from PostgreSQL into BigQuery, each with different trade-offs in complexity, latency, and fault tolerance.

Direct CDC to BigQuery

A CDC connector reads the PostgreSQL WAL and writes directly to BigQuery using the Storage Write API. This is the simplest topology and works well when you have a single source database and do not need to fan out changes to multiple consumers. Latency is typically 10 to 30 seconds end-to-end. The downside is that there is no intermediate buffer; if BigQuery is temporarily unavailable, the connector must handle back-pressure and retry logic internally.

CDC via Kafka to BigQuery

In this pattern, a CDC connector (commonly Debezium) publishes WAL changes to Apache Kafka topics, and a separate sink connector consumes from Kafka and writes to BigQuery. Kafka acts as a durable buffer that decouples the source from the destination. This pattern is a natural fit when multiple downstream systems need the same change events, or when you already operate a Kafka cluster. The trade-off is operational complexity: you now manage a Kafka cluster, a source connector, and a sink connector.
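As a sketch, a Debezium PostgreSQL source connector registered with Kafka Connect looks roughly like this. The hostname, database name, and topic prefix are placeholders, and property names follow Debezium 2.x (older versions use database.server.name instead of topic.prefix):

```json
{
  "name": "postgres-cdc-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.example.internal",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "secure_password",
    "database.dbname": "appdb",
    "plugin.name": "pgoutput",
    "publication.name": "bigquery_pub",
    "slot.name": "bigquery_slot",
    "topic.prefix": "pg",
    "table.include.list": "public.orders,public.customers,public.products"
  }
}
```

POSTed to the Kafka Connect REST API, this produces one topic per captured table (pg.public.orders, and so on) for the sink connector to consume.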

Dataflow-Based Pipeline

Google Cloud Dataflow can read from Kafka or Pub/Sub and write to BigQuery using Apache Beam transforms. This pattern adds the ability to enrich, filter, or reshape events in flight before they land in BigQuery. It is the most flexible option but also the most expensive and complex, requiring Beam pipeline development and Dataflow runner management.

For most teams, the direct or Kafka-based approach covers the majority of use cases. Dataflow makes sense only when in-flight transformation logic is non-trivial.

PostgreSQL CDC Setup

Before any CDC tool can capture changes, PostgreSQL must be configured for logical replication.

Enable Logical Replication

Set the WAL level to logical and restart the server. You can do this by editing postgresql.conf directly or with ALTER SYSTEM, which persists the settings to postgresql.auto.conf:

ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 4;
ALTER SYSTEM SET max_wal_senders = 4;

After restarting, verify the setting:

SHOW wal_level;
-- Should return: logical

Create a Replication User and Publication

Create a dedicated user with replication privileges and a publication for the tables you want to capture:

CREATE ROLE cdc_user WITH REPLICATION LOGIN PASSWORD 'secure_password';
GRANT USAGE ON SCHEMA public TO cdc_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_user;

-- Create a publication for specific tables
CREATE PUBLICATION bigquery_pub FOR TABLE orders, customers, products;

-- Or capture all tables
CREATE PUBLICATION bigquery_pub FOR ALL TABLES;

Create a Replication Slot

The replication slot tracks the consumer’s position in the WAL so that PostgreSQL retains the segments the consumer has not yet read:

SELECT pg_create_logical_replication_slot('bigquery_slot', 'pgoutput');

Monitor the slot to avoid unbounded WAL growth:

SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_name = 'bigquery_slot';

If lag_bytes grows continuously, your consumer is not keeping up and WAL disk usage will increase until it is addressed.
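If a consumer is permanently decommissioned, drop its slot so PostgreSQL can reclaim the retained WAL; an abandoned slot will otherwise accumulate lag indefinitely:

```sql
-- Release the WAL segments retained for a consumer that no longer exists
SELECT pg_drop_replication_slot('bigquery_slot');
```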

BigQuery Ingestion Methods

How you land data in BigQuery has a significant impact on cost, latency, and delivery guarantees.

Storage Write API

The Storage Write API is the recommended method for CDC workloads. It supports exactly-once semantics in committed mode and avoids the overhead of staging files in Google Cloud Storage. Data is written directly into BigQuery’s columnar storage, making it queryable as soon as the stream is committed. Pricing is based on the volume of data ingested ($0.025 per GiB for committed writes at the time of writing, with the first 2 TiB per month free), which is substantially cheaper than the legacy streaming API.

Legacy Streaming Inserts

The tabledata.insertAll API offers row-level inserts with best-effort deduplication. It is simpler to use than the Storage Write API but more expensive ($0.01 per 200 MB) and does not provide exactly-once guarantees. For new pipelines, prefer the Storage Write API.

Staged Loads via GCS

You can write CDC events as Avro or Parquet files to a GCS bucket and then trigger a BigQuery load job. Load jobs are free of ingestion charges but introduce higher latency (typically minutes) and require orchestration logic to manage file batching and job submission.
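As a sketch, BigQuery's LOAD DATA statement can pull a batch of staged files in one step; the bucket path and table names here are hypothetical:

```sql
-- Load staged Avro files from GCS into a staging table in one job
LOAD DATA INTO `project.dataset.orders_staging`
FROM FILES (
  format = 'AVRO',
  uris = ['gs://my-cdc-bucket/orders/*.avro']
);
```

An orchestrator would run this on a schedule after each batch of files is finalized, then delete or archive the loaded files.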

BigQuery Connector for Kafka

If your architecture routes changes through Kafka, the open-source BigQuery Connector for Apache Kafka (formerly the Confluent BigQuery Sink Connector) consumes Kafka topics and writes to BigQuery using either the Storage Write API or load jobs. This is the standard approach for Kafka-based CDC pipelines.

Schema Mapping: PostgreSQL Types to BigQuery Types

Accurate type mapping is essential to avoid silent data corruption or failed loads.

| PostgreSQL Type | BigQuery Type | Notes |
| --- | --- | --- |
| integer, bigint | INT64 | Direct mapping |
| numeric(p,s) | NUMERIC or BIGNUMERIC | Use BIGNUMERIC for precision > 29 |
| real, double precision | FLOAT64 | Potential floating-point precision loss |
| boolean | BOOL | Direct mapping |
| text, varchar | STRING | Direct mapping |
| timestamp | TIMESTAMP | Ensure timezone handling is consistent |
| timestamptz | TIMESTAMP | CDC tools typically emit UTC |
| date | DATE | Direct mapping |
| jsonb | STRING or JSON | BigQuery JSON type supports native querying |
| uuid | STRING | No native UUID type in BigQuery |
| integer[] | REPEATED INT64 | Requires schema-aware connector |
| bytea | BYTES | Direct mapping |

Handling JSONB and Arrays

PostgreSQL jsonb columns can be mapped to BigQuery’s native JSON type, which allows you to query nested fields with JSON_VALUE() and JSON_QUERY() functions without parsing the entire document. If your CDC tool does not support the BigQuery JSON type, map to STRING and parse at query time.
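For example, with a hypothetical attributes column mapped from jsonb to the JSON type:

```sql
-- Query nested fields directly, without parsing the whole document
SELECT id,
       JSON_VALUE(attributes, '$.color') AS color
FROM `project.dataset.products`
WHERE JSON_VALUE(attributes, '$.size') = 'XL';
```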

PostgreSQL arrays can be mapped to BigQuery REPEATED fields if your connector supports it. Otherwise, serialize them as JSON strings and use JSON_EXTRACT_ARRAY() in BigQuery.
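And for the fallback case, a hypothetical tag_ids column replicated from integer[] as a JSON string can be exploded back into rows:

```sql
-- JSON_EXTRACT_ARRAY returns the elements as strings; cast them back
SELECT id, CAST(tag AS INT64) AS tag_id
FROM `project.dataset.products`,
UNNEST(JSON_EXTRACT_ARRAY(tag_ids)) AS tag;
```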

Handling Updates and Deletes

BigQuery is an append-optimized warehouse. CDC workloads that include UPDATE and DELETE operations require a strategy to represent mutable data.

Soft Deletes with a _deleted Flag

The simplest approach is to add a _deleted column (BOOL) and a _ts column (TIMESTAMP) to each row. Inserts and updates append a new row with the latest values. Deletes append a row with _deleted = true. Queries filter out deleted rows:

SELECT * FROM orders
WHERE TRUE  -- BigQuery requires a WHERE (or GROUP BY/HAVING) alongside QUALIFY
-- Keep the latest version of each row, then drop rows whose latest event is a
-- delete. Filtering _deleted in WHERE instead would wrongly return an older,
-- non-deleted version of a row whose most recent event is a delete.
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY _ts DESC) = 1
  AND _deleted = false;

This approach is straightforward but increases storage and makes queries slightly more complex.

Periodic MERGE Statements

You can land CDC events into a staging table and periodically run a MERGE statement to apply changes to a final table:

MERGE INTO orders_final AS target
USING (
  -- Deduplicate to the latest event per key: MERGE errors out if the
  -- source yields more than one matching row for a single target row.
  SELECT * FROM orders_staging
  WHERE TRUE
  QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY _ts DESC) = 1
) AS source
ON target.id = source.id
WHEN MATCHED AND source._deleted = true THEN DELETE
WHEN MATCHED THEN UPDATE SET
  status = source.status,
  total = source.total,
  updated_at = source.updated_at
WHEN NOT MATCHED AND source._deleted = false THEN
  INSERT (id, status, total, created_at, updated_at)
  VALUES (source.id, source.status, source.total, source.created_at, source.updated_at);

Schedule this with BigQuery scheduled queries or an orchestrator like Airflow.

BigQuery CDC-Native Tables

BigQuery’s native CDC support combines the max_staleness table option with a non-enforced primary key. You write changes through the Storage Write API with a _CHANGE_TYPE of UPSERT or DELETE, and BigQuery applies deduplication and deletes internally, serving query results no staler than the configured interval. This is the cleanest solution when available for your workload.
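A minimal sketch of such a table, reusing the orders schema from earlier (project, dataset, and the 15-minute interval are placeholders to tune for your SLA):

```sql
-- CDC-native table: non-enforced primary key plus max_staleness
CREATE TABLE `project.dataset.orders_live` (
  id INT64,
  status STRING,
  total NUMERIC,
  updated_at TIMESTAMP,
  PRIMARY KEY (id) NOT ENFORCED
)
OPTIONS (max_staleness = INTERVAL 15 MINUTE);
```

Writers then tag each row UPSERT or DELETE via the Storage Write API's _CHANGE_TYPE pseudocolumn, and BigQuery reconciles changes in the background.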

Partitioning and Clustering

Proper table design dramatically affects query performance and cost on CDC data.

Partitioning

Partition BigQuery tables by the CDC event timestamp or the source row’s updated_at column. This ensures that queries scanning recent data only read relevant partitions:

CREATE TABLE orders (
  id INT64,
  status STRING,
  total NUMERIC,
  created_at TIMESTAMP,
  updated_at TIMESTAMP,
  _cdc_timestamp TIMESTAMP
)
PARTITION BY DATE(_cdc_timestamp)
CLUSTER BY id;

Use require_partition_filter = true to prevent accidental full-table scans.
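On an existing table, the same option can be set after the fact (table name from the example above):

```sql
-- Reject any query that does not filter on the partitioning column
ALTER TABLE orders SET OPTIONS (require_partition_filter = true);
```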

Clustering

Cluster by the primary key column (e.g., id). This collocates rows for the same entity, making MERGE operations and point lookups faster and cheaper.

Cost Optimization

CDC pipelines can become expensive at scale if not configured carefully.

Storage Write API pricing: Committed writes cost $0.025 per GiB at the time of writing, with the first 2 TiB per month free. A pipeline ingesting 10 GB per day (roughly 300 GB per month) fits entirely within the free tier; even beyond it, that volume would cost only about $7.50/month. Either way, this is significantly cheaper than legacy streaming inserts.

Avoid over-replication: Use a PUBLICATION that includes only the tables BigQuery consumers actually need. Capturing every table in a busy PostgreSQL instance generates unnecessary WAL traffic and BigQuery ingestion costs.

Partition expiration: Set a partition expiration on staging tables to automatically delete old CDC events after they have been merged into the final table:

ALTER TABLE orders_staging
SET OPTIONS (partition_expiration_days = 7);

Slot reservations vs on-demand: For sustained high-throughput workloads, capacity-based pricing (BigQuery slot reservations, now sold through BigQuery editions) can be more cost-effective than on-demand pricing. Evaluate your query and ingestion patterns before committing.

Self-Managed vs Managed Pipelines

Building a CDC pipeline from scratch requires assembling and operating multiple components: Debezium, Kafka (or Kafka Connect), schema registry, BigQuery sink connector, monitoring, and alerting. Each component introduces its own failure modes, version compatibility matrix, and operational burden.

Managed CDC platforms like Streamkap collapse the entire pipeline into a configured service. You point it at your PostgreSQL instance and your BigQuery dataset, and it handles WAL reading, schema evolution, type mapping, exactly-once delivery, and monitoring. For teams that do not have dedicated streaming infrastructure engineers, the operational savings alone justify the cost.

The decision comes down to control vs. effort. Self-managed gives you full control over every component but demands ongoing engineering investment. Managed platforms trade some customization for dramatically reduced time-to-production and lower operational risk.

Monitoring and Alerting

A CDC pipeline has several failure points that require continuous monitoring.

PostgreSQL Side

  • Replication slot lag: Monitor pg_wal_lsn_diff between the current WAL position and the slot’s restart_lsn. Alert if lag exceeds a threshold (e.g., 1 GB) to prevent disk exhaustion.
  • WAL disk usage: Track the pg_wal directory size. Unbounded growth means the consumer is not acknowledging progress.
  • Replication slot activity: Alert if a slot becomes inactive, which means the consumer has disconnected.

SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;

BigQuery Side

  • Ingestion rate: Monitor INFORMATION_SCHEMA.WRITE_API_TIMELINE (or STREAMING_TIMELINE for legacy streaming inserts) to verify rows are arriving at the expected rate.
  • Write API errors: Track error rates on Storage Write API calls. Common issues include schema mismatches and quota exhaustion.
  • Staleness: Compare the maximum _cdc_timestamp in BigQuery with the current time. If the gap exceeds your SLA (e.g., 60 seconds), trigger an alert.

SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(_cdc_timestamp), SECOND) AS staleness_seconds
FROM `project.dataset.orders`;

Pipeline-Level Metrics

  • End-to-end latency: Measure the time between a row being committed in PostgreSQL and that row being queryable in BigQuery. This is the metric your stakeholders care about.
  • Throughput: Track rows per second and bytes per second through the pipeline. Sudden drops indicate a bottleneck or failure.
  • Schema drift detection: Alert on PostgreSQL schema changes (column additions, type changes) that have not been propagated to BigQuery. Streamkap handles schema evolution automatically, but self-managed pipelines need explicit detection and migration logic.

Putting It All Together

A well-built PostgreSQL-to-BigQuery CDC pipeline delivers analytics data that is seconds old rather than hours old, without burdening the source database with extraction queries. The key decisions are: which architecture pattern to use (direct, Kafka-buffered, or Dataflow), how to handle updates and deletes in BigQuery’s append-optimized storage, and whether to operate the pipeline yourself or use a managed platform.

For most teams, the combination of PostgreSQL logical replication, the BigQuery Storage Write API, and a managed CDC service like Streamkap provides the best balance of low latency, exactly-once delivery, and minimal operational overhead. Start with a single table, validate end-to-end latency and correctness, then expand to your full schema as confidence grows.