CDC to Destination: Architecture Patterns for Every Target
A complete reference for CDC delivery architecture patterns - direct streaming, hub-and-spoke, and transform-in-flight - with destination-specific guidance for Snowflake, BigQuery, ClickHouse, and more.
Change data capture solves the extraction problem: it captures every row-level change from a source database as a stream of events. But extraction is only half the pipeline. The architectural decisions about how those events travel to their destinations - and what happens to them along the way - determine latency, reliability, schema flexibility, and operational complexity.
This guide is a complete reference for CDC delivery architecture. It covers the three foundational patterns, the tradeoffs between them, and concrete guidance for every major analytical destination.
The Three Foundational Patterns
Pattern 1: Direct Streaming (CDC → Destination)
Source DB ──[CDC]──► Destination
(PostgreSQL)         (ClickHouse / Elasticsearch / Redis)
Direct streaming is the simplest topology: CDC events flow from source to a single destination with no intermediate broker. The CDC connector reads the database transaction log and writes directly to the destination using that destination’s native ingestion protocol.
Characteristics:
- Lowest latency (sub-second achievable)
- Simplest operational model (one pipeline to monitor)
- No fan-out: only works for a single destination
- Destination availability becomes a hard dependency - backpressure or downtime propagates to the source
Best for: Real-time operational use cases with a single target - search indexes, operational caches, feature stores, real-time dashboards. If ClickHouse is your only target and you want the lowest possible latency, go direct.
Not suitable for: Multi-destination delivery, complex transformations, or scenarios where different consumers need different schemas.
Pattern 2: Hub-and-Spoke (CDC → Kafka → Multiple Destinations)
Source DB ──[CDC]──► Kafka ──► Destination A (Snowflake)
                       ├─────► Destination B (BigQuery)
                       ├─────► Destination C (Elasticsearch)
                       └─────► Destination D (ClickHouse)
Kafka acts as the durable, replayable hub. The CDC connector writes events to Kafka topics; downstream sink connectors (Kafka Connect sinks, custom consumers, or managed connectors) consume those topics and write to each destination independently.
Characteristics:
- Fan-out: one source feeds any number of destinations
- Destinations are decoupled - a slow or unavailable destination does not block others
- Kafka provides replayability: rewind and re-deliver to a destination without re-reading the source DB
- Operational overhead: Kafka cluster, consumer groups, lag monitoring, schema registry
- Each hop adds latency (Kafka write + consumer lag)
Best for: Data platform teams serving multiple analytical and operational consumers from the same source system. The standard enterprise pattern.
Kafka topic design for CDC:
| Concern | Recommendation |
|---|---|
| Topic naming | {env}.{db}.{schema}.{table} (e.g., prod.orders.public.customers) |
| Partitioning | Partition by primary key to preserve per-row ordering |
| Retention | At least 7 days; longer for disaster recovery |
| Compaction | Enable log compaction for event-sourcing and state rebuild use cases |
| Schema | Avro or Protobuf with a schema registry for consumer compatibility |
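The naming and partitioning recommendations above can be sketched in a few lines of Python. Kafka's default partitioner hashes the record key with murmur2; `crc32` stands in here for simplicity, and the names are illustrative. The point is only that a fixed key always maps to the same partition, which is what preserves per-row ordering.

```python
import zlib

def topic_name(env: str, db: str, schema: str, table: str) -> str:
    """Build the {env}.{db}.{schema}.{table} topic name recommended above."""
    return f"{env}.{db}.{schema}.{table}"

def partition_for_key(primary_key: str, num_partitions: int) -> int:
    """Map a row's primary key to a partition. Kafka's default partitioner
    uses murmur2; crc32 is a stand-in with the property we care about:
    the same key always lands on the same partition."""
    return zlib.crc32(primary_key.encode("utf-8")) % num_partitions

topic = topic_name("prod", "orders", "public", "customers")
# Every change event for customer 42 routes to the same partition, in order.
p1 = partition_for_key("customer:42", 12)
p2 = partition_for_key("customer:42", 12)
```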
Pattern 3: Transform-in-Flight (CDC → Processing → Destination)
Source DB ──[CDC]──► Kafka ──► Flink / Processing ──► Destination
When CDC events need to be transformed, enriched, aggregated, or filtered before landing at the destination, a stateful stream processor sits between Kafka and the destination. Apache Flink is the standard choice for this tier.
Characteristics:
- Full transformation capability: filter, aggregate, join, enrich, reshape
- Can materialize pre-aggregated views for analytical destinations
- Adds processing latency on top of the hub-and-spoke latency
- Higher operational complexity: Flink cluster, checkpointing, state management
- Enables exactly-once delivery semantics end-to-end (when using Flink’s transactional sink)
Best for: Scenarios requiring pre-aggregation (Snowflake or BigQuery summary tables), cross-stream joins (enrich CDC events before landing), sensitive data masking, or schema transformation.
When not to use it: If your destination (Snowflake, BigQuery) has sufficient compute to do transformations at query time, adding a Flink layer purely for transformation may be unnecessary complexity. Prefer transform-in-flight when the transformation reduces volume significantly (aggregation) or when data must be shaped differently for different consumers from the same source topic.
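As a concrete sketch of the volume-reduction case, here is the kind of collapse-and-aggregate step a Flink job would perform, written as plain Python over a hypothetical micro-batch of Debezium-style events (the field names and tiers are illustrative):

```python
from collections import defaultdict

# Hypothetical CDC events: operation, row key, and the row's new state.
events = [
    {"op": "c", "key": "1", "after": {"tier": "gold",   "amount": 120}},
    {"op": "c", "key": "2", "after": {"tier": "silver", "amount": 40}},
    {"op": "u", "key": "1", "after": {"tier": "gold",   "amount": 150}},
    {"op": "d", "key": "2", "after": None},
]

def aggregate_batch(events):
    """Collapse a micro-batch to one row per key (last write wins), then
    pre-aggregate by tier -- the volume reduction that justifies a
    transform-in-flight tier."""
    latest = {}
    for e in events:              # events arrive in commit order
        latest[e["key"]] = e
    totals = defaultdict(int)
    for e in latest.values():
        if e["op"] != "d":        # drop rows whose final state is deleted
            totals[e["after"]["tier"]] += e["after"]["amount"]
    return dict(totals)
```

Four input events land as a single aggregate row, so the destination ingests a fraction of the raw change volume.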
Hybrid: Multi-Tier Topologies
Real-world architectures combine patterns. A common enterprise topology:
Source DBs (PostgreSQL, MySQL, Oracle)
        │
        │ CDC
        ▼
   Kafka (hub)
        ├──► Direct stream to Elasticsearch (search, sub-second latency required)
        ├──► Direct stream to Redis (feature store, millisecond latency required)
        └──► Flink (transform-in-flight)
                ├──► Snowflake Snowpipe Streaming (enriched analytics)
                └──► BigQuery Storage Write API (ML training datasets)
This topology uses direct streaming for latency-sensitive destinations and hub-and-spoke with transform-in-flight for analytical destinations that benefit from enrichment.
Destination-Specific Guidance
Snowflake
Preferred ingestion API: Snowpipe Streaming (channel-based SDK)
Snowpipe Streaming writes rows directly to Snowflake’s table buffer, bypassing the cloud storage staging step required by standard Snowpipe. This reduces end-to-end latency from minutes to seconds.
Architecture:
CDC → Kafka → Snowflake Kafka Connector (Snowpipe Streaming mode)
                          │
                          ▼
               Snowflake Table Buffer
                          │  (seconds, not minutes)
                          ▼
               Snowflake Table (queryable)
Latency characteristics:
- Standard Snowpipe (file-based): 1–5 minutes
- Snowpipe Streaming: 5–30 seconds
- Direct row insert via JDBC: seconds (but does not scale to high throughput)
Schema evolution: Snowflake supports ADD COLUMN without downtime. The Kafka connector can be configured to add new columns automatically when CDC events contain new fields. DROP COLUMN and type changes require manual intervention.
Deduplication: Snowpipe Streaming provides at-least-once delivery. Deduplicate in Snowflake using MERGE statements or stream-based deduplication in dbt. Include the CDC event’s _change_seq_num or lsn (log sequence number) as a column to detect and discard duplicates.
Recommended table design for CDC targets:
CREATE TABLE customers_cdc (
    customer_id    VARCHAR,
    email          VARCHAR,
    tier           VARCHAR,
    _cdc_op        VARCHAR,   -- 'c' (create), 'u' (update), 'd' (delete)
    _cdc_lsn       BIGINT,    -- source log sequence number
    _cdc_timestamp TIMESTAMP, -- event timestamp from source
    _loaded_at     TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
BigQuery
Preferred ingestion API: Storage Write API (committed mode)
The BigQuery Storage Write API provides exactly-once semantics and second-range latency; Google recommends it over the legacy streaming insert API (tabledata.insertAll) for new pipelines. In committed mode, the client creates a write stream and appends rows, which become queryable as soon as each append succeeds - similar in concept to Snowpipe Streaming.
Architecture:
CDC → Kafka → BigQuery Kafka Connector (Storage Write API)
                          │
                          ▼
              BigQuery Committed Write Stream
                          │  (seconds)
                          ▼
              BigQuery Table (immediately queryable)
Latency characteristics:
- Legacy streaming insert (tabledata.insertAll): seconds (legacy API, higher cost)
- Storage Write API (committed): 5–15 seconds
- Storage Write API (pending mode): batch-like, lower cost, higher latency
Partitioning strategy: Partition BigQuery tables by _cdc_timestamp or DATE(_cdc_timestamp) to improve query performance and control retention costs. Cluster by the primary key column for fast point lookups.
Schema evolution: BigQuery supports adding nullable columns automatically. Use the connector’s schema update option to propagate new columns from CDC events. Non-nullable columns and type changes require migration scripts.
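A minimal sketch of the schema-diff step such a connector performs: compare an incoming event against the known destination schema and report the columns it would add as NULLABLE. The type names returned here are Python's, purely for illustration, not BigQuery's actual type mapping.

```python
def new_nullable_columns(known_schema: dict, event: dict) -> dict:
    """Return fields present in a CDC event but missing from the destination
    schema -- the columns a connector's schema-update option would add as
    NULLABLE. Type names are illustrative, not BigQuery's mapping."""
    return {k: type(v).__name__ for k, v in event.items() if k not in known_schema}

known = {"customer_id": "STRING", "email": "STRING"}
event = {"customer_id": "42", "email": "a@x.com", "loyalty_points": 250}
```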
Databricks Delta Lake
Preferred ingestion API: Delta Lake auto loader + structured streaming, or Delta Kafka connector
Delta Lake on Databricks supports ACID transactions, schema evolution, and time travel - making it well-suited for CDC targets that need historical queryability.
Two ingestion approaches:
Option A: Kafka → Delta (streaming write)
# Databricks Structured Streaming
spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "prod.orders.public.customers") \
    .load() \
    .writeStream \
    .format("delta") \
    .option("checkpointLocation", "/delta/checkpoints/customers") \
    .toTable("customers_raw")
Option B: Apply CDC as MERGE (upsert)
from delta.tables import DeltaTable

def upsert_to_delta(batch_df, batch_id):
    # Apply one micro-batch as a MERGE. The conditional delete clause must
    # come before the unconditional update: Delta evaluates whenMatched
    # clauses in order, and only the last may omit a condition.
    target = DeltaTable.forName(spark, "customers")
    target.alias("t").merge(
        batch_df.alias("s"),
        "t.customer_id = s.customer_id"
    ).whenMatchedDelete(condition="s._cdc_op = 'd'") \
     .whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()

# Wire into a stream: df.writeStream.foreachBatch(upsert_to_delta).start()
Apache Iceberg: The pattern is identical to Delta Lake. Iceberg’s MERGE INTO support and its open format (readable by Snowflake, BigQuery, and Trino without copying data) make it increasingly preferred for multi-engine analytical platforms.
ClickHouse
Preferred ingestion API: Native protocol / HTTP bulk insert
ClickHouse is purpose-built for high-throughput analytical ingestion and achieves sub-second query latency at scale. CDC delivery to ClickHouse requires careful attention to its table engine and deduplication model.
Recommended table engine for CDC targets: ReplacingMergeTree
CREATE TABLE customers
(
    customer_id  UInt64,
    email        String,
    tier         String,
    _cdc_version UInt64,          -- use LSN or Unix timestamp millis
    _is_deleted  UInt8 DEFAULT 0,
    _updated_at  DateTime
)
ENGINE = ReplacingMergeTree(_cdc_version)
ORDER BY (customer_id);
ReplacingMergeTree deduplicates rows with the same ORDER BY key during background merges, keeping the row with the highest version number. This handles CDC upserts naturally.
Handling deletes: ReplacingMergeTree has no native handling for delete events. The standard pattern is a soft delete: write a new version of the row with _is_deleted = 1 and filter deleted rows out in queries:
SELECT * FROM customers FINAL WHERE _is_deleted = 0;
The FINAL keyword forces deduplication at query time rather than waiting for background merges.
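The combined effect of ReplacingMergeTree versioning, the soft-delete flag, and FINAL can be modeled in a few lines of Python. This is a simulation of the semantics, not ClickHouse itself, and the sample rows are illustrative.

```python
def query_final(rows, include_deleted=False):
    """Approximate `SELECT ... FINAL` on a ReplacingMergeTree table:
    keep the highest-_cdc_version row per ORDER BY key, then apply the
    soft-delete filter from the query above."""
    latest = {}
    for r in rows:
        k = r["customer_id"]
        if k not in latest or r["_cdc_version"] > latest[k]["_cdc_version"]:
            latest[k] = r
    return [r for r in latest.values()
            if include_deleted or not r["_is_deleted"]]

rows = [
    {"customer_id": 1, "tier": "gold",   "_cdc_version": 1, "_is_deleted": 0},
    {"customer_id": 1, "tier": "gold",   "_cdc_version": 2, "_is_deleted": 1},  # delete
    {"customer_id": 2, "tier": "silver", "_cdc_version": 1, "_is_deleted": 0},
]
```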
Latency: ClickHouse with direct streaming (no Kafka intermediate) achieves end-to-end latency under 1 second for most workloads. Batch size tuning matters: larger batches (thousands of rows) are more efficient than row-by-row inserts.
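A simple batching helper illustrates the tuning point: buffer rows and emit them in large chunks rather than inserting one at a time. The batch size here is a hypothetical starting point to tune against your workload.

```python
def batches(rows, batch_size=5000):
    """Group rows into large chunks -- ClickHouse ingests thousands of rows
    per insert far more efficiently than row-by-row writes."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch
```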
Elasticsearch / OpenSearch
Preferred ingestion API: Bulk API via Kafka Connect Elasticsearch connector
Elasticsearch is the standard target for full-text search and log analytics CDC use cases. The delivery pattern is typically hub-and-spoke with the Elasticsearch Kafka connector.
Index design for CDC targets:
{
  "mappings": {
    "properties": {
      "customer_id":    { "type": "keyword" },
      "email":          { "type": "keyword" },
      "name":           { "type": "text" },
      "tier":           { "type": "keyword" },
      "_cdc_timestamp": { "type": "date" },
      "_cdc_op":        { "type": "keyword" }
    }
  }
}
Handling deletes: The Elasticsearch connector supports tombstone record handling - a CDC delete event (with null value) triggers a document delete in the index. Configure the connector’s behavior.on.null.values to delete.
Schema evolution: Elasticsearch uses dynamic mapping by default, which will infer types for new fields. For production, use explicit mappings and update them before new fields appear in CDC events to avoid mapping conflicts.
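As a sketch of how a consumer could build a Bulk API payload from CDC events: using the primary key as the document `_id` makes redelivered events idempotent overwrites, and a tombstone (null value) becomes a document delete, mirroring the connector's behavior.on.null.values=delete setting. The index name and event shape are illustrative.

```python
import json

def bulk_actions(events):
    """Build newline-delimited Elasticsearch Bulk API actions from
    (primary_key, row) pairs. A None row is a tombstone -> delete."""
    lines = []
    for key, value in events:
        if value is None:   # tombstone: remove the document
            lines.append(json.dumps({"delete": {"_index": "customers", "_id": key}}))
        else:               # upsert keyed by primary key: idempotent on redelivery
            lines.append(json.dumps({"index": {"_index": "customers", "_id": key}}))
            lines.append(json.dumps(value))
    return "\n".join(lines) + "\n"

events = [("42", {"email": "a@x.com", "tier": "gold"}), ("43", None)]
```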
Latency and Throughput Comparison
| Pattern | Typical Latency | Max Throughput | Operational Complexity |
|---|---|---|---|
| Direct: CDC → ClickHouse | < 1 second | Very high | Low |
| Direct: CDC → Elasticsearch | 1–3 seconds | High | Low |
| Hub-and-spoke: → Snowflake (Snowpipe Streaming) | 5–30 seconds | High | Medium |
| Hub-and-spoke: → BigQuery (Storage Write API) | 5–15 seconds | High | Medium |
| Hub-and-spoke: → Delta Lake (streaming) | 10–60 seconds | High | Medium |
| Transform-in-flight: Flink → any destination | +5–30 seconds over base | High | High |
Exactly-Once vs. At-Least-Once Delivery
CDC pipelines almost universally provide at-least-once delivery: in the event of a failure and restart, some events may be redelivered. For idempotent destinations (ClickHouse ReplacingMergeTree, Elasticsearch document IDs, Snowflake MERGE), at-least-once is sufficient - duplicates are handled by the destination’s deduplication mechanism.
Exactly-once delivery requires transactional coordination between the source, the broker, and the destination. Flink provides end-to-end exactly-once when paired with a transactional Kafka source and a transactional sink (Snowflake, Iceberg). This eliminates the need for destination-side deduplication but adds latency (transactions commit in checkpoint intervals, typically 1–30 seconds) and operational complexity.
For most CDC delivery use cases, at-least-once with idempotent destination writes is the pragmatic choice.
Schema Change Handling Across Destinations
Schema changes - column additions, renames, type changes - are the most operationally sensitive aspect of CDC pipelines. A reliable CDC platform like Streamkap propagates schema changes automatically, pausing affected pipelines if a destination cannot accept the new schema and alerting operators.
| Destination | Column Add | Column Drop | Type Change |
|---|---|---|---|
| Snowflake | Automatic | Manual | Manual |
| BigQuery | Automatic (nullable) | Manual | Manual |
| ClickHouse | Requires ALTER TABLE | Requires ALTER TABLE | Manual |
| Elasticsearch | Auto (dynamic mapping) | Index rebuild | Index rebuild |
| Delta Lake | Automatic (schema evolution) | Manual | Manual |
Architecture Decision Guide
When choosing a CDC delivery pattern, answer these questions in order:
- How many destinations? One → consider direct streaming. More than one → hub-and-spoke.
- What latency is required? Sub-second → direct streaming to ClickHouse, Elasticsearch, or Redis. Seconds → Snowpipe Streaming / Storage Write API. Minutes → any pattern.
- Do events need transformation before landing? Yes → add transform-in-flight tier. No → skip Flink complexity.
- Do consumers need different schemas? Yes → Kafka + per-consumer transformation. No → single sink.
- Is replay required? Yes → route through Kafka for configurable retention. No → direct is fine.
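The decision order above can be condensed into a small helper. This is a sketch that encodes only the structural questions; it deliberately ignores SLA and team-capacity judgment.

```python
def choose_pattern(destinations: int, needs_transform: bool,
                   needs_replay: bool) -> str:
    """Encode the decision questions above; illustrative, not exhaustive."""
    if destinations == 1 and not needs_replay:
        pattern = "direct streaming"
    else:
        pattern = "hub-and-spoke"      # fan-out and/or replay require Kafka
    if needs_transform:
        pattern += " + transform-in-flight"
    return pattern
```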
There is no universally correct pattern. Direct streaming minimizes complexity and latency. Hub-and-spoke maximizes flexibility. Transform-in-flight enables data product architecture. The right answer depends on how many consumers you serve, what SLAs they require, and how much operational overhead your team can sustain.