
Engineering

February 25, 2026

14 min read

CDC to Destination: Architecture Patterns for Every Target

A complete reference for CDC delivery architecture patterns - direct streaming, hub-and-spoke, and transform-in-flight - with destination-specific guidance for Snowflake, BigQuery, ClickHouse, and more.

TL;DR:

  • Direct streaming (CDC to destination) minimizes latency and operational overhead for single-destination pipelines.
  • Hub-and-spoke (CDC to Kafka to multiple destinations) is the standard pattern for data platform teams serving many consumers.
  • Transform-in-flight (CDC to Flink/processing to destination) is required when data must be reshaped, aggregated, or enriched before landing.
  • Each destination - Snowflake, BigQuery, Databricks, ClickHouse, Elasticsearch - has a preferred ingestion API that determines achievable latency and throughput.

Change data capture solves the extraction problem: it captures every row-level change from a source database as a stream of events. But extraction is only half the pipeline. The architectural decisions about how those events travel to their destinations - and what happens to them along the way - determine latency, reliability, schema flexibility, and operational complexity.

This guide is a complete reference for CDC delivery architecture. It covers the three foundational patterns, the tradeoffs between them, and concrete guidance for every major analytical destination.


The Three Foundational Patterns

Pattern 1: Direct Streaming (CDC → Destination)

Source DB ──[CDC]──► Destination
(PostgreSQL)         (ClickHouse / Elasticsearch / Redis)

Direct streaming is the simplest topology: CDC events flow from source to a single destination with no intermediate broker. The CDC connector reads the database transaction log and writes directly to the destination using that destination’s native ingestion protocol.

Characteristics:

  • Lowest latency (sub-second achievable)
  • Simplest operational model (one pipeline to monitor)
  • No fan-out: only works for a single destination
  • Destination availability becomes a hard dependency - backpressure or downtime propagates to the source

Best for: Real-time operational use cases with a single target - search indexes, operational caches, feature stores, real-time dashboards. If ClickHouse is your only target and you want the lowest possible latency, go direct.

Not suitable for: Multi-destination delivery, complex transformations, or scenarios where different consumers need different schemas.


Pattern 2: Hub-and-Spoke (CDC → Kafka → Multiple Destinations)

Source DB ──[CDC]──► Kafka ──► Destination A (Snowflake)
                         ├──► Destination B (BigQuery)
                         ├──► Destination C (Elasticsearch)
                         └──► Destination D (ClickHouse)

Kafka acts as the durable, replayable hub. The CDC connector writes events to Kafka topics; downstream sink connectors (Kafka Connect sinks, custom consumers, or managed connectors) consume those topics and write to each destination independently.

Characteristics:

  • Fan-out: one source feeds any number of destinations
  • Destinations are decoupled - a slow or unavailable destination does not block others
  • Kafka provides replayability: rewind and re-deliver to a destination without re-reading the source DB
  • Operational overhead: Kafka cluster, consumer groups, lag monitoring, schema registry
  • Each hop adds latency (Kafka write + consumer lag)

Best for: Data platform teams serving multiple analytical and operational consumers from the same source system. The standard enterprise pattern.

Kafka topic design for CDC:

  • Topic naming: {env}.{db}.{schema}.{table} (e.g., prod.orders.public.customers)
  • Partitioning: partition by primary key to preserve per-row ordering
  • Retention: at least 7 days; longer for disaster recovery
  • Compaction: enable log compaction for event-sourcing and state-rebuild use cases
  • Schema: Avro or Protobuf with a schema registry for consumer compatibility
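The naming and keying conventions above can be sketched in a few lines of Python. The helper names (topic_name, keyed_record) are illustrative, not from any library, and the actual produce call (e.g., via a Kafka client such as confluent-kafka) is omitted:

```python
import json

def topic_name(env: str, db: str, schema: str, table: str) -> str:
    """Build a CDC topic name following the {env}.{db}.{schema}.{table} convention."""
    return f"{env}.{db}.{schema}.{table}"

def keyed_record(event: dict, pk_field: str):
    """Serialize a CDC event with the primary key as the Kafka message key.

    Keying by primary key means every change to a given row hashes to the
    same partition, which is what preserves per-row ordering.
    """
    key = str(event[pk_field]).encode("utf-8")
    value = json.dumps(event).encode("utf-8")
    return key, value

# Example: an update event for customer 42
topic = topic_name("prod", "orders", "public", "customers")
key, value = keyed_record({"customer_id": 42, "tier": "gold", "_cdc_op": "u"},
                          pk_field="customer_id")
```

Because all events for customer 42 carry the same key, they land on the same partition and consumers see that row's changes in order.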

Pattern 3: Transform-in-Flight (CDC → Processing → Destination)

Source DB ──[CDC]──► Kafka ──► Flink / Processing ──► Destination

When CDC events need to be transformed, enriched, aggregated, or filtered before landing at the destination, a stateful stream processor sits between Kafka and the destination. Apache Flink is the standard choice for this tier.

Characteristics:

  • Full transformation capability: filter, aggregate, join, enrich, reshape
  • Can materialize pre-aggregated views for analytical destinations
  • Adds processing latency on top of the hub-and-spoke latency
  • Higher operational complexity: Flink cluster, checkpointing, state management
  • Enables exactly-once delivery semantics end-to-end (when using Flink’s transactional sink)

Best for: Scenarios requiring pre-aggregation (Snowflake or BigQuery summary tables), cross-stream joins (enrich CDC events before landing), sensitive data masking, or schema transformation.

When not to use it: If your destination (Snowflake, BigQuery) has sufficient compute to do transformations at query time, adding a Flink layer purely for transformation may be unnecessary complexity. Prefer transform-in-flight when the transformation reduces volume significantly (aggregation) or when data must be shaped differently for different consumers from the same source topic.
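To make the transformation tier concrete, here is the kind of logic a transform-in-flight job applies, expressed as plain Python rather than Flink's DataStream API. The field names and the masking scheme are illustrative assumptions:

```python
import hashlib

def mask_email(email: str) -> str:
    """Replace an email with a stable hash so joins on the masked value
    still work, but the raw address never lands in the destination."""
    digest = hashlib.sha256(email.encode("utf-8")).hexdigest()[:12]
    return f"masked-{digest}@example.invalid"

def transform(event: dict):
    """One transform-in-flight step: filter out internal test rows and
    mask PII. Returning None drops the event from the stream."""
    if event.get("tier") == "internal-test":
        return None
    out = dict(event)
    if "email" in out:
        out["email"] = mask_email(out["email"])
    return out

clean = transform({"customer_id": 1, "email": "jane@example.com", "tier": "gold"})
```

In a real deployment the same filter/map logic would run inside a Flink job between the Kafka source and the destination sink.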


Hybrid: Multi-Tier Topologies

Real-world architectures combine patterns. A common enterprise topology:

Source DBs (PostgreSQL, MySQL, Oracle)
        │ CDC
        ▼
Kafka (hub)
    ├──► Direct stream to Elasticsearch (search, sub-second latency required)
    ├──► Direct stream to Redis (feature store, millisecond latency required)
    └──► Flink (transform-in-flight)
              ├──► Snowflake Snowpipe Streaming (enriched analytics)
              └──► BigQuery Storage Write API (ML training datasets)

This topology uses direct streaming for latency-sensitive destinations and hub-and-spoke with transform-in-flight for analytical destinations that benefit from enrichment.


Destination-Specific Guidance

Snowflake

Preferred ingestion API: Snowpipe Streaming (channel-based SDK)

Snowpipe Streaming writes rows directly to Snowflake’s table buffer, bypassing the cloud storage staging step required by standard Snowpipe. This reduces end-to-end latency from minutes to seconds.

Architecture:

CDC → Kafka → Snowflake Kafka Connector (Snowpipe Streaming mode)
                      │
                      ▼
              Snowflake Table Buffer
                      │ (seconds, not minutes)
                      ▼
              Snowflake Table (queryable)

Latency characteristics:

  • Standard Snowpipe (file-based): 1–5 minutes
  • Snowpipe Streaming: 5–30 seconds
  • Direct row insert via JDBC: seconds (but does not scale to high throughput)

Schema evolution: Snowflake supports ADD COLUMN without downtime. The Kafka connector can be configured to automatically add new columns when CDC events contain new fields. DROP COLUMN and type changes require manual intervention.

Deduplication: Snowpipe Streaming provides at-least-once delivery. Deduplicate in Snowflake using MERGE statements or stream-based deduplication in dbt. Include the CDC event’s _change_seq_num or lsn (log sequence number) as a column to detect and discard duplicates.
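The discard-duplicates logic behind that MERGE can be illustrated in plain Python. This is a sketch of the idea only - in practice the comparison runs inside Snowflake SQL - and it assumes the `_cdc_lsn` column from the table design below:

```python
def dedupe_by_lsn(events, pk="customer_id"):
    """Keep only the highest-LSN event per primary key, discarding
    redelivered duplicates and stale out-of-order versions."""
    latest = {}
    for e in events:
        k = e[pk]
        if k not in latest or e["_cdc_lsn"] > latest[k]["_cdc_lsn"]:
            latest[k] = e
    return list(latest.values())

events = [
    {"customer_id": 1, "tier": "gold", "_cdc_lsn": 10},
    {"customer_id": 1, "tier": "gold", "_cdc_lsn": 10},      # redelivered duplicate
    {"customer_id": 1, "tier": "platinum", "_cdc_lsn": 12},  # newer version wins
    {"customer_id": 2, "tier": "silver", "_cdc_lsn": 11},
]
deduped = dedupe_by_lsn(events)
```

Because the LSN is monotonically increasing per source, "highest LSN wins" is safe even when at-least-once delivery replays a batch.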

Recommended table design for CDC targets:

CREATE TABLE customers_cdc (
    customer_id     VARCHAR,
    email           VARCHAR,
    tier            VARCHAR,
    _cdc_op         VARCHAR,    -- 'c' (create), 'u' (update), 'd' (delete)
    _cdc_lsn        BIGINT,     -- source log sequence number
    _cdc_timestamp  TIMESTAMP,  -- event timestamp from source
    _loaded_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

BigQuery

Preferred ingestion API: Storage Write API (committed mode)

The BigQuery Storage Write API provides exactly-once semantics and second-range latency. The legacy streaming insert API is now deprecated for new pipelines. The committed write mode creates a writer stream, appends rows, and flushes on commit - similar in concept to Snowpipe Streaming.

Architecture:

CDC → Kafka → BigQuery Kafka Connector (Storage Write API)
                      │
                      ▼
              BigQuery Committed Write Stream
                      │ (seconds)
                      ▼
              BigQuery Table (immediately queryable)

Latency characteristics:

  • Legacy streaming insert: seconds (deprecated, higher cost)
  • Storage Write API (committed): 5–15 seconds
  • Storage Write API (pending mode): batch-like, lower cost, higher latency

Partitioning strategy: Partition BigQuery tables by _cdc_timestamp or DATE(_cdc_timestamp) to improve query performance and control retention costs. Cluster by the primary key column for fast point lookups.

Schema evolution: BigQuery supports adding nullable columns automatically. Use the connector’s schema update option to propagate new columns from CDC events. Non-nullable columns and type changes require migration scripts.


Databricks Delta Lake

Preferred ingestion API: Delta Lake auto loader + structured streaming, or Delta Kafka connector

Delta Lake on Databricks supports ACID transactions, schema evolution, and time travel - making it well-suited for CDC targets that need historical queryability.

Two ingestion approaches:

Option A: Kafka → Delta (streaming write)

# Databricks Structured Streaming
spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "prod.orders.public.customers") \
    .load() \
    .selectExpr("CAST(key AS STRING) AS key",
                "CAST(value AS STRING) AS value") \
    .writeStream \
    .format("delta") \
    .option("checkpointLocation", "/delta/checkpoints/customers") \
    .toTable("customers_raw")

Option B: Apply CDC as MERGE (upsert)

from delta.tables import DeltaTable

# Use with foreachBatch: stream.writeStream.foreachBatch(upsert_to_delta)
def upsert_to_delta(batch_df, batch_id):
    target = DeltaTable.forName(spark, "customers")
    target.alias("t").merge(
        batch_df.alias("s"),
        "t.customer_id = s.customer_id"
    ).whenMatchedDelete(condition="s._cdc_op = 'd'") \
     .whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll(condition="s._cdc_op != 'd'") \
     .execute()
    # Clause order matters: the conditional delete must come before the
    # unconditional update, or the update clause matches every row first
    # and the delete never fires. The insert condition skips delete events
    # for rows the target has never seen.

Apache Iceberg: The pattern is identical to Delta Lake. Iceberg’s MERGE INTO support and its open format (readable by Snowflake, BigQuery, and Trino without copying data) make it increasingly preferred for multi-engine analytical platforms.


ClickHouse

Preferred ingestion API: Native protocol / HTTP bulk insert

ClickHouse is purpose-built for high-throughput analytical ingestion and achieves sub-second query latency at scale. CDC delivery to ClickHouse requires careful attention to its table engine and deduplication model.

Recommended table engine for CDC targets: ReplacingMergeTree

CREATE TABLE customers
(
    customer_id    UInt64,
    email          String,
    tier           String,
    _cdc_version   UInt64,    -- use LSN or Unix timestamp millis
    _is_deleted    UInt8 DEFAULT 0,
    _updated_at    DateTime
)
ENGINE = ReplacingMergeTree(_cdc_version)
ORDER BY (customer_id);

ReplacingMergeTree deduplicates rows with the same ORDER BY key during background merges, keeping the row with the highest version number. This handles CDC upserts naturally.

Handling deletes: ReplacingMergeTree has no native concept of a delete event. The standard pattern is a soft delete: write a new row version with _is_deleted = 1 and filter deleted rows out in queries:

SELECT * FROM customers FINAL WHERE _is_deleted = 0;

The FINAL keyword forces deduplication at query time rather than waiting for background merges.

Latency: ClickHouse with direct streaming (no Kafka intermediate) achieves end-to-end latency under 1 second for most workloads. Batch size tuning matters: larger batches (thousands of rows) are more efficient than row-by-row inserts.
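The batching advice above can be sketched as a small insert buffer. The flush_fn callback is a stand-in for a real ClickHouse client call (e.g., an HTTP bulk insert), and the max_rows value is illustrative:

```python
class InsertBuffer:
    """Accumulate CDC rows and flush them in batches. Row-by-row inserts
    into ClickHouse create one data part per insert and generate heavy
    background-merge pressure; batching amortizes that cost."""

    def __init__(self, flush_fn, max_rows=5000):
        self.flush_fn = flush_fn  # called with a list of rows
        self.max_rows = max_rows
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self):
        if self.rows:
            self.flush_fn(self.rows)
            self.rows = []

# Demo: collect flushed batches in a list instead of inserting
batches = []
buf = InsertBuffer(batches.append, max_rows=3)
for i in range(7):
    buf.add({"customer_id": i})
buf.flush()  # flush the final partial batch
```

A production version would also flush on a timer so a quiet stream does not hold rows indefinitely.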


Elasticsearch / OpenSearch

Preferred ingestion API: Bulk API via Kafka Connect Elasticsearch connector

Elasticsearch is the standard target for full-text search and log analytics CDC use cases. The delivery pattern is typically hub-and-spoke with the Elasticsearch Kafka connector.

Index design for CDC targets:

{
  "mappings": {
    "properties": {
      "customer_id":    { "type": "keyword" },
      "email":          { "type": "keyword" },
      "name":           { "type": "text" },
      "tier":           { "type": "keyword" },
      "_cdc_timestamp": { "type": "date" },
      "_cdc_op":        { "type": "keyword" }
    }
  }
}

Handling deletes: The Elasticsearch connector supports tombstone record handling - a CDC delete event (with null value) triggers a document delete in the index. Configure the connector’s behavior.on.null.values to delete.
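The connector's tombstone handling amounts to the following mapping from a Kafka record to Bulk API actions. This is a hand-rolled sketch of what the connector does internally, not its actual code; the bulk_action name is made up:

```python
import json

def bulk_action(record, index):
    """Turn one Kafka (key, value) record into Elasticsearch Bulk API
    lines. A null value (a tombstone) becomes a document delete."""
    key, value = record
    if value is None:
        return [json.dumps({"delete": {"_index": index, "_id": key}})]
    return [
        json.dumps({"index": {"_index": index, "_id": key}}),  # action line
        value,                                                  # document line
    ]

lines = bulk_action(("42", '{"customer_id": "42", "tier": "gold"}'), "customers")
tomb = bulk_action(("42", None), "customers")
```

Using the CDC primary key as the document `_id` is what makes redelivered events idempotent: re-indexing the same document simply overwrites it.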

Schema evolution: Elasticsearch uses dynamic mapping by default, which will infer types for new fields. For production, use explicit mappings and update them before new fields appear in CDC events to avoid mapping conflicts.


Latency and Throughput Comparison

Pattern                                            Typical Latency           Max Throughput   Operational Complexity
Direct: CDC → ClickHouse                           < 1 second                Very high        Low
Direct: CDC → Elasticsearch                        1–3 seconds               High             Low
Hub-and-spoke: → Snowflake (Snowpipe Streaming)    5–30 seconds              High             Medium
Hub-and-spoke: → BigQuery (Storage Write API)      5–15 seconds              High             Medium
Hub-and-spoke: → Delta Lake (streaming)            10–60 seconds             High             Medium
Transform-in-flight: Flink → any destination       +5–30 seconds over base   High             High

Exactly-Once vs. At-Least-Once Delivery

CDC pipelines almost universally provide at-least-once delivery: in the event of a failure and restart, some events may be redelivered. For idempotent destinations (ClickHouse ReplacingMergeTree, Elasticsearch document IDs, Snowflake MERGE), at-least-once is sufficient - duplicates are handled by the destination’s deduplication mechanism.

Exactly-once delivery requires transactional coordination between the source, the broker, and the destination. Flink provides end-to-end exactly-once when paired with a transactional Kafka source and a transactional sink (Snowflake, Iceberg). This eliminates the need for destination-side deduplication but adds latency (transactions commit in checkpoint intervals, typically 1–30 seconds) and operational complexity.

For most CDC delivery use cases, at-least-once with idempotent destination writes is the pragmatic choice.


Schema Change Handling Across Destinations

Schema changes - column additions, renames, type changes - are the most operationally sensitive aspect of CDC pipelines. A reliable CDC platform like Streamkap propagates schema changes automatically, pausing affected pipelines if a destination cannot accept the new schema and alerting operators.

Destination     Column Add                    Column Drop           Type Change
Snowflake       Automatic                     Manual                Manual
BigQuery        Automatic (nullable)          Manual                Manual
ClickHouse      Requires ALTER TABLE          Requires ALTER TABLE  Manual
Elasticsearch   Auto (dynamic mapping)        Index rebuild         Index rebuild
Delta Lake      Automatic (schema evolution)  Manual                Manual
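For destinations in the "Requires ALTER TABLE" column, the propagation step is mechanical: diff the old and new event schemas and emit the missing column definitions. A minimal sketch, assuming schemas are represented as name-to-type dicts and using ClickHouse DDL syntax for illustration:

```python
def added_columns(old, new):
    """Columns present in the new CDC event schema but not the old one."""
    return {name: typ for name, typ in new.items() if name not in old}

def clickhouse_alters(table, old, new):
    """Generate ALTER TABLE ... ADD COLUMN statements for a destination
    that does not add columns automatically."""
    return [
        f"ALTER TABLE {table} ADD COLUMN {name} {typ}"
        for name, typ in added_columns(old, new).items()
    ]

stmts = clickhouse_alters(
    "customers",
    {"customer_id": "UInt64", "email": "String"},
    {"customer_id": "UInt64", "email": "String", "tier": "String"},
)
```

Drops and type changes are deliberately absent: as the table above shows, those are manual operations everywhere, and automating them risks data loss.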

Architecture Decision Guide

When choosing a CDC delivery pattern, answer these questions in order:

  1. How many destinations? One → consider direct streaming. More than one → hub-and-spoke.
  2. What latency is required? Sub-second → direct or direct-to-ClickHouse/Elasticsearch. Seconds → Snowpipe Streaming / Storage Write API. Minutes → any pattern.
  3. Do events need transformation before landing? Yes → add transform-in-flight tier. No → skip Flink complexity.
  4. Do consumers need different schemas? Yes → Kafka + per-consumer transformation. No → single sink.
  5. Is replay required? Yes → route through Kafka for configurable retention. No → direct is fine.
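The first-order branches of this decision guide can be codified as a simple heuristic. This deliberately ignores latency SLAs, per-consumer schemas, and team capacity, which the questions above also weigh:

```python
def choose_pattern(destinations, needs_transform, needs_replay):
    """First-pass CDC delivery pattern choice from three of the
    questions in the decision guide."""
    if needs_transform:
        return "transform-in-flight"   # processing tier required
    if destinations > 1 or needs_replay:
        return "hub-and-spoke"         # Kafka for fan-out and/or replay
    return "direct"                    # single target, lowest latency
```

For example, one ClickHouse target with no transformation or replay needs maps to direct streaming, while three analytical targets map to hub-and-spoke.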

There is no universally correct pattern. Direct streaming minimizes complexity and latency. Hub-and-spoke maximizes flexibility. Transform-in-flight enables data product architecture. The right answer depends on how many consumers you serve, what SLAs they require, and how much operational overhead your team can sustain.