Multi-Source CDC to a Single Destination: Merging Streams
Learn how to merge CDC streams from multiple databases into a single destination table. Handle schema conflicts, ordering guarantees, and identity resolution across sources.
Most organizations do not run a single database. They run dozens, sometimes hundreds, spread across regions, business units, application services, and technology generations. A company might have customer records in a PostgreSQL cluster in the US, order data in MySQL across three European regions, and product catalog information in MongoDB. Each database is authoritative for its own domain, but no single system holds the complete picture.
Getting that complete picture requires merging change streams from all of those sources into a single destination - a warehouse, a lakehouse, or an analytical database where the data can be queried together. This is multi-source CDC, and while the concept is straightforward, the execution involves a set of challenges that are easy to underestimate.
This guide walks through the architecture, the pitfalls, and the practical techniques for building reliable multi-source CDC pipelines that merge cleanly into a single destination.
Why Multi-Source CDC Is Increasingly Common
Three trends have made multi-source CDC a near-universal requirement for data teams.
Microservices and database-per-service. The microservices architecture pattern encourages each service to own its database. Your authentication service runs PostgreSQL, your billing service runs MySQL, your notifications service runs DynamoDB. Each is optimized for its workload, but the analytics team needs to join data across all of them. Batch ETL can handle this, but the latency is measured in hours. CDC brings it down to seconds.
Regional and multi-tenant databases. Global companies often shard by geography. A SaaS platform might run separate database clusters for North America, Europe, and Asia-Pacific, with identical schemas but different data. Consolidating these into a single analytical store is essential for global reporting, capacity planning, and customer success.
Legacy-to-modern migration. During database migrations, both the old and new systems run simultaneously for weeks or months. CDC from both systems ensures that the analytical warehouse stays current regardless of which system is handling production traffic at any given moment.
Use Cases
Regional Database Consolidation
A retail company operates PostgreSQL databases in three regions: us-east, eu-west, and ap-southeast. Each database has the same orders table with the same schema. The analytics team needs a single global_orders table in Snowflake that contains every order from every region, updated in real time.
Microservice Data Aggregation
An e-commerce platform has separate services for users (PostgreSQL), orders (MySQL), and payments (MongoDB). Building a revenue dashboard requires joining data from all three. Rather than building point-to-point integrations between services, CDC streams from each database are merged into the warehouse where the join happens.
Customer 360
A financial services company stores customer identity in one system, transaction history in another, and support tickets in a third. Creating a unified customer view means merging CDC streams from all three sources into a single customer_360 table, resolving the same customer across systems that use different identifiers.
Migration Cutover
An organization is migrating from Oracle to PostgreSQL. During the transition period, some tables are being served by Oracle and others by PostgreSQL. CDC from both databases feeds the same warehouse, ensuring continuity for downstream analytics regardless of migration progress.
Architecture Patterns
There are three primary patterns for merging CDC streams from multiple sources into a single destination.
Fan-In via Kafka Topics
Each source connector writes to its own Kafka topic. A stream processing layer (Flink, ksqlDB, or a custom consumer) reads from all topics, normalizes the schemas, and writes to a single output topic that the destination connector consumes.
PostgreSQL (us-east) ──CDC──▶ topic: orders.us-east ──┐
PostgreSQL (eu-west) ──CDC──▶ topic: orders.eu-west ──┼──▶ Flink ──▶ topic: global_orders ──▶ Snowflake
PostgreSQL (ap-south) ──CDC──▶ topic: orders.ap-south ─┘
This pattern gives you the most control. The stream processing layer can handle schema normalization, deduplication, filtering, and enrichment before data reaches the destination. The trade-off is operational complexity: you are running and maintaining a stream processing cluster.
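As an illustration, the schema-normalization step inside a fan-in consumer can be sketched as a pure function, here in Python. The topic names and field mappings below are hypothetical, standing in for whatever your regional connectors actually emit:

```python
# Sketch of the normalization step in a fan-in consumer.
# Topic names and field mappings are illustrative, not a real schema.
CANONICAL_FIELDS = {
    "orders.us-east": {"order_id": "order_id", "customer_email": "email", "created_at": "order_date"},
    "orders.eu-west": {"id": "order_id", "email_address": "email", "order_timestamp": "order_date"},
}

def normalize(topic: str, record: dict) -> dict:
    """Map a source-specific CDC record to the canonical output schema."""
    mapping = CANONICAL_FIELDS[topic]
    out = {dst: record[src] for src, dst in mapping.items()}
    out["_source_topic"] = topic  # keep lineage for debugging
    return out
```

Keeping this step as a pure function of (topic, record) makes it easy to unit-test each source's mapping before deploying the consumer.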
Connector-Level Transformations
Some CDC platforms support transformations at the connector level, eliminating the need for a separate stream processing layer. Each source connector applies field renames, type conversions, and metadata injection as part of its processing, then writes directly to the destination.
PostgreSQL (us-east) ──CDC + Transform──▶ Snowflake (global_orders)
MySQL (eu-west) ──CDC + Transform──▶ Snowflake (global_orders)
MongoDB (ap-south) ──CDC + Transform──▶ Snowflake (global_orders)
Streamkap supports this pattern through built-in transformations (field renaming, type mapping, computed columns, and routing) so you can merge multiple sources into a single destination table without deploying Flink or ksqlDB.
Destination-Level MERGE
Each source writes to a separate staging table in the destination. A scheduled SQL job (dbt, stored procedure, or orchestrator) merges the staging tables into a final unified table using MERGE or INSERT ... ON CONFLICT statements.
-- Destination-level merge in Snowflake
MERGE INTO global_orders AS target
USING orders_us_east_staging AS source
  ON target.source_system = 'us-east' AND target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;
This pattern is simpler to set up but introduces latency (the merge runs on a schedule, not continuously) and puts transformation logic in the warehouse, where compute costs can add up.
Schema Alignment
When sources use different database engines, schema alignment is the first challenge. The same logical entity (an order, a customer, a product) may be represented differently across systems.
Column Renaming
Source A calls it customer_email. Source B calls it email_address. Source C calls it usr_email. Before merging, all three must be mapped to a single canonical column name.
-- Flink SQL example: normalize column names
SELECT
    order_id,
    customer_email AS email,       -- Source A
    created_at AS order_date
FROM source_a_orders
UNION ALL
SELECT
    id AS order_id,
    email_address AS email,        -- Source B
    order_timestamp AS order_date
FROM source_b_orders;
Type Normalization
MySQL’s DATETIME has no timezone. PostgreSQL’s TIMESTAMPTZ does. MongoDB stores dates as BSON Date objects (milliseconds since epoch). When merging these into a single destination column, you need a consistent type, typically TIMESTAMP_TZ or TIMESTAMP_NTZ with an explicit UTC conversion applied during transformation.
Similarly, MySQL’s TINYINT(1) is often used as a boolean, while PostgreSQL has a native BOOLEAN type, and MongoDB uses a BSON boolean. The transformation layer must normalize all three to the destination’s boolean representation.
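The normalization logic can be sketched in Python, as it might appear in a custom transformation step. One assumption is baked in and labeled below: naive MySQL DATETIMEs are treated as already being UTC, which you must verify against your own servers:

```python
from datetime import datetime, timezone

def to_utc_timestamp(value):
    """Normalize source-specific date representations to an aware UTC datetime.
    ASSUMPTION: naive datetimes (MySQL DATETIME) are already in UTC --
    adjust if your servers store local time."""
    if isinstance(value, datetime):
        if value.tzinfo is None:               # MySQL DATETIME: no timezone
            return value.replace(tzinfo=timezone.utc)
        return value.astimezone(timezone.utc)  # PostgreSQL TIMESTAMPTZ
    if isinstance(value, (int, float)):        # MongoDB BSON Date: ms since epoch
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    raise TypeError(f"unsupported date value: {value!r}")

def to_bool(value):
    """Normalize MySQL TINYINT(1), native booleans, and BSON booleans."""
    if isinstance(value, bool):
        return value
    if isinstance(value, int):                 # MySQL TINYINT(1): 0/1
        return value != 0
    raise TypeError(f"unsupported boolean value: {value!r}")
```

Raising on unknown types, rather than passing values through, surfaces new source types as transformation failures instead of silently corrupting the destination column.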
Handling Columns That Exist in One Source but Not Another
Source A has a loyalty_tier column. Source B does not. The merged table should include loyalty_tier, but rows from Source B will have NULL for that field. This is straightforward in SQL (NULL AS loyalty_tier), but it requires maintaining a superset schema (the union of all columns across all sources) and keeping that schema updated as individual sources evolve.
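A minimal sketch of that superset-schema maintenance, assuming each source's schema is available as a list of column names (the helper names here are ours, not from any library):

```python
def superset_schema(*schemas):
    """Union of column names across all sources, order-preserving."""
    seen = []
    for schema in schemas:
        for col in schema:
            if col not in seen:
                seen.append(col)
    return seen

def pad_row(row: dict, columns: list) -> dict:
    """Emit every superset column; a source missing a column contributes NULL."""
    return {col: row.get(col) for col in columns}
```

Recomputing the superset whenever any source's schema changes keeps the merged table's shape in sync with source evolution.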
Identity Resolution
The most consequential decision in a multi-source merge is how you identify the same entity across systems.
Composite Keys
The simplest approach is to use a composite primary key that includes the source identifier:
CREATE TABLE global_customers (
    source_system VARCHAR(50),   -- 'us-east', 'eu-west', 'ap-south'
    original_id   BIGINT,
    email         VARCHAR(255),
    full_name     VARCHAR(255),
    updated_at    TIMESTAMP_TZ,
    PRIMARY KEY (source_system, original_id)
);
This guarantees uniqueness: customer ID 12345 from us-east and customer ID 12345 from eu-west are treated as separate rows. This is correct when the sources contain genuinely different entities (regional shards with no cross-region overlap).
UUID Generation
When you need a single global identifier, generate a UUID in the transformation layer:
SELECT
    -- Snowflake's two-argument UUID_STRING(namespace, name) produces a
    -- deterministic version-5 UUID; any fixed namespace UUID works, as
    -- long as it never changes once data has been written
    UUID_STRING('6ba7b811-9dad-11d1-80b4-00c04fd430c8',
                CONCAT('us-east', ':', CAST(customer_id AS VARCHAR))) AS global_id,
    'us-east' AS source_system,
    customer_id AS original_id,
    email,
    full_name
FROM us_east_customers;
The deterministic hash ensures that the same source and original ID always produce the same UUID, making the operation idempotent.
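If ID generation happens in application code rather than SQL, the standard library's `uuid.uuid5` gives the same deterministic property. The namespace string below is a made-up example; the only requirement is that it never changes once data has been written:

```python
import uuid

# Illustrative pipeline namespace; any fixed UUID works, but it must
# stay constant for the life of the pipeline.
PIPELINE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "cdc.example.internal")

def global_id(source_system: str, original_id) -> str:
    """Deterministic: the same source + original ID always yields the same UUID."""
    return str(uuid.uuid5(PIPELINE_NAMESPACE, f"{source_system}:{original_id}"))
```

Because the function is deterministic, replaying or re-delivering a CDC event regenerates the identical `global_id`, so upserts stay idempotent.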
Cross-Source Entity Matching
When the same logical entity exists in multiple sources under different IDs (the same customer in both the billing system and the support system), you need a matching strategy. Common approaches include matching on email address, phone number, or an external identifier (like an SSO user ID). This matching logic belongs in the transformation layer and should produce a canonical entity_id that all sources map to.
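A deliberately simplified sketch of email-based matching, assuming records carry an `email` field (the field and ID format are illustrative; production matchers usually layer several attributes and handle missing values):

```python
def resolve_entities(records):
    """Assign a canonical entity_id by matching on normalized email.
    `records` is an iterable of dicts with at least an 'email' field."""
    canonical = {}   # normalized email -> canonical entity_id
    resolved = []
    for rec in records:
        key = rec["email"].strip().lower()
        entity_id = canonical.setdefault(key, f"cust-{len(canonical) + 1}")
        resolved.append({**rec, "entity_id": entity_id})
    return resolved
```

The same customer arriving from billing and support under different source IDs then lands under one `entity_id` in the merged table.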
Ordering Guarantees
Each database’s CDC stream has a well-defined internal order: transactions are captured in the sequence they were committed. But there is no global order across independent databases. A change in the US database at 2026-02-25T10:00:00Z and a change in the EU database at 2026-02-25T10:00:00Z have no ordering relationship; their clocks may not even agree within seconds.
Timestamp-Based Ordering
The practical solution is to use each record’s source timestamp as the ordering key and accept that ordering across sources is approximate, not exact. This works well for analytics where the question is “what happened and roughly when” rather than “what was the exact global sequence of events.”
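Since each per-source stream is already ordered, a merge by source timestamp is a k-way merge, which the standard library handles directly. A minimal sketch, representing records as `(timestamp, payload)` tuples:

```python
import heapq

def merge_by_source_time(*streams):
    """Merge already-ordered per-source streams by each record's source
    timestamp. Ordering across sources is approximate: it trusts each
    database's clock."""
    return list(heapq.merge(*streams, key=lambda rec: rec[0]))
```

`heapq.merge` never buffers whole streams, so the same approach works on unbounded iterators in a streaming consumer.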
Conflict Resolution
When two sources update the same logical entity at nearly the same time, you need a conflict resolution strategy. The most common approaches are:
Last-write-wins (LWW). The update with the latest timestamp overwrites previous values. This is simple and works well when updates to the same entity from different sources are rare.
MERGE INTO global_customers AS target
USING incoming_changes AS source
  ON target.source_system = source.source_system
 AND target.original_id = source.original_id
WHEN MATCHED AND source.updated_at > target.updated_at
  THEN UPDATE SET
    email = source.email,
    full_name = source.full_name,
    updated_at = source.updated_at
WHEN NOT MATCHED
  THEN INSERT (source_system, original_id, email, full_name, updated_at)
  VALUES (source.source_system, source.original_id, source.email, source.full_name, source.updated_at);
Source priority. One source is designated as authoritative for specific fields. The billing system’s email is always preferred over the support system’s email, regardless of timestamps.
Manual reconciliation. Conflicting updates are flagged in a reconciliation queue for human review. This is appropriate for high-value entities (large financial accounts, enterprise customers) where automated resolution is too risky.
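The source-priority strategy above can be sketched as a field-level merge in Python. The priority table and field names are hypothetical, chosen to mirror the billing-vs-support example:

```python
# Field-level source priority (illustrative): billing wins for 'email',
# support wins for 'phone'.
FIELD_PRIORITY = {
    "email": ["billing", "support"],
    "phone": ["support", "billing"],
}

def merge_with_priority(records_by_source: dict) -> dict:
    """Pick each field's value from the highest-priority source that has it."""
    merged = {}
    for field, priority in FIELD_PRIORITY.items():
        for source in priority:
            value = records_by_source.get(source, {}).get(field)
            if value is not None:
                merged[field] = value
                break
    return merged
```

Unlike last-write-wins, this resolves conflicts per field, so one source can be authoritative for contact details while another is authoritative for status.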
Source Metadata
Every multi-source merge pipeline should inject metadata columns that identify where each row came from. Without these columns, debugging becomes nearly impossible once data from multiple sources is mingled in the same table.
The essential metadata columns are:
_source_database VARCHAR(100) -- 'prod-us-east-pg', 'prod-eu-west-mysql'
_source_table VARCHAR(100) -- 'public.orders', 'ecommerce.orders'
_captured_at TIMESTAMP_TZ -- When the CDC event was captured
These columns serve three purposes. First, lineage: you can always trace a row back to its origin. Second, debugging: when a data quality issue surfaces, you can immediately narrow it to a specific source. Third, filtering: downstream queries can include or exclude specific sources as needed.
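Injection itself is trivial; what matters is that it happens on every row, at capture or transform time, before streams are mingled. A minimal sketch:

```python
from datetime import datetime, timezone

def inject_metadata(row: dict, source_database: str, source_table: str) -> dict:
    """Attach lineage columns to a row before it is merged downstream."""
    return {
        **row,
        "_source_database": source_database,
        "_source_table": source_table,
        "_captured_at": datetime.now(timezone.utc),
    }
```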
Practical Example: Three Regional PostgreSQL Databases to Snowflake
Consider a concrete scenario. A logistics company operates three PostgreSQL databases (us, eu, and apac), each containing an orders table with the same schema. The goal is a single global_orders table in Snowflake.
Step 1: Configure three CDC source connectors. Each connector reads from its respective PostgreSQL instance’s WAL (write-ahead log) and captures changes to the orders table. With Streamkap, each connector is configured through the UI with the database credentials and table selection.
Step 2: Apply transformations. Each connector adds metadata columns identifying the source region and renames any region-specific columns to the canonical schema. A computed column generates the composite key.
Step 3: Route to a single destination table. All three connectors write to the same global_orders table in Snowflake, using the composite key (source_region, order_id) for upsert logic.
Step 4: Validate. Row counts from each source are compared against the destination, partitioned by _source_database. Any discrepancy triggers an alert.
The resulting table looks like this:
source_region | order_id | customer_id | total_amount | currency | order_date | _source_database | _captured_at
--------------+----------+-------------+--------------+----------+---------------------+-----------------------+---------------------
us | 10001 | 5042 | 149.99 | USD | 2026-02-25 08:14:00 | prod-us-pg | 2026-02-25 08:14:01
eu | 10001 | 8831 | 89.50 | EUR | 2026-02-25 09:22:00 | prod-eu-pg | 2026-02-25 09:22:01
apac | 10001 | 2210 | 12500.00 | JPY | 2026-02-25 17:05:00 | prod-apac-pg | 2026-02-25 17:05:01
Note that order_id = 10001 appears three times, once from each region. The composite key prevents collisions while preserving the original identifiers.
Monitoring Multi-Source Pipelines
Multi-source pipelines have more failure modes than single-source ones. Monitoring must account for the independence of each source stream.
Per-Source Lag Tracking
Track replication lag separately for each source connector. A global “pipeline is healthy” indicator masks problems: one source can be hours behind while the others are current. Chart per-source lag independently and alert when any single source exceeds its SLA.
Schema Drift Detection
When multiple sources contribute to the same destination table, a schema change in any one of them affects the merged output. Monitor each source for schema changes and validate that any new columns, type changes, or dropped columns are handled by the transformation layer before they reach the destination.
Row Count Reconciliation
Periodically compare the row count in each source database against the count of rows in the destination table filtered by _source_database. A growing discrepancy indicates missed events, duplication, or a filtering bug in the transformation layer.
-- Reconciliation query in Snowflake
SELECT
    _source_database,
    COUNT(*) AS destination_rows,
    MAX(_captured_at) AS latest_capture
FROM global_orders
GROUP BY _source_database
ORDER BY _source_database;
Compare these counts against the source databases. If prod-eu-pg shows 1,042,000 orders in the source but only 1,040,500 in the destination, you have a 1,500-row gap to investigate.
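The comparison step that consumes those counts can be a small scheduled check; a sketch, assuming per-source counts have already been collected into dicts:

```python
def reconcile(source_counts: dict, destination_counts: dict, tolerance: int = 0) -> list:
    """Compare per-source row counts; return (source, gap) pairs whose
    absolute gap exceeds the tolerance."""
    alerts = []
    for source, expected in source_counts.items():
        actual = destination_counts.get(source, 0)
        if abs(expected - actual) > tolerance:
            alerts.append((source, expected - actual))
    return alerts
```

A small nonzero tolerance avoids paging on rows that are merely in flight between the source count and the destination count.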
Dead Letter Queues
Events that fail transformation or conflict resolution should be routed to a dead letter queue rather than dropped silently. Monitor the DLQ size per source and investigate any growth promptly. A healthy multi-source pipeline has an empty or near-empty DLQ.
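The routing pattern is simple to sketch: wrap the transform, catch failures, and preserve both the event and the error (here the DLQ is just a list; in practice it would be a Kafka topic or table):

```python
def process_with_dlq(events, transform, dlq: list):
    """Apply `transform` to each event; failures are routed to the DLQ
    with the error attached instead of being dropped silently."""
    out = []
    for event in events:
        try:
            out.append(transform(event))
        except Exception as exc:
            dlq.append({"event": event, "error": str(exc)})
    return out
```

Keeping the original event alongside the error makes replay possible once the transformation bug is fixed.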
Getting Started
Multi-source CDC is not inherently more difficult than single-source CDC. It is the same fundamental process repeated across sources with an alignment layer in between. The complexity comes from the details: schema differences, identity resolution, ordering semantics, and monitoring at scale.
The teams that succeed with multi-source merges follow a consistent playbook. They standardize on composite keys early. They inject source metadata from day one. They monitor each source independently. And they use a platform that treats multi-source as a first-class pattern rather than an afterthought.
Streamkap supports running multiple source connectors across PostgreSQL, MySQL, MongoDB, DynamoDB, and more, each with built-in transformations for schema alignment and metadata injection, writing to a single destination table in Snowflake, BigQuery, Databricks, or ClickHouse. If you are building a multi-source pipeline, start a free trial and configure your first merge in minutes.