Multi-Source CDC to a Single Destination: Merging Streams
Learn how to merge CDC streams from multiple databases into a single destination table. Handle schema conflicts, ordering guarantees, and identity resolution across sources.
Most organizations do not run a single database. They run dozens, sometimes hundreds, spread across regions, business units, application services, and technology generations. A company might have customer records in a PostgreSQL cluster in the US, order data in MySQL across three European regions, and product catalog information in MongoDB. Each database is authoritative for its own domain, but no single system holds the complete picture.
Getting that complete picture requires merging change streams from all of those sources into a single destination - a warehouse, a lakehouse, or an analytical database where the data can be queried together. This is multi-source CDC, and while the concept is straightforward, the execution involves a set of challenges that are easy to underestimate.
This guide walks through the architecture, the pitfalls, and the practical techniques for building reliable multi-source CDC pipelines that merge cleanly into a single destination.
Why Multi-Source CDC Is Increasingly Common
Three trends have made multi-source CDC a near-universal requirement for data teams.
Microservices and database-per-service. The microservices architecture pattern encourages each service to own its database. Your authentication service runs PostgreSQL, your billing service runs MySQL, your notifications service runs DynamoDB. Each is optimized for its workload, but the analytics team needs to join data across all of them. Batch ETL can handle this, but the latency is measured in hours. CDC brings it down to seconds.
Regional and multi-tenant databases. Global companies often shard by geography. A SaaS platform might run separate database clusters for North America, Europe, and Asia-Pacific, with identical schemas but different data. Consolidating these into a single analytical store is essential for global reporting, capacity planning, and customer success.
Legacy-to-modern migration. During database migrations, both the old and new systems run simultaneously for weeks or months. CDC from both systems ensures that the analytical warehouse stays current regardless of which system is handling production traffic at any given moment.
Use Cases
Regional Database Consolidation
A retail company operates PostgreSQL databases in three regions: us-east, eu-west, and ap-southeast. Each database has the same orders table with the same schema. The analytics team needs a single global_orders table in Snowflake that contains every order from every region, updated in real time.
Microservice Data Aggregation
An e-commerce platform has separate services for users (PostgreSQL), orders (MySQL), and payments (MongoDB). Building a revenue dashboard requires joining data from all three. Rather than building point-to-point integrations between services, CDC streams from each database are merged into the warehouse where the join happens.
Customer 360
A financial services company stores customer identity in one system, transaction history in another, and support tickets in a third. Creating a unified customer view means merging CDC streams from all three sources into a single customer_360 table, resolving the same customer across systems that use different identifiers.
Migration Cutover
An organization is migrating from Oracle to PostgreSQL. During the transition period, some tables are being served by Oracle and others by PostgreSQL. CDC from both databases feeds the same warehouse, ensuring continuity for downstream analytics regardless of migration progress.
Architecture Patterns
There are three primary patterns for merging CDC streams from multiple sources into a single destination.
Fan-In via Kafka Topics
Each source connector writes to its own Kafka topic. A stream processing layer (Flink, ksqlDB, or a custom consumer) reads from all topics, normalizes the schemas, and writes to a single output topic that the destination connector consumes.
PostgreSQL (us-east) ──CDC──▶ topic: orders.us-east ──┐
PostgreSQL (eu-west) ──CDC──▶ topic: orders.eu-west ──┼──▶ Flink ──▶ topic: global_orders ──▶ Snowflake
PostgreSQL (ap-south) ──CDC──▶ topic: orders.ap-south ─┘
This pattern gives you the most control. The stream processing layer can handle schema normalization, deduplication, filtering, and enrichment before data reaches the destination. The trade-off is operational complexity: you are running and maintaining a stream processing cluster.
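As an illustration, the schema-normalization step inside a fan-in consumer can be sketched as a pure function, here in Python. The topic names and field mappings below are hypothetical, standing in for whatever your regional connectors actually emit:

```python
# Sketch of the normalization step in a fan-in consumer.
# Topic names and field mappings are illustrative, not a real schema.
CANONICAL_FIELDS = {
    "orders.us-east": {"order_id": "order_id", "customer_email": "email", "created_at": "order_date"},
    "orders.eu-west": {"id": "order_id", "email_address": "email", "order_timestamp": "order_date"},
}

def normalize(topic: str, record: dict) -> dict:
    """Map a source-specific CDC record to the canonical output schema."""
    mapping = CANONICAL_FIELDS[topic]
    out = {dst: record[src] for src, dst in mapping.items()}
    out["_source_topic"] = topic  # keep lineage for debugging
    return out
```

Keeping this step as a pure function of (topic, record) makes it easy to unit-test each source's mapping before deploying the consumer.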
Connector-Level Transformations
Some CDC platforms support transformations at the connector level, eliminating the need for a separate stream processing layer. Each source connector applies field renames, type conversions, and metadata injection as part of its processing, then writes directly to the destination.
PostgreSQL (us-east) ──CDC + Transform──▶ Snowflake (global_orders)
MySQL (eu-west) ──CDC + Transform──▶ Snowflake (global_orders)
MongoDB (ap-south) ──CDC + Transform──▶ Snowflake (global_orders)
Streamkap supports this pattern through built-in transformations (field renaming, type mapping, computed columns, and routing) so you can merge multiple sources into a single destination table without deploying Flink or ksqlDB.
Destination-Level MERGE
Each source writes to a separate staging table in the destination. A scheduled SQL job (dbt, stored procedure, or orchestrator) merges the staging tables into a final unified table using MERGE or INSERT ... ON CONFLICT statements.
-- Destination-level merge in Snowflake
MERGE INTO global_orders AS target
USING orders_us_east_staging AS source
  ON target.source_system = 'us-east' AND target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;
This pattern is simpler to set up but introduces latency (the merge runs on a schedule, not continuously) and puts transformation logic in the warehouse, where compute costs can add up.
Schema Alignment
When sources use different database engines, schema alignment is the first challenge. The same logical entity (an order, a customer, a product) may be represented differently across systems.
Column Renaming
Source A calls it customer_email. Source B calls it email_address. Source C calls it usr_email. Before merging, all three must be mapped to a single canonical column name.
-- Flink SQL example: normalize column names
SELECT
    order_id,
    customer_email AS email,       -- Source A
    created_at AS order_date
FROM source_a_orders
UNION ALL
SELECT
    id AS order_id,
    email_address AS email,        -- Source B
    order_timestamp AS order_date
FROM source_b_orders;
Type Normalization
MySQL’s DATETIME has no timezone. PostgreSQL’s TIMESTAMPTZ does. MongoDB stores dates as BSON Date objects (milliseconds since epoch). When merging these into a single destination column, you need a consistent type, typically TIMESTAMP_TZ or TIMESTAMP_NTZ with an explicit UTC conversion applied during transformation.
Similarly, MySQL’s TINYINT(1) is often used as a boolean, while PostgreSQL has a native BOOLEAN type, and MongoDB uses a BSON boolean. The transformation layer must normalize all three to the destination’s boolean representation.
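The normalization logic can be sketched in Python, as it might appear in a custom transformation step. One assumption is baked in and labeled below: naive MySQL DATETIMEs are treated as already being UTC, which you must verify against your own servers:

```python
from datetime import datetime, timezone

def to_utc_timestamp(value):
    """Normalize source-specific date representations to an aware UTC datetime.
    ASSUMPTION: naive datetimes (MySQL DATETIME) are already in UTC --
    adjust if your servers store local time."""
    if isinstance(value, datetime):
        if value.tzinfo is None:               # MySQL DATETIME: no timezone
            return value.replace(tzinfo=timezone.utc)
        return value.astimezone(timezone.utc)  # PostgreSQL TIMESTAMPTZ
    if isinstance(value, (int, float)):        # MongoDB BSON Date: ms since epoch
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    raise TypeError(f"unsupported date value: {value!r}")

def to_bool(value):
    """Normalize MySQL TINYINT(1), native booleans, and BSON booleans."""
    if isinstance(value, bool):
        return value
    if isinstance(value, int):                 # MySQL TINYINT(1): 0/1
        return value != 0
    raise TypeError(f"unsupported boolean value: {value!r}")
```

Raising on unknown types, rather than passing values through, surfaces new source types as transformation failures instead of silently corrupting the destination column.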
Handling Columns That Exist in One Source but Not Another
Source A has a loyalty_tier column. Source B does not. The merged table should include loyalty_tier, but rows from Source B will have NULL for that field. This is straightforward in SQL (NULL AS loyalty_tier), but it requires maintaining a superset schema (the union of all columns across all sources) and keeping that schema updated as individual sources evolve.
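A minimal sketch of that superset-schema maintenance, assuming each source's schema is available as a list of column names (the helper names here are ours, not from any library):

```python
def superset_schema(*schemas):
    """Union of column names across all sources, order-preserving."""
    seen = []
    for schema in schemas:
        for col in schema:
            if col not in seen:
                seen.append(col)
    return seen

def pad_row(row: dict, columns: list) -> dict:
    """Emit every superset column; a source missing a column contributes NULL."""
    return {col: row.get(col) for col in columns}
```

Recomputing the superset whenever any source's schema changes keeps the merged table's shape in sync with source evolution.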
Identity Resolution
The most consequential decision in a multi-source merge is how you identify the same entity across systems.
Composite Keys
The simplest approach is to use a composite primary key that includes the source identifier:
CREATE TABLE global_customers (
    source_system VARCHAR(50),   -- 'us-east', 'eu-west', 'ap-south'
    original_id   BIGINT,
    email         VARCHAR(255),
    full_name     VARCHAR(255),
    updated_at    TIMESTAMP_TZ,
    PRIMARY KEY (source_system, original_id)
);
This guarantees uniqueness: customer ID 12345 from us-east and customer ID 12345 from eu-west are treated as separate rows. This is correct when the sources contain genuinely different entities (regional shards with no cross-region overlap).
UUID Generation
When you need a single global identifier, generate a UUID in the transformation layer:
SELECT
    -- Snowflake's two-argument UUID_STRING(namespace, name) produces a
    -- deterministic version-5 UUID; any fixed namespace UUID works, as
    -- long as it never changes once data has been written
    UUID_STRING('6ba7b811-9dad-11d1-80b4-00c04fd430c8',
                CONCAT('us-east', ':', CAST(customer_id AS VARCHAR))) AS global_id,
    'us-east' AS source_system,
    customer_id AS original_id,
    email,
    full_name
FROM us_east_customers;
The deterministic hash ensures that the same source and original ID always produce the same UUID, making the operation idempotent.
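If ID generation happens in application code rather than SQL, the standard library's `uuid.uuid5` gives the same deterministic property. The namespace string below is a made-up example; the only requirement is that it never changes once data has been written:

```python
import uuid

# Illustrative pipeline namespace; any fixed UUID works, but it must
# stay constant for the life of the pipeline.
PIPELINE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "cdc.example.internal")

def global_id(source_system: str, original_id) -> str:
    """Deterministic: the same source + original ID always yields the same UUID."""
    return str(uuid.uuid5(PIPELINE_NAMESPACE, f"{source_system}:{original_id}"))
```

Because the function is deterministic, replaying or re-delivering a CDC event regenerates the identical `global_id`, so upserts stay idempotent.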
Cross-Source Entity Matching
When the same logical entity exists in multiple sources under different IDs (the same customer in both the billing system and the support system), you need a matching strategy. Common approaches include matching on email address, phone number, or an external identifier (like an SSO user ID). This matching logic belongs in the transformation layer and should produce a canonical entity_id that all sources map to.
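A deliberately simplified sketch of email-based matching, assuming records carry an `email` field (the field and ID format are illustrative; production matchers usually layer several attributes and handle missing values):

```python
def resolve_entities(records):
    """Assign a canonical entity_id by matching on normalized email.
    `records` is an iterable of dicts with at least an 'email' field."""
    canonical = {}   # normalized email -> canonical entity_id
    resolved = []
    for rec in records:
        key = rec["email"].strip().lower()
        entity_id = canonical.setdefault(key, f"cust-{len(canonical) + 1}")
        resolved.append({**rec, "entity_id": entity_id})
    return resolved
```

The same customer arriving from billing and support under different source IDs then lands under one `entity_id` in the merged table.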
Ordering Guarantees
Each database’s CDC stream has a well-defined internal order: transactions are captured in the sequence they were committed. But there is no global order across independent databases. A change in the US database at 2026-02-25T10:00:00Z and a change in the EU database at 2026-02-25T10:00:00Z have no ordering relationship; their clocks may not even agree within seconds.
Timestamp-Based Ordering
The practical solution is to use each record’s source timestamp as the ordering key and accept that ordering across sources is approximate, not exact. This works well for analytics where the question is “what happened and roughly when” rather than “what was the exact global sequence of events.”
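Since each per-source stream is already ordered, a merge by source timestamp is a k-way merge, which the standard library handles directly. A minimal sketch, representing records as `(timestamp, payload)` tuples:

```python
import heapq

def merge_by_source_time(*streams):
    """Merge already-ordered per-source streams by each record's source
    timestamp. Ordering across sources is approximate: it trusts each
    database's clock."""
    return list(heapq.merge(*streams, key=lambda rec: rec[0]))
```

`heapq.merge` never buffers whole streams, so the same approach works on unbounded iterators in a streaming consumer.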
Conflict Resolution
When two sources update the same logical entity at nearly the same time, you need a conflict resolution strategy. The most common approaches are:
Last-write-wins (LWW). The update with the latest timestamp overwrites previous values. This is simple and works well when updates to the same entity from different sources are rare.
MERGE INTO global_customers AS target
USING incoming_changes AS source
  ON target.source_system = source.source_system
 AND target.original_id = source.original_id
WHEN MATCHED AND source.updated_at > target.updated_at
  THEN UPDATE SET
    email = source.email,
    full_name = source.full_name,
    updated_at = source.updated_at
WHEN NOT MATCHED
  THEN INSERT (source_system, original_id, email, full_name, updated_at)
  VALUES (source.source_system, source.original_id, source.email, source.full_name, source.updated_at);
Source priority. One source is designated as authoritative for specific fields. The billing system’s email is always preferred over the support system’s email, regardless of timestamps.
Manual reconciliation. Conflicting updates are flagged in a reconciliation queue for human review. This is appropriate for high-value entities (large financial accounts, enterprise customers) where automated resolution is too risky.
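The source-priority strategy above can be sketched as a field-level merge in Python. The priority table and field names are hypothetical, chosen to mirror the billing-vs-support example:

```python
# Field-level source priority (illustrative): billing wins for 'email',
# support wins for 'phone'.
FIELD_PRIORITY = {
    "email": ["billing", "support"],
    "phone": ["support", "billing"],
}

def merge_with_priority(records_by_source: dict) -> dict:
    """Pick each field's value from the highest-priority source that has it."""
    merged = {}
    for field, priority in FIELD_PRIORITY.items():
        for source in priority:
            value = records_by_source.get(source, {}).get(field)
            if value is not None:
                merged[field] = value
                break
    return merged
```

Unlike last-write-wins, this resolves conflicts per field, so one source can be authoritative for contact details while another is authoritative for status.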
Source Metadata
Every multi-source merge pipeline should inject metadata columns that identify where each row came from. Without these columns, debugging becomes nearly impossible once data from multiple sources is mingled in the same table.
The essential metadata columns are:
_source_database VARCHAR(100) -- 'prod-us-east-pg', 'prod-eu-west-mysql'
_source_table VARCHAR(100) -- 'public.orders', 'ecommerce.orders'
_captured_at TIMESTAMP_TZ -- When the CDC event was captured
These columns serve three purposes. First, lineage: you can always trace a row back to its origin. Second, debugging: when a data quality issue surfaces, you can immediately narrow it to a specific source. Third, filtering: downstream queries can include or exclude specific sources as needed.
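Injection itself is trivial; what matters is that it happens on every row, at capture or transform time, before streams are mingled. A minimal sketch:

```python
from datetime import datetime, timezone

def inject_metadata(row: dict, source_database: str, source_table: str) -> dict:
    """Attach lineage columns to a row before it is merged downstream."""
    return {
        **row,
        "_source_database": source_database,
        "_source_table": source_table,
        "_captured_at": datetime.now(timezone.utc),
    }
```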
Practical Example: Three Regional PostgreSQL Databases to Snowflake
Consider a concrete scenario. A logistics company operates three PostgreSQL databases (us, eu, and apac), each containing an orders table with the same schema. The goal is a single global_orders table in Snowflake.
Step 1: Configure three CDC source connectors. Each connector reads from its respective PostgreSQL instance’s WAL (write-ahead log) and captures changes to the orders table. With Streamkap, each connector is configured through the UI with the database credentials and table selection.
Step 2: Apply transformations. Each connector adds metadata columns identifying the source region and renames any region-specific columns to the canonical schema. A computed column generates the composite key.
Step 3: Route to a single destination table. All three connectors write to the same global_orders table in Snowflake, using the composite key (source_region, order_id) for upsert logic.
Step 4: Validate. Row counts from each source are compared against the destination, partitioned by _source_database. Any discrepancy triggers an alert.
The resulting table looks like this:
source_region | order_id | customer_id | total_amount | currency | order_date | _source_database | _captured_at
--------------+----------+-------------+--------------+----------+---------------------+-----------------------+---------------------
us | 10001 | 5042 | 149.99 | USD | 2026-02-25 08:14:00 | prod-us-pg | 2026-02-25 08:14:01
eu | 10001 | 8831 | 89.50 | EUR | 2026-02-25 09:22:00 | prod-eu-pg | 2026-02-25 09:22:01
apac | 10001 | 2210 | 12500.00 | JPY | 2026-02-25 17:05:00 | prod-apac-pg | 2026-02-25 17:05:01
Note that order_id = 10001 appears three times, once from each region. The composite key prevents collisions while preserving the original identifiers.
Monitoring Multi-Source Pipelines
Multi-source pipelines have more failure modes than single-source ones. Monitoring must account for the independence of each source stream.
Per-Source Lag Tracking
Track replication lag separately for each source connector. A global “pipeline is healthy” indicator masks problems: one source can be hours behind while the others are current. Chart per-source lag independently and alert when any single source exceeds its SLA.
Schema Drift Detection
When multiple sources contribute to the same destination table, a schema change in any one of them affects the merged output. Monitor each source for schema changes and validate that any new columns, type changes, or dropped columns are handled by the transformation layer before they reach the destination.
Row Count Reconciliation
Periodically compare the row count in each source database against the count of rows in the destination table filtered by _source_database. A growing discrepancy indicates missed events, duplication, or a filtering bug in the transformation layer.
-- Reconciliation query in Snowflake
SELECT
    _source_database,
    COUNT(*) AS destination_rows,
    MAX(_captured_at) AS latest_capture
FROM global_orders
GROUP BY _source_database
ORDER BY _source_database;
Compare these counts against the source databases. If prod-eu-pg shows 1,042,000 orders in the source but only 1,040,500 in the destination, you have a 1,500-row gap to investigate.
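The comparison step that consumes those counts can be a small scheduled check; a sketch, assuming per-source counts have already been collected into dicts:

```python
def reconcile(source_counts: dict, destination_counts: dict, tolerance: int = 0) -> list:
    """Compare per-source row counts; return (source, gap) pairs whose
    absolute gap exceeds the tolerance."""
    alerts = []
    for source, expected in source_counts.items():
        actual = destination_counts.get(source, 0)
        if abs(expected - actual) > tolerance:
            alerts.append((source, expected - actual))
    return alerts
```

A small nonzero tolerance avoids paging on rows that are merely in flight between the source count and the destination count.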
Dead Letter Queues
Events that fail transformation or conflict resolution should be routed to a dead letter queue rather than dropped silently. Monitor the DLQ size per source and investigate any growth promptly. A healthy multi-source pipeline has an empty or near-empty DLQ.
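The routing pattern is simple to sketch: wrap the transform, catch failures, and preserve both the event and the error (here the DLQ is just a list; in practice it would be a Kafka topic or table):

```python
def process_with_dlq(events, transform, dlq: list):
    """Apply `transform` to each event; failures are routed to the DLQ
    with the error attached instead of being dropped silently."""
    out = []
    for event in events:
        try:
            out.append(transform(event))
        except Exception as exc:
            dlq.append({"event": event, "error": str(exc)})
    return out
```

Keeping the original event alongside the error makes replay possible once the transformation bug is fixed.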
Getting Started
Multi-source CDC is not inherently more difficult than single-source CDC. It is the same fundamental process repeated across sources with an alignment layer in between. The complexity comes from the details: schema differences, identity resolution, ordering semantics, and monitoring at scale.
The teams that succeed with multi-source merges follow a consistent playbook. They standardize on composite keys early. They inject source metadata from day one. They monitor each source independently. And they use a platform that treats multi-source as a first-class pattern rather than an afterthought.
Streamkap supports running multiple source connectors across PostgreSQL, MySQL, MongoDB, DynamoDB, and more, each with built-in transformations for schema alignment and metadata injection, writing to a single destination table in Snowflake, BigQuery, Databricks, or ClickHouse. If you are building a multi-source pipeline, start a free trial and configure your first merge in minutes.