MongoDB to Snowflake: Real-Time Document Sync
Stream MongoDB document changes to Snowflake in real time using CDC. Learn how to flatten nested documents, handle schema-on-read data, and build a reliable sync pipeline.
MongoDB is one of the most popular databases for building applications. Its document model gives developers the flexibility to iterate quickly, store deeply nested objects, and evolve schemas without migrations. But at some point, the data trapped inside those documents needs to get out. Analysts want to run SQL queries. Data scientists need to join order data with marketing attribution. Finance needs to reconcile revenue figures. And the destination for all of this, more often than not, is Snowflake.
The MongoDB-to-Snowflake pipeline is one of the most common data integration patterns in modern stacks. The challenge is that these two systems speak fundamentally different languages. MongoDB stores flexible, nested BSON documents. Snowflake expects structured, columnar data. Bridging that gap in real time, without losing data or mangling schemas, is the core problem this guide addresses.
MongoDB CDC Mechanisms
MongoDB provides two primary mechanisms for capturing changes: Change Streams and oplog tailing. Understanding the difference matters because it determines how resilient and maintainable your pipeline will be.
Change Streams
Change Streams, introduced in MongoDB 3.6, are the recommended approach. They provide an application-level API built on top of MongoDB’s internal oplog. You open a stream against a collection, database, or entire deployment, and MongoDB delivers structured change events as they happen.
const pipeline = [];
const changeStream = db.collection('orders').watch(pipeline, {
  fullDocument: 'updateLookup'
});

changeStream.on('change', (event) => {
  console.log(event.operationType); // insert, update, replace, delete
  console.log(event.fullDocument);  // the complete document after the change
  console.log(event._id);           // resume token
});
Change Streams require a replica set or sharded cluster. Standalone MongoDB instances do not support them. If you are on MongoDB Atlas, this requirement is already satisfied. For self-hosted deployments, even a single-node replica set enables Change Streams, though production clusters typically run at least three members for fault tolerance.
Oplog Tailing
Before Change Streams existed, the only option was to read directly from the oplog, a capped collection in the local database that records every write operation. Oplog tailing gives you lower-level access but requires you to handle resumability, filtering, and event parsing yourself.
const oplog = db.getSiblingDB('local').getCollection('oplog.rs');
const cursor = oplog.find({
  ts: { $gt: lastTimestamp },
  ns: 'mydb.orders'
}).tailable({ awaitData: true }); // mongosh tailable-cursor API
Most modern CDC tools, including Debezium and managed platforms, now use Change Streams rather than raw oplog tailing. Change Streams handle connection failures, resume tokens, and event structuring automatically, which eliminates an entire class of bugs you would otherwise have to solve yourself.
Architecture Patterns
There are three common architectures for moving MongoDB data to Snowflake in real time. The right choice depends on your latency requirements, existing infrastructure, and operational appetite.
Direct CDC to Snowflake
The simplest architecture reads Change Streams and writes directly to Snowflake using Snowpipe Streaming or the Snowflake Ingest SDK. There is no intermediate message broker. This approach has fewer moving parts and lower latency, but it tightly couples your source and destination. If Snowflake is unavailable, the pipeline must buffer events or risk data loss.
CDC via Kafka
The most common production architecture places Apache Kafka (or a Kafka-compatible system like Redpanda) between MongoDB and Snowflake. A Kafka Connect source connector reads Change Streams and publishes events to Kafka topics. A separate sink connector reads from those topics and loads data into Snowflake.
MongoDB → Change Streams → Kafka Connect Source → Kafka → Kafka Connect Sink → Snowflake
Kafka provides durable buffering, replayability, and the ability to fan out the same change events to multiple destinations. The trade-off is operational complexity. You are now managing Kafka brokers, connectors, schemas, and consumer lag.
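As a rough sketch, the source side of that chain is usually driven by a Kafka Connect connector definition. A Debezium MongoDB source config might look like this; the connection string, connector name, and topic prefix are placeholders:

```json
{
  "name": "mongodb-orders-source",
  "config": {
    "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
    "mongodb.connection.string": "mongodb://replica-host:27017/?replicaSet=rs0",
    "topic.prefix": "mydb",
    "collection.include.list": "mydb.orders",
    "capture.mode": "change_streams_update_full"
  }
}
```

The `change_streams_update_full` capture mode corresponds to the `fullDocument: 'updateLookup'` behavior discussed earlier, so update events carry the complete document rather than a delta.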
Managed CDC Platform
Managed platforms like Streamkap abstract the entire pipeline. You configure a MongoDB source and a Snowflake destination through a UI, and the platform handles Change Stream management, event buffering, document transformation, and Snowflake loading. This approach delivers the lowest operational overhead while maintaining the durability guarantees of the Kafka-based architecture under the hood.
Document-to-Table Mapping
The fundamental tension in a MongoDB-to-Snowflake pipeline is mapping flexible documents to rigid tables. MongoDB documents can have different fields from one document to the next. Snowflake tables have fixed columns. There are three strategies for resolving this.
VARIANT Columns
Snowflake’s VARIANT data type can store entire JSON documents. You load each MongoDB document as a single VARIANT column and query it using Snowflake’s semi-structured data functions.
CREATE TABLE raw_orders (
  _id STRING,
  document VARIANT,
  _streamkap_ts_ms TIMESTAMP
);

-- Query nested fields directly
SELECT
  document:customer.name::STRING AS customer_name,
  document:total::NUMBER(10,2) AS order_total,
  document:items[0].sku::STRING AS first_item_sku
FROM raw_orders;
This approach preserves full document fidelity and handles schema changes gracefully because you never need to alter the table. The downside is that queries on VARIANT data are slower than queries on native columns, and analysts need to know the document structure to write correct queries.
Full Flattening
The opposite approach is to flatten every top-level field into its own Snowflake column at ingestion time. A document like { _id: "abc", status: "shipped", total: 99.50 } becomes a row with columns _id, status, and total.
This makes the data immediately queryable with standard SQL and works well when your documents have a relatively stable schema. But it breaks down when documents are deeply nested or when different documents in the same collection have wildly different fields.
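The flattening step itself can be sketched in a few lines. This is an illustrative transform, not a particular tool's implementation; note that it deliberately keeps only scalar fields:

```javascript
// Flatten the top-level scalar fields of a document into a column map.
// Nested objects and arrays are skipped here (Date values are objects
// too and would need their own handling in a real pipeline).
function flattenTopLevel(doc) {
  const row = {};
  for (const [key, value] of Object.entries(doc)) {
    if (value !== null && typeof value === 'object') continue; // skip nested
    row[key] = value;
  }
  return row;
}

const row = flattenTopLevel({ _id: 'abc', status: 'shipped', total: 99.5, items: [1, 2] });
// row = { _id: 'abc', status: 'shipped', total: 99.5 }
```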
Hybrid Approach
The pragmatic middle ground is to flatten well-known, frequently queried fields into columns while also storing the full document as a VARIANT column. This gives analysts fast access to common fields and a fallback for everything else.
CREATE TABLE orders (
  _id STRING,
  status STRING,
  total NUMBER(10,2),
  customer_name STRING,
  raw_document VARIANT,
  _streamkap_ts_ms TIMESTAMP
);
Handling Nested Arrays and Objects
MongoDB documents regularly contain arrays of embedded objects, such as an order with a list of line items. Getting this data into Snowflake in a queryable form requires special handling.
LATERAL FLATTEN
Snowflake’s LATERAL FLATTEN function explodes an array inside a VARIANT column into rows. Each element of the array becomes a separate row, joined with the parent fields.
SELECT
  o.document:_id::STRING AS order_id,
  o.document:customer.name::STRING AS customer_name,
  item.value:sku::STRING AS sku,
  item.value:quantity::NUMBER AS quantity,
  item.value:price::NUMBER(10,2) AS price
FROM raw_orders o,
  LATERAL FLATTEN(input => o.document:items) item;
This is powerful but comes with a caveat: array explosion. If an order has 10 items, one document produces 10 rows. Analysts need to be aware of this when aggregating to avoid double-counting parent-level fields. Wrapping aggregations at the item level before joining back to the order level is the standard pattern.
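The aggregate-then-join pattern described above might look like this, using the same illustrative table and field names:

```sql
-- Aggregate at the item level first, then join back to the order
-- level, so parent fields like order totals appear exactly once.
WITH item_totals AS (
  SELECT
    o.document:_id::STRING AS order_id,
    SUM(item.value:quantity::NUMBER) AS total_units
  FROM raw_orders o,
    LATERAL FLATTEN(input => o.document:items) item
  GROUP BY 1
)
SELECT
  o.document:_id::STRING AS order_id,
  o.document:total::NUMBER(10,2) AS order_total, -- one row per order, no double-counting
  i.total_units
FROM raw_orders o
JOIN item_totals i ON i.order_id = o.document:_id::STRING;
```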
Pre-Materialized Views
For frequently queried array data, creating materialized views or separate tables for the nested arrays simplifies downstream analytics. You can use Snowflake tasks or streams to maintain a line_items table that is always in sync with the orders table.
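A hedged sketch of that stream/task pair follows; the warehouse name and schedule are placeholders, and deduplication of updated orders is omitted for brevity:

```sql
-- Stream tracks new and changed rows in the raw orders table
CREATE OR REPLACE STREAM orders_stream ON TABLE raw_orders;

-- Task periodically explodes the items array into a line_items table
CREATE OR REPLACE TASK refresh_line_items
  WAREHOUSE = my_wh          -- placeholder warehouse name
  SCHEDULE = '1 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
INSERT INTO line_items (order_id, sku, quantity, price)
SELECT
  s.document:_id::STRING,
  item.value:sku::STRING,
  item.value:quantity::NUMBER,
  item.value:price::NUMBER(10,2)
FROM orders_stream s,
  LATERAL FLATTEN(input => s.document:items) item;

ALTER TASK refresh_line_items RESUME;
```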
Schema Evolution
MongoDB’s schema-on-read model means documents in the same collection can have different shapes. A document written in 2023 might have a shipping_address field as a string, while a 2025 document has it as a nested object with street, city, and zip fields. Your pipeline needs to handle this gracefully.
New Fields
When a new field appears in a MongoDB document, the pipeline must decide whether to add a new column to the Snowflake table or simply let it land in the VARIANT column. Automatically adding columns is convenient but can lead to tables with hundreds of sparse columns over time.
Managed CDC platforms typically handle this automatically. Streamkap, for example, detects new fields in the incoming documents and can add corresponding columns to the target Snowflake table without manual intervention or pipeline restarts.
Type Changes
Type changes are harder. If a field changes from a string to an integer, Snowflake will reject the row unless the column type can accommodate both. The safest approaches are: (1) use STRING columns for fields with unstable types, (2) use VARIANT columns for those fields, or (3) implement a transformation layer that coerces types before loading.
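Option (3) can be sketched as a small coercion step applied to each row before loading. The field names and target types here are illustrative:

```javascript
// Coerce fields with unstable types to the type the Snowflake column
// expects: 'total' is forced to a number, 'orderNumber' to a string.
const coercions = {
  total: (v) => (v === null || v === undefined ? null : Number(v)),
  orderNumber: (v) => (v === null || v === undefined ? null : String(v)),
};

function coerceRow(row) {
  const out = { ...row };
  for (const [field, coerce] of Object.entries(coercions)) {
    if (field in out) out[field] = coerce(out[field]);
  }
  return out;
}

coerceRow({ orderNumber: 1042, total: '99.50' });
// → { orderNumber: '1042', total: 99.5 }
```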
Updates and Deletes
How your pipeline handles updates and deletes has a direct impact on data accuracy in Snowflake.
Full Document vs. Delta Updates
MongoDB Change Streams can deliver updates in two modes. By default, an update event contains only the fields that changed (the delta). With fullDocument: 'updateLookup', MongoDB performs a read-back against the current state of the document and includes the full document in the change event.
Full document mode is almost always preferable for Snowflake pipelines. It means your sink can simply overwrite the existing row with the complete current state of the document, rather than applying partial updates. The trade-off is an additional read against MongoDB for each update event, but for most workloads this overhead is negligible.
Delete Handling
When a document is deleted in MongoDB, the change event includes the document’s _id but not the full document. Your pipeline needs to decide how to represent this in Snowflake:
- Hard delete: Remove the row from Snowflake. This keeps the table clean but loses the audit trail.
- Soft delete: Set a _deleted flag or _deleted_at timestamp on the row. This preserves history and is generally preferred for analytics.
-- Soft delete pattern
UPDATE orders
SET _deleted = TRUE, _deleted_at = CURRENT_TIMESTAMP()
WHERE _id = '64a7f2e1b940149be323bb6a';
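If change events are first staged in Snowflake, upserts and soft deletes can be applied together in a single MERGE. This is a sketch; the staged_changes table, its operation column, and the column list are illustrative:

```sql
MERGE INTO orders t
USING staged_changes s
  ON t._id = s._id
WHEN MATCHED AND s.operation = 'delete' THEN UPDATE SET
  t._deleted = TRUE,
  t._deleted_at = CURRENT_TIMESTAMP()
WHEN MATCHED THEN UPDATE SET
  t.status = s.status,
  t.total = s.total,
  t._deleted = FALSE
WHEN NOT MATCHED AND s.operation <> 'delete' THEN
  INSERT (_id, status, total, _deleted)
  VALUES (s._id, s.status, s.total, FALSE);
```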
Practical Example: Orders Collection to Snowflake
Consider a MongoDB orders collection with documents like this:
{
  "_id": ObjectId("64a7f2e1b940149be323bb6a"),
  "orderNumber": "ORD-2026-1042",
  "customer": {
    "name": "Acme Corp",
    "email": "procurement@acme.com",
    "tier": "enterprise"
  },
  "items": [
    { "sku": "WIDGET-A", "quantity": 50, "price": 12.99 },
    { "sku": "GADGET-B", "quantity": 10, "price": 49.99 }
  ],
  "status": "processing",
  "total": 1149.40,
  "createdAt": ISODate("2026-02-25T14:30:00Z")
}
A typical pipeline would:
- Capture: Read Change Streams with fullDocument: 'updateLookup' enabled.
- Transform: Flatten customer.name, customer.email, status, total, and createdAt into columns. Store the full document as a VARIANT column.
- Load: Insert or merge into the Snowflake orders table using _id as the primary key.
- Explode arrays: Maintain a separate order_items table using LATERAL FLATTEN or a Snowflake stream/task pair.
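The capture-and-transform steps could be sketched as a single function that turns a Change Stream event into a row for the hybrid orders table. The event shape follows the Change Streams API; the output column names are illustrative:

```javascript
// Turn a Change Stream event into a row for the hybrid orders table.
function toOrdersRow(event) {
  if (event.operationType === 'delete') {
    // Delete events carry only the document key, so emit a soft-delete row.
    return { _id: String(event.documentKey._id), _deleted: true };
  }
  const doc = event.fullDocument; // requires fullDocument: 'updateLookup'
  return {
    _id: String(doc._id),
    order_number: doc.orderNumber,
    customer_name: doc.customer?.name ?? null,
    customer_email: doc.customer?.email ?? null,
    status: doc.status,
    total: doc.total,
    created_at: doc.createdAt,
    raw_document: doc, // lands in the VARIANT column
    _deleted: false,
  };
}
```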
The resulting Snowflake tables look like this:
-- Main orders table
SELECT order_number, customer_name, status, total, created_at
FROM orders
WHERE status = 'processing';
-- Line items via LATERAL FLATTEN on raw document
SELECT
  o._id AS order_id,
  item.value:sku::STRING AS sku,
  item.value:quantity::NUMBER AS qty,
  item.value:price::NUMBER(10,2) AS unit_price
FROM orders o,
  LATERAL FLATTEN(input => o.raw_document:items) item
WHERE o.order_number = 'ORD-2026-1042';
Self-Managed vs. Managed
Building and operating a MongoDB-to-Snowflake CDC pipeline yourself is possible, but the effort compounds quickly.
A self-managed pipeline requires you to:
- Deploy and maintain Kafka (or buffer events in another durable store)
- Configure and monitor a Debezium MongoDB source connector
- Build and maintain document flattening logic
- Handle Change Stream resume token persistence
- Manage Snowflake schema migrations when documents evolve
- Monitor connector health, consumer lag, and data freshness
- Handle error recovery, dead-letter queues, and data reconciliation
Each of these is a solvable problem individually, but together they represent a significant ongoing engineering commitment. Teams that start with a self-managed approach frequently report spending 20-40 hours per month on pipeline maintenance.
A managed platform like Streamkap reduces this to configuration. You point it at your MongoDB replica set and your Snowflake account, configure flattening and schema evolution preferences, and the platform handles everything else, including Change Stream management, resume token persistence, document transformation, Snowflake loading via Snowpipe Streaming, and automatic schema evolution.
Performance Considerations
Getting the pipeline to work is one thing. Getting it to perform well at scale requires attention to several details.
Change Stream Resume Tokens
Every change event includes a resume token that allows the stream to restart from exactly where it left off after a failure. Your pipeline must persist these tokens durably. If you lose a resume token and the relevant oplog entries have been overwritten (the oplog is a capped collection with limited retention), you will need to re-snapshot the entire collection. For high-throughput collections, this can take hours.
Batch Size and Flush Intervals
Loading data into Snowflake row-by-row is prohibitively expensive. CDC pipelines must batch events and flush them at configurable intervals. A typical starting point is to flush every 10 seconds or every 10,000 records, whichever comes first. Tuning these values is a balance between latency (how quickly changes appear in Snowflake) and cost (warehouse compute is billed while it runs, so fewer, larger loads that let the warehouse suspend between batches are cheaper).
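The dual-threshold flush logic might be sketched like this, with the thresholds from the starting point above; loadBatch stands in for the actual Snowflake write:

```javascript
// Buffer change events and flush when either threshold is hit:
// maxRecords events accumulated, or maxMs elapsed since the last flush.
class Batcher {
  constructor(loadBatch, { maxRecords = 10000, maxMs = 10000 } = {}) {
    this.loadBatch = loadBatch; // e.g. a Snowpipe Streaming write
    this.maxRecords = maxRecords;
    this.buffer = [];
    this.timer = setInterval(() => this.flush(), maxMs);
  }
  add(event) {
    this.buffer.push(event);
    if (this.buffer.length >= this.maxRecords) this.flush();
  }
  flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.loadBatch(batch);
  }
  close() {
    clearInterval(this.timer);
    this.flush(); // drain anything left on shutdown
  }
}
```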
Snowflake Warehouse Sizing
For continuous CDC workloads, a Snowflake X-Small warehouse is usually sufficient for moderate throughput (up to tens of thousands of events per second). The warehouse only needs to be active during data loading. If you are using Snowpipe Streaming, Snowflake manages compute automatically and charges based on the volume of data loaded, which is often more cost-effective for steady CDC streams.
MongoDB Read Preference
When using fullDocument: 'updateLookup', each update event triggers a read against MongoDB. For high-update workloads, direct these reads to a secondary replica by setting the appropriate read preference on the Change Stream. This keeps the load off your primary node and avoids impacting application performance.
const changeStream = db.collection('orders').watch([], {
  fullDocument: 'updateLookup',
  readPreference: 'secondaryPreferred'
});
Oplog Sizing
The oplog must be large enough to retain events for the duration of any potential pipeline downtime. If your pipeline goes down for maintenance and the oplog wraps around before it resumes, you lose events. For production pipelines, size the oplog to retain at least 24-72 hours of operations and monitor oplog window size as a key metric.
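In mongosh, the current oplog window can be checked, and the oplog resized (size in megabytes), roughly like this:

```javascript
// Show the time span between the first and last oplog entries
rs.printReplicationInfo();

// Resize the oplog on a replica set member (size in MB)
db.adminCommand({ replSetResizeOplog: 1, size: 16384 });
```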
Getting MongoDB data into Snowflake in real time is not just about connecting two systems. It is about translating between two fundamentally different data models while maintaining correctness, handling schema drift, and operating reliably at scale. Whether you build it yourself or use a managed platform, understanding these mechanics gives you the foundation to make the right architectural choices for your data stack.