MongoDB Change Data Capture: A Complete Guide to Real-Time CDC
Learn how to implement MongoDB Change Data Capture (CDC) for real-time streaming. Covers Change Streams, replica sets, Atlas setup, and managed CDC solutions.
MongoDB’s document model is fantastic for building applications. Its flexible schemas, nested documents, and rich query language make it one of the most popular databases in the world. But when you need to stream those document changes out of MongoDB and into other systems in real-time—a data warehouse, an analytics platform, a search index—things get harder than you might expect.
The challenge isn’t just reading data from MongoDB. It’s capturing every insert, update, replace, and delete as it happens, in the correct order, without hammering your production database with constant queries. This is the problem that MongoDB Change Data Capture (CDC) solves, and it’s become an essential capability for any modern data architecture built on MongoDB.
What Is MongoDB Change Data Capture?
At its core, Change Data Capture is the process of identifying and capturing changes made to data in a database, then delivering those changes to downstream systems in real-time. Instead of running expensive batch queries to figure out what’s different since the last check, CDC gives you a continuous, ordered stream of every modification.
For MongoDB specifically, CDC means capturing document-level operations—inserts, updates, replaces, and deletes—from your collections and streaming them out as discrete events. Each event tells you exactly what changed, when it changed, and what the document looks like now.
The Old Way: Polling for Changes
Before MongoDB had native CDC support, teams resorted to polling. The idea was straightforward but painful: run a query against your collection every few seconds or minutes, compare the results to what you saw last time, and figure out the differences.
// The polling anti-pattern — don't do this in production
const lastChecked = new Date(Date.now() - 60000); // 1 minute ago
const changes = db.orders.find({
updatedAt: { $gte: lastChecked }
});
This approach has deep, fundamental problems. First, it only works if every document has a reliable updatedAt field that your application consistently maintains—and that’s a big assumption. Second, it completely misses deletes. Once a document is gone, there’s nothing left to query. Third, the polling interval creates an inherent delay. If you poll every 60 seconds, your downstream systems are always at least a minute behind reality. And the more frequently you poll to close that gap, the more load you pile onto your production database.
Polling is a lose-lose trade-off between data freshness and database performance. It’s the reason MongoDB built something much better.
The Modern Way: Change Streams
MongoDB Change Streams are the native, purpose-built mechanism for CDC in MongoDB. Introduced in MongoDB 3.6 and significantly enhanced in 4.0 and later versions, Change Streams give you a real-time, ordered feed of document-level changes without any of the drawbacks of polling.
Change Streams work by tapping into MongoDB’s oplog (operations log)—the internal journal that MongoDB already maintains for replication between replica set members. Because the oplog exists regardless of whether you’re using CDC, reading from it adds virtually zero overhead to your database.
The difference between polling and Change Streams is like the difference between constantly refreshing a web page to check for new emails versus having push notifications enabled. One is wasteful and always behind; the other is instant and effortless.
How MongoDB Change Streams Work
To really understand MongoDB CDC, you need to understand the oplog and how Change Streams are built on top of it. This isn’t just academic knowledge—it directly affects how you configure, monitor, and troubleshoot your CDC pipelines.
The Oplog: MongoDB’s Transaction Journal
The oplog (short for operations log) is a special capped collection that lives in the local database of every replica set member. Every write operation that modifies data—whether it’s an insert, update, or delete—gets recorded in the oplog as an idempotent operation.
MongoDB uses the oplog primarily for replication. When you write to the primary node, that operation is recorded in the primary’s oplog. Secondary nodes then continuously read from the primary’s oplog and apply those same operations to stay in sync. It’s an elegant system, and Change Streams essentially let your application act as another consumer of this same oplog stream.
You can inspect the oplog directly to understand what it looks like:
// View recent oplog entries
use local
db.oplog.rs.find().sort({ $natural: -1 }).limit(5).pretty()
Each oplog entry contains a timestamp, the operation type (i for insert, u for update, d for delete), the namespace (database and collection), and the actual data or modification.
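To make that concrete, here is a simplified oplog entry for an insert, modeled as a plain JavaScript object. The field layout varies somewhat by MongoDB version, and a real `ts` is a BSON Timestamp rather than a plain object, so treat this as an illustrative sketch:

```javascript
// A simplified oplog entry for an insert. Field names follow the common
// oplog layout (ts, op, ns, o); exact fields vary by MongoDB version.
const oplogEntry = {
  ts: { t: 1770719400, i: 1 }, // operation timestamp: seconds + increment
  op: "i",                     // "i" = insert, "u" = update, "d" = delete
  ns: "sales.orders",          // namespace: database.collection
  o: {                         // the inserted document itself
    _id: "6554...",
    customerId: "cust-12345",
    status: "active"
  }
};

// The namespace splits into database and collection
const [dbName, collName] = oplogEntry.ns.split(".");
console.log(dbName, collName, oplogEntry.op); // → sales orders i
```

Change Streams save you from parsing this format yourself; they translate entries like this into the structured change events shown in the next section.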
Change Events and Their Structure
When you open a Change Stream, MongoDB doesn’t hand you raw oplog entries. Instead, it translates them into well-structured change events that are much easier to work with. Here’s what a typical change event looks like for an insert operation:
{
"_id": { "_data": "8263..." },
"operationType": "insert",
"clusterTime": Timestamp(1770719400, 1),
"fullDocument": {
"_id": ObjectId("6554..."),
"customerId": "cust-12345",
"product": "Enterprise Plan",
"amount": 299.00,
"status": "active",
"createdAt": ISODate("2026-02-10T10:30:00Z")
},
"ns": {
"db": "sales",
"coll": "orders"
},
"documentKey": { "_id": ObjectId("6554...") }
}
The operationType field tells you what happened. MongoDB Change Streams support several operation types: insert, update, replace, delete, drop, rename, dropDatabase, and invalidate. For update operations, you get an updateDescription field that shows exactly which fields were modified:
{
"operationType": "update",
"updateDescription": {
"updatedFields": { "status": "shipped", "shippedAt": ISODate("2026-02-10T14:00:00Z") },
"removedFields": [],
"truncatedArrays": []
},
"fullDocument": { ... },
"documentKey": { "_id": ObjectId("6554...") }
}
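A consumer typically branches on operationType to decide what to do downstream. The router below is a hypothetical sketch (the handler semantics and simplified event shapes are assumptions for illustration, not part of any driver API), showing how the fields in the examples above map to destination actions:

```javascript
// Hypothetical consumer-side router over operationType.
// Event shapes follow the examples above.
function describeChange(event) {
  switch (event.operationType) {
    case "insert":
    case "replace":
      // fullDocument is always present for inserts and replaces
      return { action: "upsert", key: event.documentKey._id, doc: event.fullDocument };
    case "update":
      // updateDescription carries only the delta
      return { action: "patch", key: event.documentKey._id,
               fields: event.updateDescription.updatedFields };
    case "delete":
      // no fullDocument for deletes: only the document key survives
      return { action: "remove", key: event.documentKey._id };
    default:
      // drop, rename, dropDatabase, invalidate are stream-level events
      return { action: "control", type: event.operationType };
  }
}

const evt = { operationType: "update", documentKey: { _id: "6554" },
              updateDescription: { updatedFields: { status: "shipped" } } };
console.log(describeChange(evt).action); // → patch
```

Note that delete events carry only the documentKey, which is why destinations that need the deleted document's contents rely on pre-images (covered next).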
Setting the fullDocument option to "updateLookup" tells MongoDB to include the complete current state of the document with every update event—not just the delta. This is incredibly useful when your downstream systems need the full picture, not just the changed fields. MongoDB 6.0 went even further, introducing the fullDocumentBeforeChange option, giving you both the “before” and “after” states of a document on every change—essential for audit trails and certain analytical workloads.
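In driver code, both options are passed to watch(). A minimal sketch of assembling the options object (the option names and values are MongoDB's; note that per MongoDB's documentation, pre-images also require enabling changeStreamPreAndPostImages on the collection via collMod):

```javascript
// Build the options object passed to collection.watch().
// "updateLookup" and "whenAvailable" are documented MongoDB values;
// "whenAvailable" returns the pre-image only if it was recorded.
function watchOptions({ wantFullDocument = false, wantPreImage = false } = {}) {
  const opts = {};
  if (wantFullDocument) opts.fullDocument = "updateLookup";
  if (wantPreImage) opts.fullDocumentBeforeChange = "whenAvailable";
  return opts;
}

console.log(watchOptions({ wantFullDocument: true, wantPreImage: true }));
// → object with both fullDocument and fullDocumentBeforeChange set
```

In mongosh, enabling pre-images on a collection looks like db.runCommand({ collMod: "orders", changeStreamPreAndPostImages: { enabled: true } }).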
Resume Tokens: Your Bookmark in the Stream
One of the most important features of Change Streams is the resume token. Every change event includes a unique _id field (shown as _data in the examples above) that acts as a bookmark in the oplog. If your application crashes, loses its connection, or simply needs to restart, it can pass this resume token back to MongoDB and pick up the stream exactly where it left off.
// Opening a Change Stream with a resume token
const resumeToken = loadSavedResumeToken(); // from your checkpoint store
const changeStream = db.collection("orders").watch([], {
resumeAfter: resumeToken
});
This mechanism is what makes MongoDB CDC reliable and durable. You don’t lose events because of a network hiccup or a scheduled restart. As long as the oplog still contains the events from your resume point forward, you can always catch up. This is one reason why oplog sizing matters—a topic we’ll cover in the best practices section.
Prerequisites for MongoDB CDC
Before you can start capturing changes from MongoDB, there are a few hard requirements and important considerations to be aware of.
Replica Set or Sharded Cluster (Non-Negotiable)
This is the single most important prerequisite: Change Streams only work on replica sets or sharded clusters. A standalone MongoDB instance does not maintain an oplog, so there’s nothing for Change Streams to read from.
If you’re running MongoDB locally for development as a standalone instance, you’ll need to convert it to a single-node replica set:
// Convert standalone to a single-node replica set
// In mongosh:
rs.initiate({
_id: "rs0",
members: [{ _id: 0, host: "localhost:27017" }]
})
In production, you should always be running a replica set anyway for high availability. If you’re on MongoDB Atlas, every cluster is already configured as a replica set, so this requirement is handled for you automatically.
MongoDB Version Requirements
While Change Streams were introduced in MongoDB 3.6, you’ll want to be on MongoDB 4.0 or later for production CDC. Here’s why:
- MongoDB 3.6: Basic Change Streams support. You can watch individual collections only, though fullDocument: "updateLookup" is already available for update events.
- MongoDB 4.0: Added the ability to watch an entire database or the whole deployment, plus support for Change Streams on sharded clusters without restrictions.
- MongoDB 4.2: Improved pipeline support within Change Streams.
- MongoDB 6.0: Added fullDocumentBeforeChange pre-image support, letting you capture the document state before a change occurred.
- MongoDB 7.0+: Further performance optimizations and enhanced change stream resilience.
For most CDC use cases, MongoDB 4.0 is the practical minimum, and 6.0+ is ideal if you need before-and-after document images.
User Permissions
The MongoDB user account your CDC tool connects with needs specific privileges. At a minimum, you need:
- changeStream privilege on the target databases or collections
- find privilege on the target databases or collections (required for fullDocument: "updateLookup")
For a dedicated CDC user, you’d create a role like this:
// Create a dedicated CDC user with the necessary privileges
use admin
db.createRole({
role: "cdcReader",
privileges: [
{
resource: { db: "sales", collection: "" },
actions: ["changeStream", "find"]
}
],
roles: []
});
db.createUser({
user: "cdc_service",
pwd: "secure_password_here",
roles: [{ role: "cdcReader", db: "admin" }]
});
If you want to capture changes across all databases, use { db: "", collection: "" } as the resource.
MongoDB Atlas Requirements
If you’re running on MongoDB Atlas—which is the most common production setup—there are a couple of additional considerations:
- Cluster tier: You need an M10 or higher cluster. The free tier (M0) and shared tiers (M2/M5) don’t support the connection types needed for CDC connectors.
- Network access: Your CDC tool needs to be able to reach your Atlas cluster. This means configuring IP allowlists or setting up VPC peering / Private Link.
- Regions: All Atlas regions are supported by managed CDC platforms like Streamkap, so you don’t need to worry about geographic restrictions.
MongoDB CDC Architecture Options
When you’re ready to implement MongoDB CDC, the architecture broadly falls into two categories: a self-managed pipeline using open-source tools, or a fully managed platform that handles the infrastructure for you.
In both cases, the data flow follows a similar pattern:
MongoDB (Change Streams) → CDC Connector → Message Broker or Direct Delivery → Destination (Snowflake, ClickHouse, Kafka, etc.)
The connector opens a Change Stream against your MongoDB deployment, receives change events, transforms them into a suitable format, and delivers them downstream. The key difference between the two approaches is who builds, operates, and maintains that connector and everything around it.
Let’s look at both options in detail so you can make an informed decision.
Option 1: DIY CDC with Kafka Connect and Debezium
The most common self-managed approach uses the MongoDB Kafka Connector or Debezium’s MongoDB connector running inside Kafka Connect, with Apache Kafka as the message broker. This is a proven architecture, but it comes with substantial operational overhead.
Components You’ll Need
Before you write a single line of configuration, you’ll need to provision and manage all of the following:
- Apache Kafka cluster (3+ brokers for production)
- Apache ZooKeeper (or KRaft for newer Kafka versions)
- Kafka Connect workers (at least 2 for high availability)
- Schema Registry (for managing Avro or JSON schemas)
- Monitoring stack (Prometheus, Grafana, or similar)
- The MongoDB connector plugin installed on your Kafka Connect workers
That’s a lot of moving parts. Each one needs to be provisioned, configured, monitored, patched, and scaled independently. For a team with deep Kafka expertise, this is manageable. For everyone else, it’s a significant distraction from actual product work.
Setting Up the Debezium MongoDB Connector
Assuming you have the Kafka infrastructure in place, you’d deploy the Debezium MongoDB connector by posting a JSON configuration to the Kafka Connect REST API:
{
"name": "mongodb-cdc-connector",
"config": {
"connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
"mongodb.connection.string": "mongodb+srv://cdc_service:password@cluster0.example.mongodb.net",
"topic.prefix": "mongodb",
"collection.include.list": "sales.orders,sales.customers",
"capture.mode": "change_streams_update_full",
"snapshot.mode": "initial",
"mongodb.ssl.enabled": "true",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false"
}
}
You’d then deploy this to Kafka Connect:
curl -X POST http://localhost:8083/connectors \
-H "Content-Type: application/json" \
-d @mongodb-cdc-connector.json
Once deployed, the connector performs an initial snapshot of your specified collections, then transitions to reading from Change Streams for ongoing changes. Each change event ends up as a message in a Kafka topic like mongodb.sales.orders.
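Debezium derives topic names deterministically from topic.prefix plus the namespace, so you can predict where every collection's events will land. A small sketch of that mapping (the function itself is illustrative, not a Debezium API):

```javascript
// Debezium-style topic naming: <topic.prefix>.<database>.<collection>
function topicFor(prefix, ns) {
  return `${prefix}.${ns.db}.${ns.coll}`;
}

console.log(topicFor("mongodb", { db: "sales", coll: "orders" })); // → mongodb.sales.orders
```

With the configuration above, the collection.include.list entries sales.orders and sales.customers would therefore produce the topics mongodb.sales.orders and mongodb.sales.customers.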
The Hidden Costs of Self-Managing
On paper, this looks clean. In practice, here’s what teams consistently run into:
- Kafka cluster management: Broker rebalancing, partition management, retention policies, disk space monitoring, and upgrades are a constant tax on engineering time.
- Connector failures: When a Debezium connector fails (and it will—network blips, MongoDB failovers, schema changes), someone needs to diagnose the issue, potentially reset offsets, and restart the connector. At 3 AM.
- Schema evolution: MongoDB’s flexible schema is a double-edged sword for CDC. When a document’s structure changes, your downstream consumers need to handle the new shape. Debezium can propagate schema changes, but managing this across a pipeline requires careful planning.
- Scaling: As your data volume grows, you need to add Kafka brokers, repartition topics, and scale Kafka Connect workers. None of this is automatic.
- Monitoring and alerting: You need visibility into connector lag, consumer group offsets, Kafka broker health, and Change Stream resume token freshness. Building this observability layer is a project in itself.
For organizations with a dedicated data platform team and existing Kafka expertise, this approach can work well. But for teams that just want their MongoDB data to show up in Snowflake or ClickHouse without building a distributed systems project, there’s a much simpler path.
Option 2: Managed CDC with Streamkap
A managed CDC platform like Streamkap takes the same underlying technology—MongoDB Change Streams—and wraps it in a fully operated service that eliminates the infrastructure burden entirely.
Instead of provisioning Kafka clusters, deploying connectors, and building monitoring dashboards, you configure a source (MongoDB) and a destination (your data warehouse or streaming platform), and the platform handles everything in between.
How Streamkap Approaches MongoDB CDC
Streamkap connects directly to your MongoDB deployment using native Change Streams. Here’s what that means in practice:
- Zero infrastructure to manage: No Kafka clusters, no Kafka Connect workers, no ZooKeeper. Streamkap runs the entire pipeline as a managed service.
- Zero performance impact on your source database: Because Change Streams read from the oplog—which MongoDB maintains regardless—there’s no additional load on your production database. It’s the same mechanism MongoDB uses for its own internal replication.
- Sub-second latency: Changes flow from MongoDB to your destination in under a second. This isn’t batch processing with a 5-minute window; it’s a genuine real-time stream.
- Automatic schema evolution: MongoDB’s flexible document model means schemas change frequently. Fields get added, nested documents evolve, arrays grow new element shapes. Streamkap handles this automatically, propagating schema changes to your destination without manual intervention or pipeline breakdowns.
- Exactly-once delivery guarantee: Every document change is delivered to your destination exactly once. No duplicates, no missed events.
- Support for MongoDB Atlas M10+ and self-hosted deployments: Whether you’re on Atlas in any region or running MongoDB on your own infrastructure, Streamkap connects to it.
The result is a MongoDB CDC pipeline that takes minutes to set up instead of weeks, costs up to 90% less than traditional ETL approaches, and requires no ongoing infrastructure maintenance from your team. You can explore all of Streamkap’s CDC capabilities to see the full picture.
Streaming MongoDB to Popular Destinations
Once you have MongoDB CDC running, the question becomes: where do you send the data? Here are the most common destinations and what to consider for each.
MongoDB to Snowflake
Snowflake is one of the most popular destinations for MongoDB CDC data, especially for analytics and business intelligence workloads. The challenge has traditionally been getting MongoDB’s nested, document-oriented data into Snowflake’s columnar format without losing fidelity or introducing massive latency.
Streamkap solves this by streaming MongoDB change events directly into Snowflake using Snowpipe Streaming, which eliminates the staging files and batch delays of traditional loading approaches. Documents are flattened into Snowflake-friendly structures automatically, and schema changes in MongoDB are propagated to Snowflake tables without manual DDL work.
For a deep dive into this specific pipeline, including how Dynamic Tables can further reduce costs, check out our guide on streaming MongoDB to Snowflake with Snowpipe Streaming and Dynamic Tables. You can also learn more about the Snowflake destination connector.
MongoDB to ClickHouse
For teams that need blazing-fast analytical queries on high-volume data, ClickHouse is an excellent destination. ClickHouse’s columnar storage and vectorized query execution make it ideal for real-time dashboards and operational analytics.
Streaming MongoDB CDC events to ClickHouse gives you the ability to run sub-second analytical queries on data that’s only seconds old. This combination—MongoDB for your application layer and ClickHouse for your analytical layer—is a powerful pattern for teams that need both flexibility and speed.
MongoDB to Apache Kafka
Sometimes you don’t want to send MongoDB changes directly to a single destination. Instead, you want them published to Apache Kafka so that multiple consumers can subscribe to the same stream of events. This is common in event-driven architectures where different teams or services need access to the same data changes for different purposes.
With Streamkap, you can stream MongoDB CDC events into Kafka topics without managing Kafka Connect yourself. Your downstream consumers—whether they’re microservices, analytics tools, or machine learning pipelines—can then consume from Kafka at their own pace.
MongoDB to Apache Iceberg
For organizations building a data lakehouse architecture, Apache Iceberg is becoming the table format of choice. Streaming MongoDB changes into Iceberg tables gives you the best of both worlds: the real-time freshness of CDC with the cost efficiency and openness of a lakehouse.
You can explore all available destinations on the Streamkap connectors page.
MongoDB CDC Best Practices
Getting MongoDB CDC running is one thing. Keeping it running reliably in production is another. Here are the practices that separate a fragile proof of concept from a production-grade pipeline.
Size Your Oplog Appropriately
The oplog is a capped collection, which means it has a fixed maximum size. Once it fills up, the oldest entries are overwritten. If your CDC consumer falls behind—due to a network outage, a slow destination, or a processing error—and the oplog wraps around past its resume point, your Change Stream will be invalidated. At that point, you’d need to perform a full re-snapshot of your data, which can be expensive and disruptive.
The default oplog size is typically 5% of free disk space, with a minimum of 990 MB and a maximum of 50 GB. For CDC workloads, you should size it based on how long you want to be able to survive a consumer outage:
// Check current oplog size and time range
use local
db.oplog.rs.stats().maxSize // Maximum size in bytes
db.oplog.rs.find().sort({ $natural: 1 }).limit(1) // Oldest entry
db.oplog.rs.find().sort({ $natural: -1 }).limit(1) // Newest entry
A common rule of thumb is to ensure your oplog can hold at least 24-72 hours of write activity. If your application generates a lot of writes, you may need to increase the oplog size:
// Resize oplog to 10 GB (run on each replica set member)
db.adminCommand({ replSetResizeOplog: 1, size: 10240 })
On MongoDB Atlas, you can configure the oplog window in your cluster settings without any downtime.
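To turn the rule of thumb above into a concrete number, multiply your write rate by the retention window you want and an average oplog entry size. The calculator below is a back-of-the-envelope sketch; the 512-byte default is an assumption, so measure the real average on your workload:

```javascript
// Rough oplog size (in MB) needed to retain `hours` of write activity.
// avgEntryBytes is an assumed average oplog entry size; measure yours.
function requiredOplogMB(writesPerSec, hours, avgEntryBytes = 512) {
  const totalBytes = writesPerSec * 3600 * hours * avgEntryBytes;
  return Math.ceil(totalBytes / (1024 * 1024));
}

// 500 writes/sec retained for 48 hours at ~512 bytes per entry:
console.log(requiredOplogMB(500, 48)); // → 42188 (about 41 GB)
```

The result feeds directly into the replSetResizeOplog command shown above, whose size argument is in megabytes.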
Manage Resume Tokens Carefully
Resume tokens are your lifeline for recovering from interruptions. If you’re building a custom consumer (rather than using a managed platform), you need to persist resume tokens durably after each batch of events is successfully processed.
The key principle: save the resume token only after you’ve confirmed the corresponding events have been delivered to your destination. If you save the token too early and your process crashes before delivering the events, those events are lost. If you save it too late, you might reprocess some events—which is usually tolerable with idempotent processing but still wasteful.
// Pattern for durable resume token management
let resumeToken = loadFromCheckpointStore();
const pipeline = [];
const options = resumeToken ? { resumeAfter: resumeToken } : {};
const changeStream = db.collection("orders").watch(pipeline, options);
changeStream.on("change", async (event) => {
await deliverToDestination(event); // 1. Deliver the event
await saveToCheckpointStore(event._id); // 2. Then save the token
resumeToken = event._id;
});
Managed platforms like Streamkap handle resume token management automatically, including edge cases like oplog rollover and MongoDB failovers.
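Because a late-saved token means some events may be replayed after a restart, delivery should be idempotent. One sketch, keyed on the resume token itself (the in-memory Set is a stand-in for a durable store such as a dedupe table at the destination; the class and method names are illustrative):

```javascript
// Idempotent delivery guard: skip events whose resume token (_id._data)
// has already been applied. A real implementation would persist the
// applied-token set durably rather than holding it in memory.
class IdempotentSink {
  constructor() {
    this.applied = new Set();
    this.delivered = [];
  }
  deliver(event) {
    const token = event._id._data;
    if (this.applied.has(token)) return false; // duplicate replay: drop
    this.applied.add(token);
    this.delivered.push(event); // stand-in for the destination write
    return true;
  }
}

const sink = new IdempotentSink();
const evt = { _id: { _data: "8263aa" }, operationType: "insert" };
console.log(sink.deliver(evt), sink.deliver(evt)); // → true false
```

With this in place, replaying a few events after a crash is harmless: duplicates are detected and dropped before they reach the destination.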
Design for MongoDB’s Flexible Schema
Unlike relational databases where schema changes require explicit ALTER TABLE statements, MongoDB documents in the same collection can have completely different structures. One document might have 5 fields while another has 50. A field that’s a string in one document could be a nested object in another.
This flexibility is powerful for application development, but it creates challenges for CDC pipelines that need to deliver data to schema-enforced destinations like Snowflake or ClickHouse. Here are a few practices that help:
- Use schema validation rules in MongoDB to enforce some level of consistency at the collection level. This won’t eliminate schema variation entirely, but it reduces surprises.
- Plan for additive changes: New fields appearing in documents should be handled automatically by your CDC pipeline. Managed platforms like Streamkap detect new fields and add corresponding columns to your destination tables.
- Handle type conflicts carefully: If the same field contains different BSON types across documents (e.g., price is sometimes a number and sometimes a string), your CDC pipeline needs a strategy for this. Some platforms coerce types; others create separate columns.
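A coercion strategy can be sketched in a few lines. This hypothetical helper tries to fit a mixed-type value into a numeric destination column and routes unconvertible values to a string-typed overflow field instead (the field routing is an assumption, one of several reasonable designs):

```javascript
// One strategy for mixed BSON types: coerce to the destination column's
// type when possible, otherwise keep the raw value for an overflow column.
function coerceNumeric(value) {
  if (typeof value === "number") return { ok: true, value };
  if (typeof value === "string") {
    const n = Number(value);
    if (!Number.isNaN(n)) return { ok: true, value: n };
  }
  // unconvertible: preserve the raw value rather than dropping the row
  return { ok: false, raw: value };
}

console.log(coerceNumeric(299));      // → { ok: true, value: 299 }
console.log(coerceNumeric("299.00")); // → { ok: true, value: 299 }
console.log(coerceNumeric("N/A"));    // → { ok: false, raw: 'N/A' }
```

Whichever strategy you pick, apply it consistently: silently mixing coerced and raw values in the same column is how analytical queries go quietly wrong.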
Monitor Your CDC Pipeline
Regardless of whether you’re self-managing or using a managed platform, you should be tracking these metrics:
- Consumer lag: How far behind is your CDC consumer relative to the latest oplog entry? This should be measured in both time and operations.
- Event throughput: How many change events per second is your pipeline processing? A sudden drop could indicate a problem.
- Oplog window: How much historical data does your oplog currently hold? If this shrinks below your acceptable recovery window, you need to either increase oplog size or investigate why write volume has spiked.
- Resume token freshness: How recent is your last saved resume token? A stale token means you’re at risk if the consumer restarts.
- Destination write latency: How quickly are events being written to your destination after being read from MongoDB?
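The first metric, lag in time, falls out of data you already have: every change event carries a clusterTime whose seconds component is the oplog timestamp. A minimal sketch (extracting the seconds from the BSON Timestamp is driver-specific, so the function takes a plain number):

```javascript
// Consumer lag in seconds, from the last processed event's clusterTime.
// clusterTimeSeconds is the seconds component of the BSON Timestamp;
// nowMs defaults to the wall clock.
function consumerLagSeconds(clusterTimeSeconds, nowMs = Date.now()) {
  return Math.max(0, Math.floor(nowMs / 1000) - clusterTimeSeconds);
}

// An event stamped 90 seconds before "now":
const now = 1_699_000_090_000;
console.log(consumerLagSeconds(1_699_000_000, now)); // → 90
```

Alerting on this value against your oplog window (e.g., warn when lag exceeds half the window) gives you time to react before a stream gets invalidated.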
Performance Considerations
One of the biggest concerns teams have when considering MongoDB CDC is whether it will slow down their production database. Let’s address this directly.
Oplog Reading Is Lightweight
Change Streams read from the oplog using a tailable cursor on a capped collection. This is the same mechanism MongoDB’s own replication uses. Adding a CDC consumer is roughly equivalent to adding another secondary replica reader—the impact is typically negligible.
The oplog is already being written to as part of normal MongoDB operations. CDC doesn’t cause additional writes; it only reads what’s already there. This is fundamentally different from polling-based approaches that execute queries against your working data collections and consume CPU, memory, and I/O resources on your primary node.
Connection Pooling and Resource Usage
Each Change Stream cursor maintains a persistent connection to MongoDB. If you’re watching many collections individually, this can add up. A more efficient approach is to watch at the database level or deployment level and filter events in your application:
// Watch all collections in the 'sales' database with a filter
const pipeline = [
{ $match: { "ns.coll": { $in: ["orders", "customers", "products"] } } }
];
const changeStream = db.watch(pipeline);
This uses a single connection and a single oplog tailing cursor, which is much more efficient than opening separate Change Streams for each collection.
Network Bandwidth
For high-volume MongoDB deployments, the change event stream can generate significant network traffic, especially if you’re requesting full documents on every update. Consider these strategies to manage bandwidth:
- Use $match and $project stages in your Change Stream pipeline to filter out events and fields you don’t need.
- If you only need to know what changed (not the full document), skip the fullDocument: "updateLookup" option and work with the updateDescription delta instead.
- Ensure your CDC consumer is in the same region or availability zone as your MongoDB deployment to minimize latency and egress costs.
Handling High-Write Workloads
If your MongoDB deployment handles thousands of writes per second, your CDC pipeline needs to keep pace. Key considerations include:
- Parallelism: For sharded clusters, Change Streams can be consumed per-shard, allowing parallel processing.
- Batching: Rather than processing events one at a time, batch them before writing to your destination. This dramatically improves throughput for destinations like Snowflake or ClickHouse that are optimized for bulk inserts.
- Backpressure: Your pipeline should handle backpressure gracefully. If the destination is temporarily slow, the pipeline should buffer events rather than dropping them or crashing.
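The batching point can be sketched in a few lines. This is a size-bounded buffer only; a production version would also flush on a timer and apply backpressure by pausing the Change Stream cursor while a flush is in flight (the class shape here is illustrative):

```javascript
// Size-bounded event batcher: accumulate change events, flush in bulk.
class Batcher {
  constructor(maxSize, flushFn) {
    this.maxSize = maxSize;
    this.flushFn = flushFn; // receives an array of buffered events
    this.buffer = [];
  }
  add(event) {
    this.buffer.push(event);
    if (this.buffer.length >= this.maxSize) this.flush();
  }
  flush() {
    if (this.buffer.length === 0) return;
    this.flushFn(this.buffer.splice(0)); // hand off and clear in one step
  }
}

const flushed = [];
const b = new Batcher(3, (batch) => flushed.push(batch.length));
[1, 2, 3, 4].forEach((i) => b.add({ seq: i }));
b.flush(); // drain the remainder
console.log(flushed); // → [ 3, 1 ]
```

Batch sizes in the hundreds to thousands of events are typical for bulk-optimized destinations, but the right number depends on event size and destination latency, so treat it as a tuning knob.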
Managed platforms like Streamkap handle all of these concerns automatically, dynamically scaling processing capacity based on your change volume.
Getting Started with MongoDB CDC
If you’ve made it this far, you understand the landscape: MongoDB Change Streams provide a powerful native foundation for CDC, but turning that foundation into a production-grade pipeline requires significant effort—unless you use a managed platform that handles the complexity for you.
Here’s the honest assessment. If your team has deep Kafka expertise, a dedicated data platform team, and the time to build and maintain a self-managed pipeline, the Debezium/Kafka approach gives you maximum control. But if you’d rather focus on building your product and getting value from your data—rather than operating distributed infrastructure—a managed platform is the clear choice.
Streamkap gives you production-ready MongoDB CDC with:
- Minutes to set up, not weeks
- Sub-second latency from MongoDB to your destination
- Zero infrastructure to provision or manage
- Automatic schema evolution for MongoDB’s flexible documents
- Exactly-once delivery guarantees
- Up to 90% cost reduction compared to traditional ETL
- Support for MongoDB Atlas (M10+, all regions) and self-hosted deployments
Whether you’re streaming to Snowflake, ClickHouse, Kafka, or any other destination, Streamkap connects your MongoDB data to the rest of your stack with the reliability and performance your business demands.
Ready to see it in action? Start your free trial and get MongoDB CDC running in minutes—no credit card, no infrastructure, no complexity.