Getting Started with CDC: Your First Real-Time Data Pipeline
A practical, beginner-friendly guide to change data capture (CDC). Learn what CDC is, how it works under the hood with WAL, binlog, and oplog, and how to build your first real-time data pipeline.
If you have ever waited hours for a batch ETL job to finish only to discover stale data in your warehouse, you already know the problem that change data capture (CDC) solves. CDC gives you a continuous stream of every insert, update, and delete happening in your source database, delivered in near real time. This guide walks you through the fundamentals, explains how CDC works at the log level, and helps you stand up your first pipeline.
What Is Change Data Capture?
At its core, CDC is a pattern for detecting and propagating data changes. Instead of running periodic SELECT * queries against your tables and diffing the results, CDC taps directly into the mechanism your database already uses to guarantee durability: its transaction log.
Every modern database writes changes to a log before applying them to the actual data files. This log exists for crash recovery. If the server loses power mid-transaction, it replays the log on startup to restore a consistent state. CDC tools simply attach to this same log and read changes as they appear.
The result is a stream of events that looks something like this:
{
"op": "u",
"before": { "id": 42, "status": "pending", "amount": 100.00 },
"after": { "id": 42, "status": "shipped", "amount": 100.00 },
"source": {
"table": "orders",
"ts_ms": 1740441600000
}
}
Each event tells you exactly what changed, what the row looked like before and after, and when it happened. Downstream consumers can use this information to update a search index, refresh a dashboard, populate a cache, or replicate the data into a warehouse.
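To make this concrete, here is a minimal sketch of how a downstream consumer might fold events like this into an in-memory copy of the table. The apply_event helper is illustrative, not part of any CDC tool's API, and it assumes Debezium-style events keyed by an id column:

```python
def apply_event(table_state: dict, event: dict) -> None:
    """Fold one CDC event into an in-memory copy of the table, keyed by
    primary key. Debezium op codes: 'c' = create, 'u' = update, 'd' = delete."""
    op = event["op"]
    if op in ("c", "u"):
        # Inserts and updates both carry the full row in "after"
        row = event["after"]
        table_state[row["id"]] = row
    elif op == "d":
        # Deletes identify the row via "before"
        table_state.pop(event["before"]["id"], None)

# Replay the update event shown above against a one-row table
orders = {42: {"id": 42, "status": "pending", "amount": 100.00}}
apply_event(orders, {
    "op": "u",
    "before": {"id": 42, "status": "pending", "amount": 100.00},
    "after": {"id": 42, "status": "shipped", "amount": 100.00},
})
print(orders[42]["status"])  # shipped
```

A real consumer would write to a cache, index, or warehouse instead of a dict, but the shape of the logic is the same.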
How CDC Works Under the Hood
Understanding the log format your database uses is the first step toward building a reliable pipeline. The three most common log types you will encounter are PostgreSQL’s WAL, MySQL’s binlog, and MongoDB’s oplog.
PostgreSQL: Write-Ahead Log (WAL)
PostgreSQL writes every data modification to its write-ahead log before committing it to the heap files on disk. For CDC, PostgreSQL provides logical replication slots, which decode the WAL into a structured stream of changes.
To enable logical replication, you need two things:
- Set wal_level = logical in postgresql.conf. This tells PostgreSQL to include enough information in the WAL for logical decoding.
- Create a replication slot using a decoder plugin. The most common plugin is pgoutput, which ships with PostgreSQL 10 and later.
-- Check your current wal_level
SHOW wal_level;
-- Create a replication slot (pgoutput is built-in)
SELECT pg_create_logical_replication_slot('streamkap_slot', 'pgoutput');
One thing to watch out for: replication slots retain WAL segments until a consumer reads them. If your CDC pipeline goes down for an extended period, WAL files accumulate on disk and can fill your storage. Monitoring slot lag is not optional.
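A quick way to keep an eye on this is to query pg_replication_slots on the source database. The query below is a monitoring sketch; the exact threshold you alert on depends on your disk headroom:

```sql
-- Approximate WAL retained by each replication slot
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```

An inactive slot with steadily growing retained_wal is the warning sign to act on.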
MySQL: Binary Log (binlog)
MySQL records changes in its binary log, which serves both replication and point-in-time recovery. For CDC, you need row-based logging, which captures the actual row values rather than the SQL statements that produced them.
Key settings in my.cnf:
[mysqld]
server-id = 1
log_bin = mysql-bin
binlog_format = ROW
binlog_row_image = FULL
Setting binlog_row_image = FULL ensures that both the before and after images of each row are present in the log. Without this, update events will be missing the “before” state, which makes it impossible to detect exactly which columns changed.
The CDC connector then poses as a MySQL replica, connecting to the primary and reading binlog events through the standard replication protocol. This is why you need to grant REPLICATION SLAVE and REPLICATION CLIENT privileges to the CDC user.
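Before starting the connector, it is worth confirming that the server actually picked up these settings, since a stale config file or a missed restart is a common failure mode:

```sql
-- Confirm the binlog configuration the CDC connector will see
SELECT @@server_id, @@log_bin, @@binlog_format, @@binlog_row_image;
```

You want log_bin = 1, binlog_format = ROW, and binlog_row_image = FULL in the output.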
MongoDB: Oplog
MongoDB’s operation log (oplog) is a capped collection in the local database that records every write operation on a replica set. Unlike the relational databases above, MongoDB’s oplog is already structured as a collection of documents, which makes it relatively straightforward to consume.
CDC tools open a change stream cursor against the oplog, filtering for the namespaces (databases and collections) they care about. MongoDB 3.6 introduced the Change Streams API, which provides a stable, resumable interface on top of the raw oplog.
// Example: opening a change stream in the MongoDB shell
const pipeline = [{ $match: { "ns.coll": "orders" } }];
const changeStream = db.getMongo().watch(pipeline);
while (changeStream.hasNext()) {
printjson(changeStream.next());
}
One practical note: change streams require a replica set. If you are running a standalone MongoDB instance for development, you will need to convert it to a single-node replica set before CDC will work.
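Converting a standalone development instance is quick. Assuming a local mongod and a replica set name of rs0 (the name is arbitrary), the steps look like this:

```javascript
// 1. Restart mongod with a replica set name, e.g.:
//    mongod --replSet rs0 --dbpath /data/db
// 2. Then, in the shell, initiate the single-node replica set:
rs.initiate()
// 3. Confirm this node has become PRIMARY before opening change streams
rs.status().members[0].stateStr
```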
Choosing Your Source Database
When you are setting up your first CDC pipeline, start with the database where stale data causes the most pain. For most teams, that is the primary transactional database backing the application. A few things to consider:
Access and permissions. You will need elevated privileges on the source database. For PostgreSQL, the CDC user needs REPLICATION and LOGIN roles plus SELECT on the tables you want to capture. For MySQL, you need REPLICATION SLAVE, REPLICATION CLIENT, and SELECT. Make sure your DBA or platform team is on board before you start.
Network connectivity. The CDC connector needs a stable, low-latency connection to the database. If your database is in a private VPC, you may need an SSH tunnel, VPC peering, or a PrivateLink connection. Intermittent connectivity causes the connector to disconnect and restart from its last checkpoint, which can introduce latency spikes.
Schema stability. CDC captures changes at the schema level. If your application team ships schema migrations multiple times a day, you will need a pipeline that handles schema changes gracefully. Some CDC tools propagate schema changes automatically, while others require manual intervention. Streamkap handles schema evolution out of the box, which removes a common source of pipeline failures.
Load characteristics. High-throughput OLTP databases with thousands of transactions per second produce a lot of CDC events. Make sure your downstream consumers can keep up. Starting with a lower-traffic table lets you validate the pipeline end to end before scaling up.
Setting Up Your First Pipeline
Let us walk through the practical steps. The exact commands vary depending on your tooling, but the general pattern is the same whether you are using Debezium, a managed platform like Streamkap, or another CDC tool.
Step 1: Configure the Source Database
Apply the settings described above for your database engine. For PostgreSQL, set wal_level = logical and create a replication slot. For MySQL, enable row-based binlog. For MongoDB, ensure you are running a replica set.
Restart the database if required. Changing wal_level in PostgreSQL and enabling the binary log in MySQL both require a server restart. Plan for a maintenance window if this is a production system.
Step 2: Create a CDC User
Create a dedicated database user for the CDC connector. Do not reuse your application’s credentials. A dedicated user makes it easier to audit access, set resource limits, and revoke permissions if needed.
-- PostgreSQL example
CREATE USER streamkap_cdc WITH REPLICATION LOGIN PASSWORD 'a-strong-password';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO streamkap_cdc;
-- MySQL example
CREATE USER 'streamkap_cdc'@'%' IDENTIFIED BY 'a-strong-password';
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'streamkap_cdc'@'%';
FLUSH PRIVILEGES;
Step 3: Set Up the Connector
If you are running Debezium yourself, this means deploying a Kafka Connect cluster, installing the appropriate connector plugin, and submitting a connector configuration via the REST API. A minimal PostgreSQL connector config looks like this:
{
"name": "orders-source",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "db.example.com",
"database.port": "5432",
"database.user": "streamkap_cdc",
"database.password": "a-strong-password",
"database.dbname": "app_production",
"table.include.list": "public.orders",
"topic.prefix": "app",
"plugin.name": "pgoutput",
"slot.name": "streamkap_slot"
}
}
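With the configuration saved to a file, you submit it to the Kafka Connect REST API. This sketch assumes a Connect worker on its default port (8083) and the config above saved as orders-source.json:

```shell
# Submit the connector config to a running Kafka Connect cluster
curl -X POST -H "Content-Type: application/json" \
  --data @orders-source.json \
  http://localhost:8083/connectors
```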
If you are using a managed platform like Streamkap, the setup is considerably simpler. You provide the connection details and select the tables you want to capture. The platform handles provisioning the connector, managing the underlying Kafka topics, and delivering data to your destination.
Step 4: Configure the Destination
Your CDC events need somewhere to go. Common destinations include data warehouses (Snowflake, BigQuery, ClickHouse), data lakes (S3, GCS), search engines (Elasticsearch, OpenSearch), and caches (Redis).
For a first pipeline, pick the destination your team already uses for analytics. This minimizes the number of new moving parts and lets you focus on getting the CDC source working correctly.
Step 5: Start the Pipeline and Monitor
Start the connector and watch the logs. The first thing that happens is the initial snapshot: the connector reads the current state of each selected table and produces a baseline set of events. After the snapshot completes, the connector switches to streaming mode and begins tailing the transaction log for new changes.
Key things to watch during the initial run:
- Snapshot progress. Large tables can take hours to snapshot. Monitor the connector’s metrics or logs to see which table it is currently reading.
- Replication slot lag. For PostgreSQL, check pg_stat_replication to make sure the slot is being consumed and WAL is not accumulating.
- Consumer lag. If you are using Kafka as the transport layer, monitor consumer group lag to ensure your destination sink is keeping up.
Verifying Data Flow
Once events are flowing, you need to verify that the data arriving at your destination matches the source. Here is a simple verification workflow:
- Insert a test row in the source table with a known, unique value.
- Check the destination for that row within a few seconds. If the pipeline is healthy, the row should appear almost immediately.
- Update the test row and verify that the change propagates.
- Delete the test row and confirm that the delete event is handled correctly by the destination. Some destinations soft-delete (mark as deleted), while others hard-delete the row.
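Using the orders table from the event example earlier (the id value and column names here are illustrative), the source-side test sequence might look like:

```sql
-- Step 1: insert a sentinel row with an unmistakable value
INSERT INTO orders (id, status, amount) VALUES (999999, 'cdc-test', 0.01);
-- Step 2 (after checking the destination): update it
UPDATE orders SET status = 'cdc-test-updated' WHERE id = 999999;
-- Step 3 (after checking again): delete it
DELETE FROM orders WHERE id = 999999;
```

Check the destination after each statement rather than running all three at once, so you can tell which operation type fails if one does.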
For ongoing monitoring, compare row counts between source and destination on a regular schedule. A simple SQL query on both sides can catch drift early:
-- Run on both source and destination
SELECT COUNT(*) FROM orders WHERE updated_at > NOW() - INTERVAL '1 hour';
If counts diverge, check for connector restarts, schema changes, or filtering rules that might be dropping events.
Common Pitfalls and How to Avoid Them
Forgetting to monitor replication slot lag. This is the number one cause of production incidents with PostgreSQL CDC. If the connector goes down, WAL files pile up, and your database disk fills. Set up alerts on slot lag and WAL disk usage.
Using statement-based binlog in MySQL. If binlog_format is set to STATEMENT or MIXED, the CDC connector will not receive row-level changes and will fail or produce incorrect data. Always use ROW.
Ignoring schema changes. Adding or removing columns can break CDC connectors that are not built to handle schema evolution. Test schema changes in a staging environment before applying them to production. Or use a platform like Streamkap that propagates schema changes automatically.
Running too many tables in a single connector. While it is technically possible to capture hundreds of tables in one connector, this creates a single point of failure and makes it harder to diagnose issues. Start with one or two tables, validate the pipeline, and then expand.
Not planning for the initial snapshot. The first time a CDC connector starts, it reads the entire current state of each table. For large tables (hundreds of millions of rows), this can take hours and place significant read load on the source. Schedule initial snapshots during low-traffic periods.
Scaling Beyond Your First Pipeline
Once your first pipeline is running reliably, you will naturally want to capture more tables, more databases, and more source types. Here are a few things that change as you scale:
Connector management. With dozens of connectors, you need a way to monitor health, restart failed tasks, and manage configuration across environments. Doing this with raw Kafka Connect REST calls gets tedious quickly.
Schema registry. As you add more tables, managing Avro or JSON schemas for each topic becomes important. A schema registry prevents consumers from breaking when schemas change.
Transformation logic. Raw CDC events are not always in the shape your destination needs. You may need to filter columns, mask sensitive fields, rename tables, or flatten nested structures. This is where stream processing (Kafka Streams, Flink, or built-in SMTs) comes in.
Operational overhead. Running Debezium, Kafka, Schema Registry, and Kafka Connect in production is a full-time job. Many teams find that the operational cost of self-hosting exceeds the engineering time they save by having real-time data. Streamkap was built to handle this entire stack as a managed service, so your team can focus on using the data rather than babysitting the infrastructure.
Where to Go from Here
You now have the mental model for how CDC works, the practical steps to set up a pipeline, and the verification techniques to make sure data is flowing correctly. The best next step is to pick one table, follow the steps above, and get events flowing to your destination. Nothing teaches CDC like seeing your first real-time event arrive in your warehouse seconds after you modify a row in your source database.
From there, explore topics like stream processing for in-flight transformations, CDC for event-driven microservices, and multi-region replication patterns. The foundation you build with your first pipeline is the same foundation that scales to hundreds of tables and billions of events per day.