CDC to Kafka: Building an Event Backbone from Database Changes
Use CDC to publish database changes as Kafka events, creating an event-driven backbone for microservices, analytics, and real-time applications without changing application code.
Most teams arrive at event-driven architecture the hard way. They bolt event publishing into application code, wrestle with dual-write failures, and end up maintaining a fragile mesh of producers scattered across dozens of services. There is a simpler path: treat your database as the event source it already is. Every transaction that commits to your database writes to its transaction log, and Change Data Capture (CDC) can read that log and publish each change as a Kafka event. The result is an event backbone that every downstream system can subscribe to, built without modifying a single line of application code.
This guide walks through the architecture, design decisions, and practical patterns for building a CDC-to-Kafka event backbone.
Why CDC-to-Kafka Instead of Application-Level Events
There are several ways to get database changes into Kafka. Each comes with trade-offs.
Application-level event publishing requires every service that writes to the database to also publish a corresponding event to Kafka. This creates a dual-write problem: you must ensure both the database write and the Kafka publish succeed atomically. If the database commits but the Kafka write fails, your event stream diverges from reality. You can mitigate this with retries and idempotency, but the complexity adds up across every writing service.
The transactional outbox pattern solves dual writes by writing the event to an outbox table within the same database transaction. A separate process then polls the outbox and publishes to Kafka. This guarantees consistency but introduces polling latency, requires an outbox table per service, and still needs custom publishing infrastructure.
Database triggers can fire on row changes and call external systems, but triggers run synchronously within the transaction, adding latency. They are also notoriously difficult to maintain, debug, and scale.
CDC from the transaction log sidesteps all of these problems. It reads the write-ahead log (WAL in PostgreSQL, binlog in MySQL, oplog in MongoDB) after the transaction commits. There is no dual-write risk because the database transaction is the single source of truth. There are no application code changes. There is no polling delay; changes are streamed continuously. The transaction log already exists; CDC simply makes it accessible.
CDC turns your existing database into a real-time event source. The transaction log captures every INSERT, UPDATE, and DELETE. CDC reads it and Kafka distributes it.
Architecture Overview
The CDC-to-Kafka architecture has four layers:
┌──────────────┐     ┌─────────┐     ┌─────────┐     ┌──────────────────┐
│  Source DB   │────▶│   CDC   │────▶│  Kafka  │────▶│    Consumers     │
│ (PostgreSQL, │     │ Capture │     │ Topics  │     │ (Analytics,      │
│  MySQL,      │     └─────────┘     └─────────┘     │  Search, Cache,  │
│  MongoDB)    │                                     │  Microservices)  │
└──────────────┘                                     └──────────────────┘
Source database - Your operational database. CDC reads from its transaction log using a replication slot (PostgreSQL), binlog reader (MySQL), or change stream (MongoDB). The database continues serving application traffic with minimal overhead.
CDC capture - The process that reads log events and converts them into structured change records. Each record contains the operation type (insert, update, delete), the full row data (or before/after images for updates), and metadata like the transaction ID, timestamp, and source table.
Kafka topics - The durable, partitioned event log. Topics act as the distribution backbone, decoupling the source database from all downstream consumers. Each topic retains events for a configurable period, allowing consumers to reprocess historical data.
Consumers - Any system that subscribes to one or more topics. A single stream of order changes can simultaneously feed an analytics warehouse, update a search index, invalidate a cache, and trigger notifications. Each consumer group tracks its own offset independently.
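The change record described in the capture layer can be modeled as a small value object. A minimal sketch, assuming a Debezium-style envelope (operation code, before/after row images, and source metadata) — the field names here are illustrative, not any specific connector's wire format:

```java
import java.util.Map;

// Illustrative model of one CDC change record flowing into Kafka.
public class ChangeRecord {
    public final String op;                  // "c" (create), "u" (update), "d" (delete)
    public final Map<String, Object> before; // row image before the change (null for inserts)
    public final Map<String, Object> after;  // row image after the change (null for deletes)
    public final String sourceTable;         // e.g. "public.orders"
    public final long txId;                  // transaction that produced the change
    public final long tsMs;                  // commit timestamp, epoch millis

    public ChangeRecord(String op, Map<String, Object> before, Map<String, Object> after,
                        String sourceTable, long txId, long tsMs) {
        this.op = op;
        this.before = before;
        this.after = after;
        this.sourceTable = sourceTable;
        this.txId = txId;
        this.tsMs = tsMs;
    }

    // Deletes carry no after-image; every other operation does.
    public boolean isDelete() { return "d".equals(op); }
}
```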
Topic Design
How you organize CDC data into Kafka topics has lasting implications for ordering, scalability, and operational simplicity.
One Topic Per Table
The most common pattern maps each source table to a dedicated Kafka topic:
database.public.orders → topic: orders
database.public.customers → topic: customers
database.public.products → topic: products
This approach provides clear ownership (one table, one topic), straightforward access control, and independent retention policies per table. Consumers subscribe only to the tables they care about.
Naming Conventions
Adopt a consistent naming scheme that encodes origin:
<environment>.<database>.<schema>.<table>
For example: prod.ecommerce.public.orders. This prevents collisions when multiple databases feed the same Kafka cluster and makes it easy to apply ACLs by prefix.
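The convention is simple enough to capture in a one-line helper; the function name is invented for illustration:

```java
// Assembles the <environment>.<database>.<schema>.<table> topic name
// described above. Purely illustrative.
public class TopicNames {
    public static String topicFor(String env, String database, String schema, String table) {
        return String.join(".", env, database, schema, table);
    }
}
```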
Partitioning by Primary Key
Use the table’s primary key as the Kafka message key. This guarantees that all changes for a given entity (e.g., order ID 12345) land in the same partition and are processed in order:
Key: order_id = 12345
Value: { "op": "update", "after": { "order_id": 12345, "status": "shipped", ... } }
With the primary key as the message key, a consumer processing partition 7 will see every change for order 12345 in the exact sequence it was committed. Without a key, changes for the same order could land on different partitions and arrive out of order.
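The property that makes this work is that partition selection is a pure function of the key. Kafka's default partitioner hashes the serialized key bytes with murmur2; the sketch below substitutes `String.hashCode()` purely to demonstrate the invariant — the same key always maps to the same partition:

```java
// Simplified stand-in for Kafka's key-based partition selection.
// Kafka actually hashes the serialized key with murmur2; hashCode()
// is used here only to show that the mapping is deterministic.
public class KeyPartitioner {
    public static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the result is non-negative,
        // mirroring what Kafka does with its hash value.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```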
Partition Count
Start with a partition count that matches your expected consumer parallelism. A topic with 12 partitions can support at most 12 concurrent consumers in a single consumer group. You can add partitions later, but doing so changes the key-to-partition mapping: new messages for an existing key may land on a different partition than that key's earlier messages, breaking per-key ordering across the transition.
Serialization Format
The format you choose for CDC Kafka messages affects payload size, schema safety, and consumer development experience.
| Feature | Avro | JSON | Protobuf |
|---|---|---|---|
| Payload size | Compact binary | Verbose text | Compact binary |
| Schema enforcement | Required (via Schema Registry) | None by default | Required (.proto files) |
| Schema evolution | Strong (backward/forward/full) | Manual versioning | Strong (field numbering) |
| Human readability | Requires deserialization | Natively readable | Requires deserialization |
| CDC ecosystem support | Excellent (Debezium, Confluent, Streamkap) | Good (universal) | Growing |
| Best for | Production CDC pipelines | Prototyping, debugging | gRPC-heavy environments |
Avro with Schema Registry is the standard choice for CDC-to-Kafka pipelines. It provides compact encoding, mandatory schema definitions, and compatibility enforcement that prevents producers from publishing breaking changes. Most CDC tools, including Debezium and Streamkap, support Avro serialization natively.
JSON is a reasonable starting point for prototyping or small-scale pipelines where simplicity matters more than efficiency. However, without schema enforcement, consumers must defensively handle unexpected field changes.
Protobuf delivers binary efficiency and strong schema evolution but is less commonly used in the CDC ecosystem compared to Avro.
Schema Registry
A schema registry stores and versions the schemas for your Kafka messages. When a producer serializes a message, it registers the schema and embeds the schema ID in the serialized payload (in Confluent's wire format, the first bytes of the message value). Consumers use that ID to fetch the correct schema for deserialization. This decouples producers and consumers while guaranteeing they agree on the data structure.
Compatibility Modes
Schema Registry enforces compatibility rules that control how schemas can evolve:
- Backward compatible - Consumers using the new schema can read data written with the previous schema. You can delete fields or add fields with defaults. This is the most common mode for CDC topics because consumers can upgrade independently.
- Forward compatible - Consumers using the previous schema can read data written with the new schema. You can add fields or delete fields with defaults. Useful when producers upgrade before consumers.
- Full compatible - Both backward and forward compatible simultaneously. The safest mode but the most restrictive: only adding or removing fields with defaults is allowed.
For CDC topics, backward compatibility is typically the right default. Database schema migrations that add nullable columns produce backward-compatible schema changes automatically. Dropping columns or changing data types requires careful coordination regardless of the compatibility mode.
Schema Evolution in Practice
When a source table adds a column, the CDC change record gains a new field. With Avro and backward compatibility, the updated schema is registered automatically. Existing consumers continue to work because Avro's schema resolution drops fields that the reader schema does not declare. New consumers see the full record including the new field. No downtime, no redeployment of existing consumers.
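What schema resolution does can be mimicked in plain Java: a consumer on the new schema reads a record written before the column existed and fills in the declared default. This map-based sketch is an illustration of the mechanism, not how Avro is implemented:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java mimic of backward-compatible schema resolution:
// fields missing from the written record take the reader's defaults.
public class SchemaResolver {
    public static Map<String, Object> resolve(Map<String, Object> written,
                                              Map<String, Object> readerDefaults) {
        Map<String, Object> result = new HashMap<>(readerDefaults);
        result.putAll(written); // fields actually present win over defaults
        return result;
    }
}
```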
Log Compaction
Standard Kafka topics delete old messages based on time (retention.ms) or size (retention.bytes). Log compaction offers a different retention model: Kafka keeps the latest message per key and discards older duplicates during background compaction.
When to Enable Compaction for CDC Topics
Enable log compaction when consumers need the current state of each entity rather than the full history of changes. For example:
- A cache that needs to be rebuilt from scratch should read only the latest version of each row.
- A new microservice joining the ecosystem needs the current state of every customer record without replaying months of updates.
Configure compaction alongside time-based retention for maximum flexibility:
cleanup.policy=compact,delete
retention.ms=604800000 # 7 days of full history
min.compaction.lag.ms=3600000 # Keep at least 1 hour before compacting
This setup retains all changes for 7 days (useful for reprocessing), while log compaction ensures the latest state per key is always available even after retention expires.
Delete Events and Tombstones
When a row is deleted from the source database, CDC emits a delete event. For compacted topics, the CDC connector should also emit a tombstone, a message with the same key but a null value. Kafka’s compaction process uses tombstones to eventually remove the key entirely, preventing compacted topics from growing indefinitely with records for deleted entities.
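The end state compaction converges toward is easy to simulate: the latest value per key survives, and a tombstone removes the key entirely. This sketch is only an illustration — Kafka compacts log segments asynchronously in the background, not eagerly like this:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simulation of the state a compacted topic converges toward.
public class CompactionSim {
    // Each entry is (key, value); a null value is a tombstone.
    public static Map<String, String> compact(List<Map.Entry<String, String>> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : log) {
            if (e.getValue() == null) {
                latest.remove(e.getKey());          // tombstone: drop the key
            } else {
                latest.put(e.getKey(), e.getValue()); // newer value wins
            }
        }
        return latest;
    }
}
```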
Practical Example: E-Commerce Event Backbone
Consider an e-commerce platform with PostgreSQL as its operational database. The application manages orders, customers, products, and inventory across multiple services.
Without CDC-to-Kafka, each service that needs order data queries the orders database directly or maintains its own copy through batch ETL jobs. The search service re-indexes from a nightly dump. The analytics warehouse loads data every hour. The notification service polls for status changes.
With CDC-to-Kafka, the architecture shifts:
PostgreSQL ──CDC──▶ Kafka Topics:
                    ├── prod.ecommerce.public.orders
                    ├── prod.ecommerce.public.customers
                    ├── prod.ecommerce.public.products
                    └── prod.ecommerce.public.inventory

Consumers:
├── Analytics warehouse (Snowflake) ── consumer group: analytics
├── Search index (Elasticsearch)    ── consumer group: search
├── Cache layer (Redis)             ── consumer group: cache
└── Notification service            ── consumer group: notifications
Every INSERT, UPDATE, and DELETE on the orders table becomes a Kafka event within seconds. The analytics warehouse ingests changes continuously instead of waiting for hourly batch loads. The search index updates in near real-time. The notification service triggers alerts the moment an order status changes. Each consumer group processes independently, so a slow analytics pipeline does not block real-time notifications.
Consumer Patterns
At-Least-Once Processing
Most CDC consumers operate with at-least-once delivery semantics. The consumer reads a batch of messages, processes them, and then commits the offset. If the consumer crashes after processing but before committing, it will re-read those messages on restart. This means consumer logic must be idempotent: processing the same message twice should produce the same result.
For database sinks, idempotency typically means using upserts (INSERT ON CONFLICT UPDATE) keyed on the primary key. For search indexes, re-indexing a document with the same ID is naturally idempotent.
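The idempotency requirement can be seen in miniature: key the write on the primary key with upsert semantics, and a redelivered message leaves the target unchanged. An in-memory map stands in for the target table in this sketch:

```java
import java.util.HashMap;
import java.util.Map;

// Idempotent sink sketch: equivalent in spirit to
// INSERT ... ON CONFLICT (pk) DO UPDATE, so at-least-once
// redelivery produces the same final state.
public class IdempotentSink {
    private final Map<String, Map<String, Object>> table = new HashMap<>();

    public void upsert(String pk, Map<String, Object> row) {
        table.put(pk, row); // last write wins, keyed on the primary key
    }

    public Map<String, Object> get(String pk) { return table.get(pk); }
}
```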
Consumer Group Management
Assign a distinct consumer group ID to each downstream system. This ensures that every system receives every message independently:
group.id=analytics-warehouse
group.id=search-indexer
group.id=cache-invalidator
group.id=notification-service
Within each group, Kafka distributes partitions across available consumer instances. Scaling up means adding more instances to the group. Scaling down triggers a rebalance that redistributes partitions to the remaining instances.
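The distribution itself can be sketched with a round-robin-style assignment — partition i goes to consumer i mod n. Kafka's real assignors (range, round-robin, cooperative sticky) are more sophisticated, but the redistribution idea is the same:

```java
import java.util.ArrayList;
import java.util.List;

// Toy round-robin partition assignment across a consumer group.
public class GroupAssignment {
    public static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        List<List<Integer>> out = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) out.add(new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) {
            out.get(p % numConsumers).add(p); // spread partitions over consumers
        }
        return out;
    }
}
```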
Offset Handling
Consumers should commit offsets after successfully processing a batch, not before. Use manual offset management for critical pipelines:
while (true) {
    ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, GenericRecord> record : records) {
        processChange(record); // Apply the change downstream
    }
    consumer.commitSync(); // Commit only after successful processing
}
If your consumer writes to a system that supports transactions (like a database), you can store the Kafka offset alongside the processed data in the same transaction. On restart, query the stored offset and seek to that position, achieving effectively exactly-once processing without Kafka’s transactional API.
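The offset-with-data pattern can be sketched without any Kafka machinery. Here a synchronized method stands in for the database transaction: the row and the offset commit together, so replayed messages are skipped and a restarted consumer knows exactly where to seek:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of storing the Kafka offset alongside the processed data.
// A lock stands in for the database transaction that would make the
// row write and the offset write atomic.
public class TransactionalSink {
    private final Map<String, String> rows = new HashMap<>();
    private long committedOffset = -1;

    public synchronized void apply(long offset, String key, String value) {
        if (offset <= committedOffset) return; // already applied: skip replay
        rows.put(key, value);
        committedOffset = offset;              // committed "with" the row
    }

    // On restart, seek the consumer to this position.
    public synchronized long resumeFrom() { return committedOffset + 1; }

    public synchronized String get(String key) { return rows.get(key); }
}
```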
Self-Managed vs Managed Platforms
The Self-Managed Path
The open-source stack for CDC-to-Kafka typically involves Debezium running on Kafka Connect. This gives you full control but comes with significant operational responsibility:
- Kafka cluster management - Broker provisioning, replication configuration, partition rebalancing, upgrades, and monitoring.
- Kafka Connect cluster - Separate infrastructure to run and scale connectors, manage connector configurations, and handle failures.
- Schema Registry - Another service to deploy, monitor, and back up.
- Connector lifecycle - Handling snapshot completion, replication slot management, offset tracking, and schema change propagation.
- Monitoring - Consumer lag, connector health, replication slot growth, and disk usage across all components.
Teams that go this route typically need at least one dedicated engineer to keep the pipeline running smoothly.
The Managed Path
Managed platforms like Streamkap handle the full CDC-to-Kafka pipeline as a service. The platform manages connector deployment, Kafka topic provisioning, schema registry integration, and monitoring. You configure the source database and destination, and the platform handles the rest, including automatic schema evolution, snapshot management, and exactly-once delivery guarantees.
This approach makes particular sense when your goal is the downstream value (real-time analytics, search, microservice events) rather than building expertise in Kafka operations. The event backbone pattern is powerful precisely because it is infrastructure that should be reliable and invisible, not a project in itself.
Getting Started
The CDC-to-Kafka pattern works best when you start focused and expand:
- Pick one high-value table - Choose a table with frequent writes whose changes matter to multiple downstream systems.
- Set up CDC capture - Connect to the database’s transaction log. Use Avro serialization with Schema Registry from the start.
- Create the topic - Use the primary key as the message key, set an appropriate partition count, and enable log compaction if consumers need current state.
- Add your first consumer - Start with the highest-value downstream system, whether that is analytics, search, or cache invalidation.
- Expand incrementally - Add more tables and more consumers as the pattern proves its value.
The beauty of this architecture is that adding a new consumer to an existing topic requires zero changes to the source database or the CDC pipeline. The event backbone is already there, and new systems simply subscribe.