
Engineering

February 25, 2026

11 min read

CDC to Kafka: Building an Event Backbone from Database Changes

Use CDC to publish database changes as Kafka events, creating an event-driven backbone for microservices, analytics, and real-time applications without changing application code.

TL;DR:

  • CDC-to-Kafka turns your database into an event source. Every INSERT, UPDATE, and DELETE becomes a Kafka event that any downstream consumer can subscribe to.
  • This pattern enables event-driven architecture without rewriting applications or adding event publishing code. The database transaction log IS the event stream.
  • Topic design (one topic per table vs custom routing), serialization format (Avro vs JSON vs Protobuf), and schema registry integration are the key design decisions.

Most teams arrive at event-driven architecture the hard way. They bolt event publishing into application code, wrestle with dual-write failures, and end up maintaining a fragile mesh of producers scattered across dozens of services. There is a simpler path: treat your database as the event source it already is. Every transaction that commits to your database writes to its transaction log, and Change Data Capture (CDC) can read that log and publish each change as a Kafka event. The result is an event backbone that every downstream system can subscribe to, built without modifying a single line of application code.

This guide walks through the architecture, design decisions, and practical patterns for building a CDC-to-Kafka event backbone.

Why CDC-to-Kafka Instead of Application-Level Events

There are several ways to get database changes into Kafka. Each comes with trade-offs.

Application-level event publishing requires every service that writes to the database to also publish a corresponding event to Kafka. This creates a dual-write problem: you must ensure both the database write and the Kafka publish succeed atomically. If the database commits but the Kafka write fails, your event stream diverges from reality. You can mitigate this with retries and idempotency, but the complexity adds up across every writing service.

The transactional outbox pattern solves dual writes by writing the event to an outbox table within the same database transaction. A separate process then polls the outbox and publishes to Kafka. This guarantees consistency but introduces polling latency, requires an outbox table per service, and still needs custom publishing infrastructure.

Database triggers can fire on row changes and call external systems, but triggers run synchronously within the transaction, adding latency. They are also notoriously difficult to maintain, debug, and scale.

CDC from the transaction log sidesteps all of these problems. It reads the write-ahead log (WAL in PostgreSQL, binlog in MySQL, oplog in MongoDB) after the transaction commits. There is no dual-write risk because the database transaction is the single source of truth. There are no application code changes. There is no polling delay; changes are streamed continuously. The transaction log already exists; CDC simply makes it accessible.

CDC turns your existing database into a real-time event source. The transaction log captures every INSERT, UPDATE, and DELETE. CDC reads it and Kafka distributes it.

Architecture Overview

The CDC-to-Kafka architecture has four layers:

┌──────────────┐     ┌─────────┐     ┌──────────────┐     ┌──────────────────┐
│  Source DB   │────▶│   CDC   │────▶│    Kafka     │────▶│   Consumers      │
│ (PostgreSQL, │     │ Capture │     │   Topics     │     │ (Analytics,      │
│  MySQL,      │     │         │     │              │     │  Search, Cache,  │
│  MongoDB)    │     └─────────┘     └──────────────┘     │  Microservices)  │
└──────────────┘                                          └──────────────────┘

Source database - Your operational database. CDC reads from its transaction log using a replication slot (PostgreSQL), binlog reader (MySQL), or change stream (MongoDB). The database continues serving application traffic with minimal overhead.

CDC capture - The process that reads log events and converts them into structured change records. Each record contains the operation type (insert, update, delete), the full row data (or before/after images for updates), and metadata like the transaction ID, timestamp, and source table.
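For example, an order status update might produce a change record like the following. The envelope style (op, before/after images, source metadata) follows Debezium's conventions; field names and values here are illustrative, and the exact layout varies by CDC tool:

{
  "op": "u",
  "before": { "order_id": 12345, "status": "pending" },
  "after":  { "order_id": 12345, "status": "shipped" },
  "source": { "table": "orders", "txId": 98231, "ts_ms": 1740355200000 }
}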

Kafka topics - The durable, partitioned event log. Topics act as the distribution backbone, decoupling the source database from all downstream consumers. Each topic retains events for a configurable period, allowing consumers to reprocess historical data.

Consumers - Any system that subscribes to one or more topics. A single stream of order changes can simultaneously feed an analytics warehouse, update a search index, invalidate a cache, and trigger notifications. Each consumer group tracks its own offset independently.

Topic Design

How you organize CDC data into Kafka topics has lasting implications for ordering, scalability, and operational simplicity.

One Topic Per Table

The most common pattern maps each source table to a dedicated Kafka topic:

database.public.orders      → topic: orders
database.public.customers   → topic: customers
database.public.products    → topic: products

This approach provides clear ownership (one table, one topic), straightforward access control, and independent retention policies per table. Consumers subscribe only to the tables they care about.

Naming Conventions

Adopt a consistent naming scheme that encodes origin:

<environment>.<database>.<schema>.<table>

For example: prod.ecommerce.public.orders. This prevents collisions when multiple databases feed the same Kafka cluster and makes it easy to apply ACLs by prefix.
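With Debezium, for example, the topic.prefix connector property supplies the leading components of this scheme, and the schema and table names are appended automatically. The connector name and property values below are illustrative:

{
  "name": "prod-ecommerce-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "topic.prefix": "prod.ecommerce",
    "table.include.list": "public.orders,public.customers,public.products"
  }
}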

Partitioning by Primary Key

Use the table’s primary key as the Kafka message key. This guarantees that all changes for a given entity (e.g., order ID 12345) land in the same partition and are processed in order:

Key:   order_id = 12345
Value: { "op": "update", "after": { "order_id": 12345, "status": "shipped", ... } }

With the primary key as the message key, a consumer processing partition 7 will see every change for order 12345 in the exact sequence it was committed. Without a key, changes for the same order could land on different partitions and arrive out of order.

Partition Count

Start with a partition count that matches your expected consumer parallelism. A topic with 12 partitions can support up to 12 concurrent consumers in a single consumer group. You can increase partitions later, but be aware that changing the partition count redistributes keys across partitions, which temporarily breaks ordering for in-flight data.
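Creating a topic with explicit partition and replication settings looks like this (topic name, counts, and broker address are examples):

kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic prod.ecommerce.public.orders \
  --partitions 12 --replication-factor 3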

Serialization Format

The format you choose for CDC Kafka messages affects payload size, schema safety, and consumer development experience.

Feature                 Avro                              JSON                     Protobuf
Payload size            Compact binary                    Verbose text             Compact binary
Schema enforcement      Required (via Schema Registry)    None by default          Required (.proto files)
Schema evolution        Strong (backward/forward/full)    Manual versioning        Strong (field numbering)
Human readability       Requires deserialization          Natively readable        Requires deserialization
CDC ecosystem support   Excellent (Debezium, Confluent,   Good (universal)         Growing
                        Streamkap)
Best for                Production CDC pipelines          Prototyping, debugging   gRPC-heavy environments

Avro with Schema Registry is the standard choice for CDC-to-Kafka pipelines. It provides compact encoding, mandatory schema definitions, and compatibility enforcement that prevents producers from publishing breaking changes. Most CDC tools, including Debezium and Streamkap, support Avro serialization natively.

JSON is a reasonable starting point for prototyping or small-scale pipelines where simplicity matters more than efficiency. However, without schema enforcement, consumers must defensively handle unexpected field changes.

Protobuf delivers binary efficiency and strong schema evolution but is less commonly used in the CDC ecosystem compared to Avro.
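To make the Avro option concrete, a simplified value schema for order change records might look like this. This is a sketch only; real CDC tools generate a richer envelope with before/after images and source metadata:

{
  "type": "record",
  "name": "OrderChange",
  "namespace": "prod.ecommerce",
  "fields": [
    { "name": "op",       "type": "string" },
    { "name": "order_id", "type": "long" },
    { "name": "status",   "type": ["null", "string"], "default": null }
  ]
}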

Schema Registry

A schema registry stores and versions the schemas for your Kafka messages. When a producer serializes a message, it registers the schema and embeds a schema ID in the message header. Consumers use that ID to fetch the correct schema for deserialization. This decouples producers and consumers while guaranteeing they agree on the data structure.

Compatibility Modes

Schema Registry enforces compatibility rules that control how schemas can evolve:

  • Backward compatible - New schema can read data written with the previous schema. You can add optional fields or remove fields that had defaults. This is the most common mode for CDC topics because consumers can upgrade independently.
  • Forward compatible - Old schema can read data written with the new schema. You can remove optional fields or add fields with defaults. Useful when producers upgrade before consumers.
  • Full compatible - Both backward and forward compatible simultaneously. The safest mode but the most restrictive: only adding or removing optional fields with defaults is allowed.

For CDC topics, backward compatibility is typically the right default. Database schema migrations that add nullable columns produce backward-compatible schema changes automatically. Dropping columns or changing data types requires careful coordination regardless of the compatibility mode.
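With Confluent Schema Registry, the compatibility mode is set per subject through its REST API. The host and subject name below are examples; by convention the subject for a topic's value schema is the topic name with a -value suffix:

curl -X PUT http://schema-registry:8081/config/prod.ecommerce.public.orders-value \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"compatibility": "BACKWARD"}'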

Schema Evolution in Practice

When a source table adds a column, the CDC change record gains a new field. With Avro and backward compatibility, the updated schema is registered automatically. Existing consumers continue to work because Avro ignores unknown fields by default. New consumers see the full record including the new field. No downtime, no redeployment of existing consumers.

Log Compaction

Standard Kafka topics delete old messages based on time (retention.ms) or size (retention.bytes). Log compaction offers a different retention model: Kafka keeps the latest message per key and discards older duplicates during background compaction.

When to Enable Compaction for CDC Topics

Enable log compaction when consumers need the current state of each entity rather than the full history of changes. For example:

  • A cache that needs to be rebuilt from scratch should read only the latest version of each row.
  • A new microservice joining the ecosystem needs the current state of every customer record without replaying months of updates.

Configure compaction alongside time-based retention for maximum flexibility:

cleanup.policy=compact,delete
retention.ms=604800000          # 7 days of full history
min.compaction.lag.ms=3600000   # Keep at least 1 hour before compacting

This setup retains all changes for 7 days (useful for reprocessing), while log compaction ensures the latest state per key is always available even after retention expires.

Delete Events and Tombstones

When a row is deleted from the source database, CDC emits a delete event. For compacted topics, the CDC connector should also emit a tombstone, a message with the same key but a null value. Kafka’s compaction process uses tombstones to eventually remove the key entirely, preventing compacted topics from growing indefinitely with records for deleted entities.
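Continuing the earlier example, deleting order 12345 produces a tombstone:

Key:   order_id = 12345
Value: null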

Practical Example: E-Commerce Event Backbone

Consider an e-commerce platform with PostgreSQL as its operational database. The application manages orders, customers, products, and inventory across multiple services.

Without CDC-to-Kafka, each service that needs order data queries the orders database directly or maintains its own copy through batch ETL jobs. The search service re-indexes from a nightly dump. The analytics warehouse loads data every hour. The notification service polls for status changes.

With CDC-to-Kafka, the architecture shifts:

PostgreSQL ──CDC──▶ Kafka Topics:
                     ├── prod.ecommerce.public.orders
                     ├── prod.ecommerce.public.customers
                     ├── prod.ecommerce.public.products
                     └── prod.ecommerce.public.inventory

Consumers:
  ├── Analytics warehouse (Snowflake) ── consumer group: analytics
  ├── Search index (Elasticsearch)    ── consumer group: search
  ├── Cache layer (Redis)             ── consumer group: cache
  └── Notification service            ── consumer group: notifications

Every INSERT, UPDATE, and DELETE on the orders table becomes a Kafka event within seconds. The analytics warehouse ingests changes continuously instead of waiting for hourly batch loads. The search index updates in near real-time. The notification service triggers alerts the moment an order status changes. Each consumer group processes independently, so a slow analytics pipeline does not block real-time notifications.

Consumer Patterns

At-Least-Once Processing

Most CDC consumers operate with at-least-once delivery semantics. The consumer reads a batch of messages, processes them, and then commits the offset. If the consumer crashes after processing but before committing, it will re-read those messages on restart. This means consumer logic must be idempotent: processing the same message twice should produce the same result.

For database sinks, idempotency typically means using upserts (INSERT ON CONFLICT UPDATE) keyed on the primary key. For search indexes, re-indexing a document with the same ID is naturally idempotent.
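In PostgreSQL, for instance, an idempotent sink write takes this form (the table and columns are illustrative):

INSERT INTO orders_replica (order_id, status, updated_at)
VALUES ($1, $2, $3)
ON CONFLICT (order_id)
DO UPDATE SET status = EXCLUDED.status, updated_at = EXCLUDED.updated_at;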

Consumer Group Management

Assign a distinct consumer group ID to each downstream system. This ensures that every system receives every message independently:

group.id=analytics-warehouse
group.id=search-indexer
group.id=cache-invalidator
group.id=notification-service

Within each group, Kafka distributes partitions across available consumer instances. Scaling up means adding more instances to the group. Scaling down triggers a rebalance that redistributes partitions to the remaining instances.

Offset Handling

Consumers should commit offsets after successfully processing a batch, not before. Use manual offset management for critical pipelines:

// Assumes a KafkaConsumer<String, GenericRecord> created with
// enable.auto.commit=false and already subscribed to the CDC topics.
while (true) {
    ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, GenericRecord> record : records) {
        processChange(record);  // Apply the change downstream
    }
    consumer.commitSync();      // Commit offsets only after the whole batch succeeds
}

If your consumer writes to a system that supports transactions (like a database), you can store the Kafka offset alongside the processed data in the same transaction. On restart, query the stored offset and seek to that position, achieving effectively exactly-once processing without Kafka’s transactional API.

Self-Managed vs Managed Platforms

The Self-Managed Path

The open-source stack for CDC-to-Kafka typically involves Debezium running on Kafka Connect. This gives you full control but comes with significant operational responsibility:

  • Kafka cluster management - Broker provisioning, replication configuration, partition rebalancing, upgrades, and monitoring.
  • Kafka Connect cluster - Separate infrastructure to run and scale connectors, manage connector configurations, and handle failures.
  • Schema Registry - Another service to deploy, monitor, and back up.
  • Connector lifecycle - Handling snapshot completion, replication slot management, offset tracking, and schema change propagation.
  • Monitoring - Consumer lag, connector health, replication slot growth, and disk usage across all components.

Teams that go this route typically need at least one dedicated engineer to keep the pipeline running smoothly.

The Managed Path

Managed platforms like Streamkap handle the full CDC-to-Kafka pipeline as a service. The platform manages connector deployment, Kafka topic provisioning, schema registry integration, and monitoring. You configure the source database and destination, and the platform handles the rest, including automatic schema evolution, snapshot management, and exactly-once delivery guarantees.

This approach makes particular sense when your goal is the downstream value (real-time analytics, search, microservice events) rather than building expertise in Kafka operations. The event backbone pattern is powerful precisely because it is infrastructure that should be reliable and invisible, not a project in itself.

Getting Started

The CDC-to-Kafka pattern works best when you start focused and expand:

  1. Pick one high-value table - Choose a table with frequent writes whose changes matter to multiple downstream systems.
  2. Set up CDC capture - Connect to the database’s transaction log. Use Avro serialization with Schema Registry from the start.
  3. Create the topic - Use the primary key as the message key, set an appropriate partition count, and enable log compaction if consumers need current state.
  4. Add your first consumer - Start with the highest-value downstream system, whether that is analytics, search, or cache invalidation.
  5. Expand incrementally - Add more tables and more consumers as the pattern proves its value.

The beauty of this architecture is that adding a new consumer to an existing topic requires zero changes to the source database or the CDC pipeline. The event backbone is already there, and new systems simply subscribe.