
Engineering

February 25, 2026

11 min read

Self-Managed Debezium: The Operational Reality of DIY CDC

Debezium is the best open-source CDC tool available. It's also a full-time job to run in production. Here's what you'll actually deal with when you self-manage Debezium and Kafka Connect.

TL;DR: Running Debezium in production means managing Kafka Connect clusters, handling connector failures and restarts, dealing with PostgreSQL replication slot bloat, tuning snapshot performance, managing schema changes, and debugging deserialization errors at 3am. It's the most underestimated operational burden in the CDC space.

Debezium is the best open-source Change Data Capture tool available today. That is not a controversial statement. The project has first-class connectors for PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, and others. It reads database transaction logs directly, which means low overhead on the source database and true row-level change events. The engineering is solid, the community is active, and the documentation is better than that of most commercial products.

None of that changes the fact that running Debezium in production is a significant operational commitment. The problem is not Debezium itself. The problem is everything Debezium needs around it to actually work - and everything that can go wrong once you have data flowing.

This article is for teams that are evaluating self-managed Debezium or already running it and recognizing the operational weight. We use Debezium ourselves. We know what it takes.

The Kafka Connect Dependency

In its primary deployment model, Debezium runs as a set of Kafka Connect source connectors. To run Kafka Connect, you need Apache Kafka. To run Kafka, you historically needed ZooKeeper (now KRaft, but many production clusters are still on ZooKeeper). So before you capture a single change event from your database, you are already managing three distributed systems.

This is not a criticism of the architecture. Kafka Connect is a well-designed framework for running connectors, and Kafka provides the durable, ordered log that makes CDC reliable. But the operational surface area is real. Each system has its own configuration, its own failure modes, its own tuning parameters, and its own upgrade path. You need to understand Kafka broker configuration, topic partitioning, replication factors, retention policies, and consumer group management - all before you think about the CDC part.

Teams that set out to “just stream changes from Postgres to Snowflake” often discover three months later that they have become a Kafka operations team that also happens to do CDC.

Connector Lifecycle Management

A Debezium connector running on Kafka Connect has a state machine: RUNNING, PAUSED, FAILED, UNASSIGNED. In theory, you deploy a connector, it starts reading the transaction log, and it runs. In practice, connectors fail. They fail for all sorts of reasons - database connection timeouts, replication slot issues, out-of-memory errors during snapshots, schema changes that the connector cannot handle, network partitions between Connect workers and Kafka brokers.

When a connector task transitions to FAILED, Kafka Connect does not automatically restart it. The connector sits in FAILED state until someone notices and issues a restart via the REST API. If you do not have monitoring on connector task status - and many teams do not, at least initially - the connector can sit dead for hours or days. During that time, no changes are captured, and depending on your source database, the replication slot or binlog position may become invalid.
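
A basic watchdog closes this gap. The sketch below polls the standard Kafka Connect REST API for a connector's status and restarts any FAILED tasks; the endpoint paths are the real Connect API, but the worker URL and the idea of running this on a timer or in a cron job are assumptions you would adapt.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083"  # assumed Connect worker address

def failed_tasks(status: dict) -> list:
    """Return ids of tasks reported in FAILED state by /connectors/<name>/status."""
    return [t["id"] for t in status.get("tasks", []) if t.get("state") == "FAILED"]

def restart_failed(connector: str) -> None:
    """Fetch connector status and restart each FAILED task via the REST API."""
    with urllib.request.urlopen(f"{CONNECT_URL}/connectors/{connector}/status") as resp:
        status = json.load(resp)
    for task_id in failed_tasks(status):
        req = urllib.request.Request(
            f"{CONNECT_URL}/connectors/{connector}/tasks/{task_id}/restart",
            method="POST",
        )
        urllib.request.urlopen(req)
```

A blind auto-restart loop is a blunt instrument - it will happily mask a recurring failure - so pair it with alerting rather than using it as a substitute for one.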

Kafka Connect also rebalances connector tasks across workers when workers join or leave the cluster. This rebalancing stops all connectors on the cluster temporarily, which means even a routine worker restart can interrupt every CDC pipeline you are running. The incremental cooperative rebalancing protocol improved this, but it is still a disruption that you need to understand and plan around.

PostgreSQL Replication Slots: The Silent Database Killer

If there is one thing that makes PostgreSQL CDC dangerous when self-managed, it is replication slots. Debezium uses a logical replication slot to read changes from the PostgreSQL write-ahead log (WAL). The slot tells PostgreSQL which WAL segments have been consumed by the connector. PostgreSQL will not delete WAL segments that the slot has not yet acknowledged.

Here is the problem: if the Debezium connector goes down and stays down, the replication slot keeps accumulating WAL. PostgreSQL cannot reclaim that disk space. The WAL grows and grows until the disk fills up. When the disk fills up, PostgreSQL stops accepting writes. Your production database is now down - not because of a database issue, but because a CDC connector in a separate system failed and nobody restarted it.

This is not a theoretical risk. It happens. Teams learn about replication slot bloat the hard way, usually at 2am when their application starts throwing errors because the database cannot write to disk. The fix is straightforward - drop the replication slot, free the WAL - but by that point you have lost your position in the change stream and need to re-snapshot the affected tables.

Monitoring pg_replication_slots and the WAL size should be the first thing you set up when running Debezium against PostgreSQL. Setting max_slot_wal_keep_size in PostgreSQL 13+ provides a safety valve, but it means the slot gets invalidated if the connector falls too far behind, which brings its own recovery challenges.
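
A minimal version of that monitoring can live in a cron job or a metrics exporter. The query below uses the standard pg_replication_slots view and pg_wal_lsn_diff (PostgreSQL 10+); the helpers show the LSN arithmetic if you would rather compute retained WAL client-side. The alert threshold is an assumption to tune against your disk size.

```python
# Run against the source database; pg_wal_lsn_diff reports how many bytes
# of WAL each slot is still pinning (PostgreSQL 10+).
SLOT_QUERY = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
"""

def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/16B3748' to an absolute byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def slot_needs_alert(current_lsn: str, restart_lsn: str, threshold_bytes: int) -> bool:
    """True when the slot is retaining more WAL than the alert threshold."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn) > threshold_bytes
```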

MySQL Binlog Retention

MySQL has a different version of the same problem. Debezium reads from the MySQL binlog, tracking its position using binlog file names and offsets (or GTID sets). MySQL purges old binlog files based on binlog_expire_logs_seconds (or the older expire_logs_days). If the Debezium connector stops for longer than your retention period, the binlog files it needs are gone.

When the connector restarts and discovers its position no longer exists in the binlog, it cannot resume. The error messages will reference binlog files that no longer exist on the server. Your options at that point are to re-snapshot the affected tables or manually reset the connector’s offsets to the current binlog position (accepting that you will miss any changes that occurred during the gap).

GTID-based replication helps somewhat because it is position-independent, but GTID gaps - where the server has purged transactions that the connector has not consumed - produce the same result. The connector cannot resume, and you are back to a re-snapshot.

Teams running MySQL with Debezium need to carefully align their binlog retention with their maximum acceptable connector downtime. That is an operational decision that requires understanding both systems deeply.
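
As a rough sketch of that alignment (the safety factor and the idea of budgeting an explicit recovery window are assumptions, not a Debezium recommendation):

```python
def required_binlog_retention_seconds(max_downtime_hours: float,
                                      recovery_hours: float,
                                      safety_factor: float = 2.0) -> int:
    """Retention must outlast the longest connector outage you intend to
    survive without a re-snapshot, plus the time it takes to actually
    recover, with headroom for estimation error (the 2x factor here is an
    arbitrary starting point, not a Debezium recommendation)."""
    return int((max_downtime_hours + recovery_hours) * safety_factor * 3600)

# e.g. tolerate a 24h outage plus an 8h recovery window:
#   SET PERSIST binlog_expire_logs_seconds = 230400;
# (230400 == required_binlog_retention_seconds(24, 8))
```

The trade-off is disk: every hour of retention you add is binlog the MySQL host has to store, so the number has to be negotiated with whoever owns the database's capacity planning.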

The Snapshot Problem

When you first set up a Debezium connector, or when you add new tables to an existing connector, Debezium needs to snapshot the current state of those tables before it can start streaming changes. For small tables, this is fast and uneventful. For tables with hundreds of millions of rows, it becomes a serious operational challenge.

The legacy snapshot mode locks tables during the snapshot (depending on isolation level and database), which can block writes on the source database. Even the newer incremental snapshot approach, which reads chunks of the table without holding locks for the entire duration, puts significant read load on the database. The snapshot generates a large volume of messages that flow through Kafka Connect and into Kafka, which can cause memory pressure on Connect workers and spike Kafka broker disk usage.

Snapshots of large tables can take hours or even days. During that time, the connector is also buffering streaming changes, which adds to memory pressure. If the Connect worker runs out of heap space and crashes during a snapshot, the snapshot restarts from the beginning. There is no checkpoint-and-resume for the legacy snapshot mode. Incremental snapshots handle this better, but they require Debezium’s signaling table to be set up correctly in the source database - another piece of configuration to manage.

Getting snapshots right for large tables usually involves tuning snapshot.fetch.size, adjusting JVM heap on Connect workers, and sometimes coordinating with database administrators to schedule snapshots during low-traffic periods. It is doable, but it is not “deploy and forget.”
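
For incremental snapshots, the trigger is a row written to the signaling table. The sketch below builds that INSERT; the table name public.debezium_signal is an assumption - it must match whatever signal.data.collection is set to in your connector config.

```python
import json
import uuid

SIGNAL_TABLE = "public.debezium_signal"  # assumed; must match signal.data.collection

def execute_snapshot_signal(tables: list) -> tuple:
    """Build a parameterized INSERT for Debezium's signaling table that
    triggers an ad-hoc incremental snapshot of the given tables."""
    data = json.dumps({"data-collections": tables, "type": "incremental"})
    sql = f"INSERT INTO {SIGNAL_TABLE} (id, type, data) VALUES (%s, %s, %s)"
    return sql, (str(uuid.uuid4()), "execute-snapshot", data)
```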

Schema Changes Break Pipelines

Databases change. Columns get added, types get altered, tables get renamed. In a batch ETL world, you handle this in your transformation layer. In a streaming CDC world, schema changes propagate through the entire pipeline in real time, and anything that does not expect the new schema will break.

Debezium tracks schema changes and can propagate them through the Schema Registry (if you are using Avro or Protobuf serialization). But the Schema Registry has compatibility rules - backward, forward, full - and a schema change that violates your compatibility setting will cause the connector to fail with a serialization error. An ALTER TABLE that adds a required column without a default, for example, is not backward-compatible under the default setting, and the connector will stop producing events until you either change the compatibility setting or manually update the schema in the registry.
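
The rule is easier to reason about with a toy model. The function below captures the BACKWARD case for record schemas under simplifying assumptions - it ignores type changes, renames, and aliases, and models each field only by whether it has a default:

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Toy model of BACKWARD compatibility for record schemas: a consumer
    compiled against new_fields must be able to read data written with
    old_fields. Deleted fields are fine (the new reader ignores them);
    added fields are only fine if they carry a default the reader can
    fall back on. Each dict maps field name -> has_default."""
    added = set(new_fields) - set(old_fields)
    return all(new_fields[name] for name in added)
```

So dropping a column sails through under BACKWARD, while adding a NOT NULL column without a default does not - which is exactly the asymmetry that surprises teams when a "small" migration stops the pipeline.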

Downstream consumers have their own problems. A sink connector reading Avro messages from Kafka will throw deserialization errors if it encounters a schema it was not compiled against. A Snowflake sink might fail to evolve the target table if the column type change is not supported. Every schema change becomes a coordination event across the entire pipeline.

The worst version of this is when someone runs an ALTER TABLE on the source database without telling the data team. The pipeline breaks, the error message references an Avro schema compatibility violation, and you spend an hour tracing back to a column rename that happened during a routine application deployment. This is a people problem as much as a technology problem, but self-managed Debezium gives you no guardrails against it.

Monitoring: What You Need and What You Get

Running Debezium in production requires monitoring at multiple layers. You need to watch Kafka Connect worker health, individual connector and task status, connector lag (how far behind the connector is from the current database position), Kafka broker health, topic lag for consumers, and - critically - database-side metrics like replication slot size and WAL growth.

Kafka Connect exposes metrics via JMX, which means you need a JMX exporter to get those metrics into your monitoring system. Debezium-specific metrics like MilliSecondsBehindSource and NumberOfEventsFiltered are available via JMX as well, but require knowing which MBean paths to query. Building a thorough dashboard that covers the full stack - database replication health, connector status, Kafka throughput, consumer lag - is a project in itself.
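
Once the JMX metrics are scraped (for example by the Prometheus JMX exporter), a lag alert is a few lines. The MBean pattern below is the Debezium PostgreSQL connector's streaming context; the shape of the scraped-metrics dict is an assumption about whatever exporter you use.

```python
# MBean pattern for the Debezium PostgreSQL connector's streaming metrics;
# MilliSecondsBehindSource is an attribute of this bean.
STREAMING_MBEAN = "debezium.postgres:type=connector-metrics,context=streaming,server={topic_prefix}"

def connectors_behind(lag_ms_by_connector: dict, threshold_ms: int) -> list:
    """Return connectors whose MilliSecondsBehindSource exceeds the threshold."""
    return sorted(name for name, lag in lag_ms_by_connector.items() if lag > threshold_ms)
```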

Most teams start with basic connector status checks and learn over time which metrics actually matter. By the time the monitoring is thorough, they have already experienced at least one incident that could have been caught earlier.

Offset Management

Kafka Connect stores connector offsets in a Kafka topic (by default, connect-offsets). These offsets record where each connector has read up to in the source database’s transaction log. If those offsets are corrupted or lost, the connector does not know where to resume. Depending on your configuration, it may re-snapshot the entire database or simply fail to start.

Manual offset management - resetting a connector’s offset to a specific binlog position or LSN - requires writing directly to the offsets topic with the correct key format. The key format is a JSON structure that varies by connector type, and getting it wrong means the connector either ignores your override or starts from the wrong position. There is no built-in CLI tool for this in Kafka Connect. You use kafka-console-producer to write a tombstone record and then a new offset record, and you need to get the partition assignment right.

This is the kind of operation that is well-documented in blog posts and GitHub issues, but every team discovers it under pressure when a connector needs to be manually repositioned after an incident. Doing it correctly requires understanding both Kafka Connect’s offset storage model and Debezium’s position tracking format for the specific database you are using.
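
To make the shape concrete, here is a sketch of building the key and value for a Debezium MySQL connector. The field names match that connector's offset format, but treat the whole procedure as something to verify against the actual records in your own offsets topic before writing anything.

```python
import json

def mysql_offset_record(connector: str, topic_prefix: str,
                        binlog_file: str, binlog_pos: int) -> tuple:
    """Key and value to publish to the connect-offsets topic to reposition
    a Debezium MySQL connector. The key is a JSON array of the connector
    name and its source partition; the value is the source offset. Other
    connectors use different value fields (e.g. an LSN for PostgreSQL)."""
    key = json.dumps([connector, {"server": topic_prefix}])
    value = json.dumps({"file": binlog_file, "pos": binlog_pos})
    return key, value

# Publish with kafka-console-producer (parse.key=true) while the connector
# is stopped, then recreate or restart the connector so it re-reads offsets.
```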

Error Messages That Do Not Help

Debezium’s error messages are a known pain point. When a connector task fails, the error you see in the Kafka Connect REST API is often a Java stack trace with a root cause buried several levels deep. Some common examples:

  • org.apache.kafka.connect.errors.ConnectException: An exception occurred in the change event producer. This connector will be stopped. - This tells you almost nothing. The actual cause could be a database connection failure, a replication slot issue, a serialization error, or a dozen other things. You need to dig into the Connect worker logs to find the real cause.
  • io.debezium.DebeziumException: Could not execute heartbeat action - Usually means the connector lost its database connection, but could also indicate permission issues on the heartbeat table.
  • org.apache.kafka.connect.errors.DataException: Failed to serialize Avro data from topic - Schema Registry rejected the message, probably because a schema change violated the compatibility setting. The fix depends entirely on what the schema change was and how you want to handle it.

Reading Debezium logs effectively is a skill that takes time to develop. The signal-to-noise ratio is low at DEBUG level (which is where the useful information lives), and INFO level often does not include enough context to diagnose issues. Teams end up building internal runbooks that map common error patterns to root causes and fixes.
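
A runbook can start as something as small as a pattern table. This is an illustrative sketch, not Debezium tooling - the patterns are fragments of the errors quoted above, and the suggested actions are first diagnostic steps, not fixes:

```python
import re

# Sketch of the kind of internal runbook teams end up building: error
# patterns from Connect worker logs mapped to a first diagnostic step.
RUNBOOK = [
    (re.compile(r"Could not execute heartbeat action"),
     "Check database connectivity and permissions on the heartbeat table."),
    (re.compile(r"Failed to serialize Avro data"),
     "Diff the latest registered schema against the event; look for recent DDL."),
    (re.compile(r"exception occurred in the change event producer"),
     "Generic wrapper - read the worker log for the root cause several frames down."),
]

def triage(log_line: str) -> str:
    """Return the first matching runbook action for a log line."""
    for pattern, action in RUNBOOK:
        if pattern.search(log_line):
            return action
    return "Unknown pattern - escalate and add it to the runbook."
```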

The Full Stack You Are Actually Managing

When people say they are “running Debezium,” what they are actually managing is:

  • Apache Kafka (3+ brokers for production) - message storage and distribution
  • ZooKeeper or KRaft (3 nodes) - cluster coordination
  • Kafka Connect (2+ workers) - connector runtime
  • Schema Registry (1+ instances) - schema management
  • Debezium connectors - the actual CDC logic
  • Monitoring stack - JMX exporters, dashboards, alerting rules

That is six systems minimum, each with its own operational requirements, upgrade cadence, and failure modes. The Kafka ecosystem is powerful precisely because it is composed of focused, composable components. But composition has a cost, and that cost is operational complexity.

For a team with existing Kafka expertise, this stack is manageable. For a team that started with “we need to get database changes into our warehouse,” it is a steep climb.

When Self-Managed Debezium Makes Sense

There are legitimate reasons to run Debezium yourself. If your organization has a strict open-source-only policy, self-managed Debezium is the best CDC tool you can run. If you have an existing Kafka team that already manages Kafka and Kafka Connect for other workloads, adding Debezium connectors to that infrastructure is incremental rather than greenfield. If you need highly custom connector configurations or modifications that a managed platform would not support, running your own gives you full control.

Some teams also use Debezium as the foundation for event-driven architectures where Kafka is the central nervous system, not just a transport layer for CDC. In that context, the Kafka infrastructure serves multiple purposes and the operational cost is amortized across many use cases.

The question is whether your team’s time is best spent managing this infrastructure or using the data it produces.

The Managed Alternative

Managed CDC platforms exist specifically to collapse this operational stack. The idea is straightforward: you point the platform at your source database and your target destination, and the platform handles everything in between - Kafka, Connect, Schema Registry, connector lifecycle, monitoring, snapshot orchestration, and failure recovery.

Streamkap takes this approach. Under the hood, it runs Debezium connectors on managed Kafka Connect infrastructure backed by managed Kafka. Replication slot management, snapshot handling, schema evolution, connector restarts, and monitoring are all handled by the platform. You configure a source and a destination through the UI or API, and data flows.

This does not mean managed CDC is the right choice for everyone. If you need Kafka as a shared platform for multiple teams and workloads, managing it yourself (or using a managed Kafka service) may make more sense. But if your goal is to get database changes into a destination reliably and you do not want to become a Kafka operations team, the managed path eliminates the operational surface area described in this article.

The operational reality of self-managed Debezium is not a reason to avoid CDC. It is a reason to be honest about what you are taking on and to choose the deployment model that matches your team’s capacity and priorities. Debezium is excellent software. Running it well in production is a job.