
Engineering

February 25, 2026

9 min read

CDC Soft Deletes and Tombstones: Handling Deletions in Streaming Pipelines

Learn how to handle database deletions in CDC streaming pipelines. Implement soft deletes, tombstone records, and delete propagation for data warehouses and analytics.

TL;DR:

  • When a row is deleted from the source database, CDC produces a delete event - but most analytics destinations are append-only or handle deletes differently than the source.
  • The two main patterns are soft deletes (marking rows as deleted with a flag column and timestamp) and hard deletes (physically removing the row from the destination).
  • Tombstone records in Kafka (null-value messages) serve as delete markers for log-compacted topics, ensuring downstream consumers know a key has been removed.

Every database operation falls into one of four categories: insert, update, delete, and read. Change Data Capture handles the first three by reading the database’s transaction log and producing events for each mutation. Inserts and updates are relatively straightforward - you have a row, it has columns, you write it to the destination. Deletes are different. A delete is the absence of data, and representing absence in a system designed to move data from one place to another turns out to be one of the most nuanced problems in data engineering.

The difficulty is not in detecting the delete. Modern CDC engines read delete events from the transaction log just as reliably as they read inserts and updates. The difficulty is in what happens after the delete is detected. The source database simply removes the row. But the destination - a data warehouse, a search index, a Kafka topic - may not support the same kind of removal. It may be append-only. It may need the row to remain for historical reporting. It may need a special marker to signal that the row no longer exists. Getting this right is essential for data correctness, compliance, and operational reliability.

This guide walks through the mechanics of CDC delete handling: how deletes are captured, the patterns for propagating them, and the trade-offs you need to understand when configuring your pipelines.

How CDC Captures Deletes

When a row is deleted from a source database, the database engine writes a delete entry to its transaction log - the WAL in PostgreSQL, the binlog in MySQL, the oplog in MongoDB. The CDC connector reading that log produces a delete event that contains, at minimum, the primary key of the deleted row.

Depending on the source database and the CDC connector’s configuration, the delete event may also include a before image: the full set of column values the row contained immediately prior to deletion. This is valuable because it allows downstream systems to know what was deleted, not just which key was deleted.

A typical CDC delete event in a Debezium-style envelope looks like this:

{
  "op": "d",
  "before": {
    "id": 4821,
    "email": "jane@example.com",
    "name": "Jane Doe",
    "created_at": "2024-03-15T10:22:00Z"
  },
  "after": null,
  "source": {
    "table": "users",
    "ts_ms": 1708892400000
  }
}

The "op": "d" field marks this as a delete. The before object contains the last known state of the row. The after field is null because the row no longer exists. This event is the starting point for every delete-handling strategy downstream.

Soft Deletes

Soft deletes are the most common pattern for analytics and data warehousing destinations. Instead of removing the row from the destination table, the pipeline marks it as deleted by setting metadata columns.

The two standard columns are:

  • _deleted - a boolean flag set to true when the row is deleted
  • _deleted_at - a timestamp recording when the deletion occurred

After a soft delete, the destination table might look like this:

SELECT id, email, name, _deleted, _deleted_at
FROM users
WHERE id = 4821;

-- Result:
-- id: 4821, email: jane@example.com, name: Jane Doe,
-- _deleted: true, _deleted_at: 2026-02-25T14:30:00Z

The row is still physically present. Analysts who need the full history - including deleted records - can query the table without any special handling. Analysts who want only active records add a filter:

SELECT * FROM users WHERE _deleted = false;
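The pipeline-side mechanics can be sketched in a few lines of Python, here using an in-memory SQLite table as a stand-in for the warehouse. Column names follow the article; the helper function is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE users (
    id INTEGER PRIMARY KEY, email TEXT, name TEXT,
    _deleted INTEGER DEFAULT 0, _deleted_at TEXT)""")
conn.execute(
    "INSERT INTO users (id, email, name) VALUES (4821, 'jane@example.com', 'Jane Doe')")

def apply_soft_delete(conn, pk, deleted_at):
    # An UPDATE, not a DELETE: the row stays physically present
    conn.execute(
        "UPDATE users SET _deleted = 1, _deleted_at = ? WHERE id = ?",
        (deleted_at, pk))

apply_soft_delete(conn, 4821, "2026-02-25T14:30:00Z")

row = conn.execute(
    "SELECT id, _deleted, _deleted_at FROM users WHERE id = 4821").fetchone()
active = conn.execute(
    "SELECT COUNT(*) FROM users WHERE _deleted = 0").fetchone()[0]
# row survives with the flag set; the active count drops to 0
```

The key design point is visible in the helper: a source delete becomes a destination UPDATE, so the destination never loses the row.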

Advantages of Soft Deletes

Soft deletes preserve the complete history of every entity. This is critical for audit trails, time-travel queries, and scenarios where you need to understand what the data looked like at a specific point in time. They also make recovery straightforward: if a row was deleted accidentally, the data is still in the destination and can be “undeleted” by resetting the flag.

Soft deletes are also simpler to implement at the destination. The pipeline issues an UPDATE (or UPSERT) rather than a DELETE, which is a less disruptive operation for most analytical databases.

Disadvantages of Soft Deletes

The primary downside is query complexity. Every query that should exclude deleted rows must include a WHERE _deleted = false filter. If this is forgotten - and it will be forgotten - reports will include data from entities that no longer exist in the source. Over time, the accumulation of soft-deleted rows also increases storage costs and can degrade query performance, particularly on large tables with high churn.

Hard Deletes

Hard deletes physically remove the row from the destination. When the CDC pipeline receives a delete event, it executes a DELETE or MERGE statement against the destination table to remove the corresponding row.

-- Direct DELETE
DELETE FROM users WHERE id = 4821;

-- Or via MERGE (common in warehouse patterns)
MERGE INTO users AS target
USING staged_changes AS source
ON target.id = source.id
WHEN MATCHED AND source._op = 'DELETE' THEN DELETE;

Hard deletes produce a destination table that is a faithful mirror of the source: if a row does not exist in the source, it does not exist in the destination. This simplifies querying because there are no deleted rows to filter out, and it keeps storage costs in check.

The trade-off is that you lose history. Once a row is hard-deleted, the destination has no record that it ever existed. If you need to recover deleted data or answer questions about historical state, you need a separate mechanism - a data lake archive, a slowly changing dimension, or a backup.

Hard deletes also impose requirements on the destination. The database must support efficient point deletes by primary key. Append-only systems, or systems where deletes are expensive operations, are poor candidates for hard delete mode.
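For contrast, here is a sketch of the hard-delete path. SQLite has no MERGE statement, so the staged-changes pattern from the warehouse example is emulated with a DELETE driven by a staging table; table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE staged_changes (id INTEGER, _op TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(4821, "jane@example.com"), (4822, "bob@example.com")])
conn.execute("INSERT INTO staged_changes VALUES (4821, 'DELETE')")

# Equivalent of: MERGE ... WHEN MATCHED AND source._op = 'DELETE' THEN DELETE
conn.execute("""DELETE FROM users
                WHERE id IN (SELECT id FROM staged_changes WHERE _op = 'DELETE')""")

remaining = [r[0] for r in conn.execute("SELECT id FROM users ORDER BY id")]
# user 4821 leaves no trace; only 4822 remains
```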

Tombstone Records in Kafka

When CDC events flow through Kafka before reaching their final destination, delete handling has an additional layer: tombstone records. A tombstone is a Kafka message with a valid key and a null value. It is Kafka’s native mechanism for signaling that a key has been removed.

Tombstones interact with Kafka’s log compaction feature. In a compacted topic, Kafka retains only the most recent message for each key. When a tombstone is encountered during compaction, Kafka removes all prior messages for that key and eventually removes the tombstone itself (after a configurable retention period controlled by delete.retention.ms).

Key: user-4821  Value: {"id": 4821, "email": "jane@example.com", ...}   // Latest state
Key: user-4821  Value: null                                              // Tombstone

After compaction, both messages are removed. Any consumer that reads the topic from the beginning will never see user 4821 - which is the correct behavior for a key that has been deleted from the source.
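The compaction semantics can be simulated in a few lines: keep only the latest value per key, then drop keys whose latest value is a tombstone. This is a simplification (real compaction is segment-based and retains tombstones for delete.retention.ms before removing them), but the end state is the same:

```python
def compact(messages):
    """messages: (key, value) pairs in log order; value None = tombstone."""
    latest = {}
    for key, value in messages:
        latest[key] = value            # later offsets win
    # After the tombstone retention period elapses, tombstones vanish too
    return {k: v for k, v in latest.items() if v is not None}

log = [
    ("user-4821", {"id": 4821, "email": "jane@example.com"}),
    ("user-5000", {"id": 5000, "email": "sam@example.com"}),
    ("user-4821", None),               # tombstone for user 4821
]
state = compact(log)
# state contains only user-5000; user-4821 is gone entirely
```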

Producer Configuration

CDC connectors that produce tombstones typically emit two messages for each delete event: first, the delete envelope (with "op": "d" and the before image), and second, a tombstone (same key, null value). This two-message pattern ensures that consumers processing the stream in real time receive the full delete context, while consumers reading a compacted topic after the fact see the correct final state.
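The two-message pattern looks like this in a Python sketch. The message shape and function name are illustrative, not a specific producer client's API:

```python
def messages_for_delete(key, before_image, ts_ms):
    """Build the pair of Kafka messages a connector emits per source delete."""
    envelope = {
        "op": "d",
        "before": before_image,
        "after": None,
        "source": {"ts_ms": ts_ms},
    }
    return [
        (key, envelope),   # full delete context for real-time consumers
        (key, None),       # tombstone, same key, for log compaction
    ]

msgs = messages_for_delete(
    "user-4821", {"id": 4821, "email": "jane@example.com"}, 1708892400000)
# msgs[0] carries the before image; msgs[1] is the tombstone
```

Ordering matters: the envelope must precede the tombstone, or a compacted topic could briefly retain the delete envelope as the "latest" state for the key.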

The tombstone behavior is usually controlled by a connector-level configuration:

# Debezium connector configuration
tombstones.on.delete=true
delete.handling.mode=rewrite   # Adds __deleted field to the value

The delete.handling.mode=rewrite option (an option of Debezium's ExtractNewRecordState transform, set in the connector configuration) adds a __deleted boolean to the message value, which allows sink connectors to implement soft deletes at the destination without relying on null-value detection.

Destination-Specific Patterns

Each destination has its own mechanics for processing deletes, and your pipeline configuration should account for these differences.

Snowflake handles deletes through MERGE statements. Streamkap stages incoming CDC events and executes a MERGE INTO ... WHEN MATCHED AND _op = 'DELETE' THEN DELETE (for hard deletes) or WHEN MATCHED AND _op = 'DELETE' THEN UPDATE SET _deleted = true (for soft deletes). Snowflake’s micro-partition architecture makes point deletes reasonably efficient, though very high delete volumes can increase the number of micro-partitions that need rewriting.

BigQuery uses a similar MERGE pattern. BigQuery’s columnar storage means that deletes are not instant - they are applied during the next storage optimization pass. For soft deletes, BigQuery is straightforward: the _deleted and _deleted_at columns are updated via the merge.

ClickHouse with ReplacingMergeTree handles deletes differently. Rather than issuing a DELETE statement, the pipeline inserts a new version of the row with a _deleted or _sign column set to indicate deletion. ClickHouse’s background merge process then collapses duplicate rows, and queries use FINAL or filtering to exclude deleted entries:

SELECT * FROM users FINAL WHERE _deleted = 0;
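The collapse that ClickHouse performs during background merges (and that FINAL forces at query time) can be sketched as "keep the highest version per key, then filter the deleted flag". Row shapes and the version column are illustrative:

```python
rows = [
    # (id, version, _deleted)
    (4821, 1, 0),          # original insert
    (5000, 1, 0),
    (4821, 2, 1),          # the "delete" arrives as a newer row version
]

def read_final(rows):
    """Emulate SELECT * FROM users FINAL WHERE _deleted = 0."""
    latest = {}
    for id_, version, deleted in rows:
        if id_ not in latest or version > latest[id_][1]:
            latest[id_] = (id_, version, deleted)
    return [r for r in latest.values() if r[2] == 0]

visible = read_final(rows)
# only id 5000 remains visible; 4821's deleted version shadows the original
```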

Elasticsearch supports delete-by-ID natively. When a CDC delete event arrives, the pipeline issues a DELETE /users/_doc/4821 request. Elasticsearch marks the document as deleted and removes it during the next segment merge. This is the closest to a true hard delete in the destination ecosystem.

GDPR and Right to Erasure

The EU’s General Data Protection Regulation includes Article 17, the “right to erasure,” which requires organizations to delete personal data upon request. In this context, soft deletes are explicitly not sufficient. A soft-deleted row that still contains an email address, name, or other personally identifiable information does not satisfy the erasure requirement. The data must be physically removed or irreversibly anonymized.

This means that pipelines handling personal data often need a dual strategy: soft deletes for analytical purposes (preserving row counts, aggregates, and historical trends) combined with a hard delete or anonymization pass for compliance.

A practical approach is to run soft deletes as the default mode for analytics destinations, and then execute a periodic compliance job that identifies rows where _deleted = true and _deleted_at is older than the legally required response window, and either hard-deletes those rows or replaces PII fields with anonymized values.

-- Anonymize soft-deleted rows older than 30 days
UPDATE users
SET email = 'REDACTED', name = 'REDACTED'
WHERE _deleted = true AND _deleted_at < DATEADD(day, -30, CURRENT_TIMESTAMP());

Practical Example: User Table with Dual Delete Handling

Consider a users table that feeds both an analytics warehouse and a compliance-sensitive operational store. The CDC pipeline captures a delete for user 4821.

For the analytics destination (Snowflake), the pipeline applies a soft delete. The row remains with _deleted = true and _deleted_at = '2026-02-25T14:30:00Z'. Analysts can still count total users ever registered, calculate churn rates, and run cohort analyses that include departed users.

For the compliance destination (a downstream service database), the pipeline applies a hard delete. The row is physically removed. A GDPR audit can confirm that no personal data for user 4821 exists in this system.

Streamkap supports this pattern by allowing you to configure delete handling mode independently per destination. The same CDC stream from the source can drive soft deletes in one destination and hard deletes in another, without requiring separate pipelines or custom transformation logic.

Cascading Deletes

In relational databases, deleting a parent row can trigger cascading deletes on child rows via foreign key constraints. For example, deleting a customer may cascade to delete all of that customer’s orders, addresses, and payment_methods.

CDC captures each of these deletes as separate events in the transaction log. However, there is no inherent grouping or ordering guarantee across tables when these events are published to Kafka topics. The orders delete might arrive at the destination before the customer delete, or vice versa.

This creates two challenges:

  1. Referential integrity at the destination: If the destination enforces foreign keys (rare in analytical databases, but common in operational replicas), deletes must be applied in the correct order - children before parents.

  2. Completeness: If the pipeline processes the customer delete but misses or delays one of the orders deletes, the destination ends up with orphaned child rows that reference a parent that no longer exists.

The most practical approach is to treat each table’s CDC stream independently and rely on the destination’s query layer to handle relationships. In analytical databases without enforced foreign keys, this is usually sufficient. For operational replicas with constraints, you may need to implement ordering logic or temporarily defer constraint checking during delete processing.
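When delete ordering is required, children-before-parents sequencing is a topological sort of the foreign-key graph. A minimal sketch using Python's standard-library graphlib, with table names from the customer example above:

```python
from graphlib import TopologicalSorter

# Map each parent table to the child tables that reference it. Children must
# be deleted before the parent, so they are listed as predecessors.
children_of = {
    "customers": {"orders", "addresses", "payment_methods"},
}

order = list(TopologicalSorter(children_of).static_order())
# every child table appears before "customers" in the delete order
```

The same graph, inverted, gives the correct ordering for applying inserts (parents before children).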

Monitoring Delete Rates

Deletes deserve dedicated monitoring in any CDC pipeline. Unexpected spikes in delete volume can indicate application bugs (a runaway cleanup job), security incidents (malicious data destruction), or migration activities that were not communicated to the data team.

Key metrics to track include:

  • Delete rate per table per minute: Baseline this during normal operations and alert on deviations beyond two standard deviations.
  • Delete-to-insert ratio: A sudden increase in this ratio often signals a problem. Under normal operation, most tables have a relatively stable ratio.
  • Soft delete backlog: If you are running soft deletes, monitor the count of rows where _deleted = true. A continuously growing backlog may indicate that your compliance anonymization job is not running or is falling behind.
  • Tombstone lag in Kafka: If tombstone records are not being produced (due to a connector misconfiguration), compacted topics will retain stale data for deleted keys indefinitely.
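The two-standard-deviation alert on per-minute delete counts can be sketched directly; the baseline window and threshold below are illustrative:

```python
from statistics import mean, stdev

def is_anomalous(baseline_counts, current_count, n_sigma=2.0):
    """Flag a per-minute delete count outside n_sigma of the baseline."""
    mu = mean(baseline_counts)
    sigma = stdev(baseline_counts)
    return abs(current_count - mu) > n_sigma * sigma

baseline = [12, 15, 11, 14, 13, 12, 16, 14]   # deletes/minute, normal load
normal = is_anomalous(baseline, 15)            # within the band
spike = is_anomalous(baseline, 400)            # e.g. a runaway cleanup job
```

In practice the baseline should be computed per table over a rolling window, since delete rates vary widely between tables.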

Setting up alerts on these metrics is straightforward with any modern observability stack. The investment is small relative to the cost of discovering, weeks after the fact, that a batch delete wiped out critical data and no one noticed.


Choosing a Delete Strategy

Delete handling is where the simplicity of CDC meets the complexity of real-world data architectures. There is no single correct approach - the right strategy depends on your destination’s capabilities, your compliance obligations, your analytical requirements, and your tolerance for query complexity.

Soft deletes preserve history and simplify recovery at the cost of query discipline. Hard deletes keep destinations clean and compliant at the cost of losing historical data. Tombstones ensure that Kafka’s compacted topics reflect the true state of the source. Most production pipelines use a combination of all three, configured per destination and per use case.

The important thing is to make a deliberate choice. Leaving delete handling to chance - or discovering your pipeline’s behavior only when a GDPR auditor asks where the deleted user’s data went - is a risk that no data team should accept.

Need configurable delete handling across all your destinations? Start a free trial of Streamkap and configure soft deletes, hard deletes, or both - per pipeline, per destination - in minutes.