
Engineering

February 25, 2026


CDC to Apache Iceberg: Building a Real-Time Lakehouse

Stream database changes to Apache Iceberg tables for a real-time lakehouse. Learn how CDC and Iceberg's row-level operations enable ACID-compliant data lake analytics.

TL;DR:
  • Apache Iceberg's row-level delete and merge operations make it the ideal CDC target. Unlike Hive or basic Parquet, Iceberg can efficiently apply UPDATE and DELETE events from CDC streams.
  • The CDC-to-Iceberg pattern creates a real-time lakehouse where data is always fresh, ACID-compliant, and queryable by any engine (Spark, Trino, Flink, Snowflake).
  • Equality deletes and merge-on-read vs copy-on-write are key architectural decisions that affect both write throughput and read performance.

For years, organizations have treated their data lake and data warehouse as separate systems with separate pipelines, separate schemas, and separate budgets. The lakehouse architecture promises to end that split by putting structured, ACID-compliant tables directly on object storage. But the promise only works if the data inside those tables is fresh. Batch loads that run every few hours defeat the purpose. What you really need is a continuous stream of changes flowing from your operational databases into your lakehouse tables in near real time.

That is exactly what Change Data Capture (CDC) to Apache Iceberg delivers. CDC captures every INSERT, UPDATE, and DELETE from your source database and streams those events forward. Iceberg receives them and applies row-level mutations with full ACID guarantees. The result is a data lake that behaves like a warehouse: always current, always consistent, and queryable by any engine you choose.

Why Apache Iceberg Is the Right Target for CDC

Older table formats like Hive were designed for append-only workloads. You could add new partitions of data, but updating or deleting individual rows meant rewriting entire partitions. For CDC, where every database commit might contain a mix of inserts, updates, and deletes, this is a non-starter.

Iceberg was designed differently. Its core features align almost perfectly with what CDC pipelines demand:

  • Row-level deletes and updates. Iceberg tracks changes at the file and row level, not the partition level. A single DELETE event from your source database does not force a full partition rewrite.
  • ACID transactions. Every commit to an Iceberg table is atomic. Readers never see partially applied CDC batches. If a write fails midway, the table state remains consistent.
  • Schema evolution. When your source database runs an ALTER TABLE ADD COLUMN, Iceberg can absorb that change without rewriting existing data files. Old files simply return null for the new column.
  • Time travel. Every commit creates a new snapshot. You can query the table as it existed at any point in the past, which is invaluable for debugging CDC pipelines or auditing data lineage.
  • Hidden partitioning. Iceberg derives partition values from column data using transform functions. Consumers do not need to know the partitioning scheme to write efficient queries.

These properties make Iceberg the natural destination for CDC streams, and they are the reason the CDC-to-Iceberg pattern has become a foundation of modern lakehouse architectures.

Architecture: How CDC Gets to Iceberg

There are two broad patterns for delivering CDC data into Iceberg tables.

The DIY Pipeline

In the self-managed approach, you wire together several open-source components:

  1. CDC source connector (Debezium) reads the database transaction log and produces change events.
  2. Kafka acts as the durable transport layer, buffering events and decoupling producers from consumers.
  3. Stream processor (Flink or Spark Structured Streaming) consumes from Kafka, interprets the CDC envelope, and writes to Iceberg using MERGE INTO or upsert semantics.

PostgreSQL WAL → Debezium → Kafka → Flink SQL → Iceberg (S3)

This gives you full control, but you are responsible for managing Kafka clusters, Flink job lifecycles, schema registry, checkpointing, and Iceberg table maintenance.
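The stream processor's first job is interpreting the CDC envelope. As a rough sketch, here is how a consumer might map Debezium's standard envelope fields (op, before, after) to write actions; real pipelines also handle schema payloads, tombstones, and transaction markers:

```python
# Minimal sketch of interpreting a Debezium change-event envelope.
# Field names ("op", "before", "after") follow Debezium's standard format;
# error handling and schema metadata are omitted.

def interpret(event: dict) -> tuple[str, dict]:
    """Map a Debezium envelope to an (action, row) pair for an Iceberg writer."""
    op = event["op"]
    if op in ("c", "r"):          # create / snapshot read -> insert
        return ("INSERT", event["after"])
    if op == "u":                  # update -> upsert the new row image
        return ("UPSERT", event["after"])
    if op == "d":                  # delete -> only the old row image exists
        return ("DELETE", event["before"])
    raise ValueError(f"unknown op: {op}")

print(interpret({"op": "u", "before": {"order_id": 1, "total": 10},
                 "after": {"order_id": 1, "total": 12}}))
# -> ('UPSERT', {'order_id': 1, 'total': 12})
```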

The Managed Pipeline

Platforms like Streamkap collapse the entire pipeline into a managed service. You configure a source (PostgreSQL, MySQL, MongoDB) and a destination (Iceberg on S3, GCS, or ADLS), and the platform handles CDC extraction, schema mapping, delivery semantics, and Iceberg write operations. This eliminates the operational burden of running Kafka and Flink yourself while still giving you an open Iceberg table that any engine can query.

Merge-on-Read vs Copy-on-Write

When Iceberg applies an UPDATE or DELETE, it has two strategies for modifying existing data. The choice between them is the single most consequential architectural decision for CDC workloads.

Copy-on-Write (COW)

When a row is updated, Iceberg reads the entire data file that contains that row, rewrites it with the modification applied, and commits the new file. The old file is eventually garbage collected.

  • Reads are fast. Data files are always fully materialized with no deletes to reconcile.
  • Writes are slow. Every update or delete touches potentially large Parquet files. For CDC workloads with frequent small changes, write amplification is severe.

Merge-on-Read (MOR)

Instead of rewriting data files, Iceberg writes a separate delete file that marks which rows are invalidated. At read time, the engine merges delete files with data files to produce the correct result.

  • Writes are fast. A CDC batch that updates 1,000 rows across 50 data files only writes small delete files, not 50 new data files.
  • Reads are slightly slower. The engine must reconcile delete files on every scan. This overhead is manageable when delete files are kept small through regular compaction.

For CDC workloads, merge-on-read is almost always the right default. The write path is the hot path in a CDC pipeline, and MOR keeps it lightweight. You then schedule compaction jobs to periodically consolidate delete files and rewrite data files, bringing read performance back in line.

-- Set merge-on-read for a table
ALTER TABLE lakehouse.orders SET TBLPROPERTIES (
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode'  = 'merge-on-read'
);
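To see why write amplification matters, take the example above of a batch updating 1,000 rows across 50 data files. A back-of-envelope comparison, with file sizes that are illustrative assumptions rather than measurements:

```python
# Back-of-envelope write amplification for one CDC batch touching 50 data
# files of ~512 MiB each. The sizes are illustrative assumptions.

data_file_bytes = 512 * 1024 * 1024      # typical target data file size
files_touched = 50
delete_file_bytes = 64 * 1024            # a small delete file per data file

cow_bytes = files_touched * data_file_bytes    # COW rewrites every touched file
mor_bytes = files_touched * delete_file_bytes  # MOR writes only delete files

print(f"COW writes ~{cow_bytes / 2**30:.0f} GiB, MOR writes ~{mor_bytes / 2**20:.1f} MiB")
# COW writes ~25 GiB, MOR writes ~3.1 MiB
```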

Equality Deletes vs Positional Deletes

Iceberg supports two kinds of delete files, and which one your pipeline produces affects both correctness and performance.

Positional Deletes

A positional delete file says: “In data file X, delete the row at position 47.” This is precise and efficient to apply, but it requires that the writer knows exactly which data file and row position the target row lives in. That means the writer must first read the data file to find the row, which is expensive in streaming scenarios.

Equality Deletes

An equality delete file says: “Delete all rows where order_id = 12345.” The writer does not need to know which data file contains the row. It simply writes a delete file with the key columns and values, and the reader resolves it at scan time.

For CDC pipelines, equality deletes are the natural fit. A CDC DELETE event contains the primary key of the deleted row, but not its physical location in Iceberg’s data files. The pipeline writes an equality delete keyed on the primary key, and Iceberg handles the rest.

The trade-off is that equality deletes are more expensive to apply at read time, since the engine must check every row against the delete predicate. This is why compaction is essential: it resolves equality deletes into rewritten data files, eliminating the read-time overhead.
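A toy in-memory sketch of that read-time reconciliation, assuming a delete file keyed on the primary key:

```python
# Sketch of how a reader resolves equality deletes at scan time: every row
# is checked against the delete predicates. In-memory and illustrative only.

rows = [
    {"order_id": 12344, "total": 10.0},
    {"order_id": 12345, "total": 25.0},   # deleted upstream by CDC
    {"order_id": 12346, "total": 7.5},
]
# An equality delete file keyed on the primary key column:
equality_deletes = [{"order_id": 12345}]

def survives(row, deletes):
    """Row survives unless it matches every key/value of some delete entry."""
    return not any(all(row.get(k) == v for k, v in d.items()) for d in deletes)

live = [r for r in rows if survives(r, equality_deletes)]
print([r["order_id"] for r in live])   # [12344, 12346]
```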

Applying CDC Changes with MERGE INTO

The standard pattern for applying CDC events to Iceberg is the MERGE INTO statement, supported by both Flink SQL and Spark SQL.

MERGE INTO lakehouse.orders AS target
USING cdc_stream AS source
ON target.order_id = source.order_id
WHEN MATCHED AND source.op = 'DELETE' THEN DELETE
WHEN MATCHED AND source.op = 'UPDATE' THEN UPDATE SET *
WHEN NOT MATCHED AND source.op != 'DELETE' THEN INSERT *;

Flink’s Iceberg sink also supports a native upsert mode that handles CDC semantics automatically when the table has a primary key defined:

CREATE TABLE lakehouse.orders (
  order_id BIGINT,
  customer_id BIGINT,
  total DECIMAL(10, 2),
  updated_at TIMESTAMP,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'iceberg',
  'catalog-name' = 'hive_prod',  -- placeholder; must match your configured catalog
  'catalog-type' = 'hive',
  'warehouse' = 's3://lakehouse/warehouse',
  'format-version' = '2',        -- row-level upserts require Iceberg v2 tables
  'write.upsert.enabled' = 'true'
);

Spark SQL

MERGE INTO lakehouse.orders AS target
USING staging_changes AS source
ON target.order_id = source.order_id
WHEN MATCHED AND source.cdc_op = 'd' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED AND source.cdc_op != 'd' THEN INSERT *;

In Spark, you typically micro-batch CDC events from Kafka into a staging table or DataFrame, then execute the MERGE on each batch. The foreachBatch pattern in Spark Structured Streaming is the standard approach.
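The merge semantics each micro-batch applies can be modeled in a few lines of plain Python. This is a toy in-memory stand-in, not PySpark; a real job would execute the MERGE via spark.sql() inside foreachBatch:

```python
# Toy in-memory model of what each micro-batch MERGE does: apply a batch of
# CDC events keyed by primary key. Illustrates semantics only.

def apply_batch(table: dict, batch: list[dict]) -> dict:
    for ev in batch:
        key = ev["order_id"]
        if ev["cdc_op"] == "d":
            table.pop(key, None)              # WHEN MATCHED AND op='d' THEN DELETE
        else:
            row = {k: v for k, v in ev.items() if k != "cdc_op"}
            table[key] = row                  # matched -> UPDATE, not matched -> INSERT
    return table

table = {1: {"order_id": 1, "total": 10.0}}
batch = [
    {"cdc_op": "u", "order_id": 1, "total": 12.0},
    {"cdc_op": "c", "order_id": 2, "total": 5.0},
    {"cdc_op": "d", "order_id": 1},
]
print(apply_batch(table, batch))   # {2: {'order_id': 2, 'total': 5.0}}
```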

Schema Evolution: Handling ALTER TABLE from the Source

One of the more painful aspects of CDC pipelines is dealing with schema changes. When someone adds a column to the source database, the CDC events start including that column, and the downstream table needs to absorb it.

Iceberg handles this gracefully. Adding a column to an Iceberg table does not rewrite any existing data files. Old Parquet files simply return null for the new column. The schema change is recorded in Iceberg’s metadata, and all subsequent writes include the new column.

-- Iceberg schema evolution
ALTER TABLE lakehouse.orders ADD COLUMN shipping_method STRING;

For fully automated schema propagation, you need the CDC pipeline itself to detect new columns in the event schema and issue the corresponding ALTER TABLE on the Iceberg side. Streamkap handles this automatically: when a source schema changes, the platform detects the new fields and evolves the Iceberg table schema before writing the first event that includes them.

In a DIY setup, you would typically use a Schema Registry (Confluent or Apicurio) to track schema versions in Kafka, and build logic in your Flink or Spark job to compare the incoming schema against the Iceberg table metadata and issue ALTER TABLE commands as needed.
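That comparison logic can be sketched as follows. The helper below is hypothetical and handles only added columns; type changes and drops need more careful treatment:

```python
# Sketch: compare an incoming event schema against the table's known columns
# and emit the ALTER TABLE statements a DIY pipeline would run. Add-only;
# a real implementation must also consider type promotion and renames.

def schema_diff(table: str, table_cols: dict, event_cols: dict) -> list[str]:
    stmts = []
    for name, sql_type in event_cols.items():
        if name not in table_cols:
            stmts.append(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
    return stmts

table_cols = {"order_id": "BIGINT", "total": "DECIMAL(10, 2)"}
event_cols = {"order_id": "BIGINT", "total": "DECIMAL(10, 2)",
              "shipping_method": "STRING"}

print(schema_diff("lakehouse.orders", table_cols, event_cols))
# ['ALTER TABLE lakehouse.orders ADD COLUMN shipping_method STRING']
```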

Partitioning Strategy for CDC Data

Iceberg’s hidden partitioning is a major advantage over Hive-style partitioning. Instead of requiring producers to write data into explicit partition directories, Iceberg applies transform functions to column values to derive partition membership.

For CDC data, a common strategy is to partition by a timestamp column using a daily or hourly transform:

CREATE TABLE lakehouse.orders (
  order_id BIGINT,
  customer_id BIGINT,
  total DECIMAL(10, 2),
  created_at TIMESTAMP
) PARTITIONED BY (days(created_at));

This keeps partition sizes manageable and gives time-range queries automatic partition pruning. Because CDC events arrive in roughly chronological order, writes naturally cluster into the latest partition, reducing the number of data files touched per commit.
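Under the hood, the days() transform maps each timestamp to whole days since the Unix epoch, so two rows from the same day always land in the same hidden partition. A minimal sketch of the transform itself:

```python
# Sketch of Iceberg's days() partition transform: a timestamp maps to whole
# days since the Unix epoch, which becomes the hidden partition value.

from datetime import datetime, timezone

def days_transform(ts: datetime) -> int:
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (ts - epoch).days

a = days_transform(datetime(2026, 2, 25, 9, 30, tzinfo=timezone.utc))
b = days_transform(datetime(2026, 2, 25, 23, 59, tzinfo=timezone.utc))
print(a == b)   # True: same day -> same partition
```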

Iceberg also supports partition evolution. If you later decide that daily partitions are too coarse and you want hourly partitions, you can change the partitioning scheme without rewriting existing data:

ALTER TABLE lakehouse.orders REPLACE PARTITION FIELD days(created_at) WITH hours(created_at);

Existing data files retain their original partition layout. Only new writes use the updated scheme. Iceberg’s query planner handles both layouts transparently.

Table Maintenance: Keeping the Lakehouse Healthy

A CDC pipeline writes continuously, which means Iceberg tables accumulate small data files, delete files, and snapshots rapidly. Without regular maintenance, query performance degrades. Three maintenance operations are essential.

Compaction (Rewrite Data Files)

Compaction merges small data files into larger ones and resolves delete files by rewriting affected data files without the deleted rows. This is the single most important maintenance task for CDC tables.

-- Spark: rewrite small files
CALL lakehouse.system.rewrite_data_files('lakehouse.orders');

-- With size targets
CALL lakehouse.system.rewrite_data_files(
  table => 'lakehouse.orders',
  options => map('target-file-size-bytes', '536870912')
);

For tables with heavy CDC traffic, run compaction every few hours or trigger it when the number of delete files exceeds a threshold.
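One way to drive that trigger is a simple threshold policy. The limits below are tuning assumptions for illustration, not Iceberg defaults:

```python
# Sketch of a threshold-based compaction trigger: compact when the
# delete-file count or small-file count crosses a limit. Thresholds are
# assumptions to tune per table, not Iceberg defaults.

def should_compact(delete_files: int, small_data_files: int,
                   max_deletes: int = 100, max_small_files: int = 200) -> bool:
    return delete_files > max_deletes or small_data_files > max_small_files

print(should_compact(delete_files=150, small_data_files=40))   # True
print(should_compact(delete_files=10, small_data_files=40))    # False
```

A scheduler would evaluate this per partition (using metadata from Iceberg's files and deletes metadata tables) and submit a rewrite_data_files call only where it returns True.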

Expire Snapshots

Every Iceberg commit creates a snapshot. Over time, these accumulate and bloat the metadata. Expiring old snapshots removes their metadata entries and makes the associated data files eligible for garbage collection.

CALL lakehouse.system.expire_snapshots(
  table => 'lakehouse.orders',
  older_than => TIMESTAMP '2026-02-18 00:00:00',
  retain_last => 100
);

Orphan File Cleanup

Failed writes or incomplete compaction jobs can leave orphaned files on object storage that are not referenced by any snapshot. Periodic cleanup reclaims this storage.

CALL lakehouse.system.remove_orphan_files(
  table => 'lakehouse.orders',
  older_than => TIMESTAMP '2026-02-22 00:00:00'
);

Query Engine Compatibility

One of Iceberg’s defining strengths is its engine-agnostic design. A CDC pipeline writing to Iceberg creates a single, open table that can be queried by virtually any modern analytics engine.

  • Apache Spark has first-class Iceberg support through the iceberg-spark-runtime library. Full read/write, MERGE INTO, and all maintenance procedures are available.
  • Trino (formerly PrestoSQL) includes a native Iceberg connector. It supports reads, writes, equality deletes, and time travel queries. Trino is a popular choice for interactive lakehouse queries.
  • Apache Flink supports Iceberg through the flink-iceberg module, enabling both batch and streaming reads/writes. This makes Flink uniquely suited to both writing CDC data into Iceberg and reading from Iceberg for downstream stream processing.
  • Snowflake can query Iceberg tables through an external catalog integration, letting you analyze CDC-sourced lakehouse data alongside native Snowflake tables without copying it.
  • Amazon Athena and Google BigQuery both support querying Iceberg tables on S3 and GCS respectively, making serverless analytics over CDC data straightforward.
  • Dremio provides a lakehouse engine optimized for Iceberg with sub-second query performance on properly maintained tables.

This engine diversity is a key reason to choose Iceberg as your CDC target. You are not locked into a single vendor or query engine. Your data stays in open Parquet files with open metadata, and any engine that speaks Iceberg can read it.

Putting It All Together

The CDC-to-Iceberg pattern is not just an incremental improvement over batch ETL into a data lake. It is a fundamentally different architecture that delivers warehouse-grade consistency on data lake infrastructure. Your operational databases generate a continuous stream of changes. Those changes flow through a CDC pipeline and land in Iceberg tables within seconds or minutes. Any engine in your stack can then query that data with full ACID guarantees, schema evolution support, and time travel.

The key decisions you need to make are: merge-on-read vs copy-on-write (choose MOR for CDC), equality deletes vs positional deletes (equality for streaming), and how aggressively to run compaction. Get those right, and you have a lakehouse that is genuinely real-time.

Whether you build the pipeline yourself with Debezium, Kafka, and Flink, or use a managed platform like Streamkap to handle the operational complexity, the destination is the same: an open, queryable, always-fresh lakehouse powered by CDC and Apache Iceberg.