Architecture & Patterns

10 min read

Iceberg Partition Evolution: Schema Changes Without Rewriting Your Data Lake

How to change your Iceberg partition strategy in a running production table: hidden partitioning, spec versioning, the step-by-step migration pattern, and the monitoring signals that warn of skew before it hits query latency.

Six months into running CDC from Postgres into Iceberg, a team had 90 days of data and a problem. Every table was partitioned by day(order_time). Their three most-expensive queries all filtered by customer_region. Each query scanned every region’s files for every day in the window, applied customer_region as a row-level predicate, and discarded the large majority of what it read. S3 costs climbed. Query p95 crept past 30 seconds.

In Iceberg, it was an ALTER statement. Understanding exactly why that’s possible is what separates a planned migration from an emergency one.

Why Partition Strategy Changes Matter in Streaming Ingestion

Partition strategy shapes both write cost and read performance, and the two pull in opposite directions. A time-based partition (day, month) works well for a continuous CDC write stream: each new batch of changes lands in the current time partition, access is sequential, and file accumulation is predictable. For point-in-time reads, it’s efficient. But once your most frequent queries filter on a second column that isn’t in the partition spec, those benefits erode quickly.

The typical failure pattern: a team picks day(event_time) at setup because it seems like the obvious choice. Six months later, analytics reports that regional dashboards take 40 seconds to load. The queries look like WHERE event_time > '2026-01-01' AND region = 'eu-west-1'. The time filter prunes months of data. The region filter runs as a predicate across every file in every matching day partition. For 120 days of data across four regions, that’s a scan over four times as many files as the query actually needs.

For streaming CDC workloads, this compounds faster than with batch. The write rate is continuous. A misaligned partition strategy doesn’t just produce slow queries at launch; it produces queries that get slower every day as the table grows.

With a format that embeds partition columns in physical file paths, correcting course means moving data. With Iceberg, the partition spec is metadata.

How Iceberg Supports Partition Evolution

Iceberg’s approach rests on two properties working together.

The first is hidden partitioning. In Hive and most earlier formats, partition columns are baked into the directory structure: country=us/year=2026/month=06/ is a real path prefix, not metadata. To change the partition scheme, you change that directory layout, which means moving or rewriting data files. Iceberg separates partition logic from the physical file layout. Partition values are derived from data columns through transforms, but the files themselves don’t encode partition values in their paths. Changing the partition spec doesn’t touch a single data file.

The second property is partition spec versioning. When you run an ALTER that modifies the partition spec, Iceberg records a new partition spec version in the table metadata. Old files retain their original spec. New writes use the current spec. Both coexist in the table, and the manifest tracks which spec applies to each file group.

Query engines that implement the full Iceberg spec handle a mixed-spec table transparently. A table with spec v1 (day(order_time)) covering six months of history and spec v2 (day(order_time) + identity(region)) covering recent data is readable in a single scan. The planner reads the manifest, identifies which spec applies to each file group, and constructs a query plan that prunes on both fields for spec-v2 files and on the time field only for spec-v1 files. No data is inaccessible.

The available partition transforms are identity, bucket(N), truncate(N), year, month, day, hour, and void. You can add a new field, remove an existing one, or replace a transform. All of these are metadata operations. Nothing writes to S3 until you explicitly rewrite data files.

One practical constraint worth confirming before you evolve: not every query engine fully implements partition spec evolution. Spark, Trino, and Flink have solid support. Older versions of some engines fall back to full table scans when they encounter a spec they don’t fully parse, which makes the evolution worse, not better. Before changing a table that multiple engines read, verify each consumer’s Iceberg library version against the multi-engine compatibility guide.

Step-by-Step Migration: From Partition Scheme A to B

The scenario: table cdc.orders partitioned by months(order_date). Query analysis shows most queries filter on both order_date and customer_region. Adding identity(customer_region) as a second partition field will let those queries prune on both dimensions.

Check the data distribution first

Before adding identity(customer_region), check how many distinct values the column has. If it has 300 distinct values and the table currently has 60 monthly partitions, the new spec creates up to 18,000 partition combinations. That’s file sprawl, and it compounds under continuous ingest. For high-cardinality columns, bucket(N, customer_region) is the safer choice: it hashes values into N buckets, controls partition count, and still enables pruning on equality predicates.

For a well-behaved customer_region column with 8-20 distinct values, identity works fine.

Add the partition field

Using the Iceberg Spark SQL extension:

ALTER TABLE catalog.cdc.orders
ADD PARTITION FIELD identity(customer_region);

This is a metadata commit. No data moves, no write lock is held. Ingest continues through the operation without interruption.

Confirm the change:

DESCRIBE EXTENDED catalog.cdc.orders;

Look for the new partition spec ID. You should see the incremented spec ID alongside the original spec definition.

Validate the query plan

Run EXPLAIN on a representative query before measuring execution times:

EXPLAIN SELECT * FROM catalog.cdc.orders
WHERE order_date > DATE '2026-05-01'
  AND customer_region = 'us-east-1';

For data written after the evolution, both partition fields should appear in the filter plan. For historical data, expect the planner to prune on order_date and apply customer_region as a row-level filter. That’s correct behavior, not a bug.

Ingest continues

If you’re using Streamkap’s Iceberg connector for CDC, new writes after the spec change land in the new partition structure automatically. No connector restart or reconfiguration required. Old files stay under the previous spec and remain fully queryable.

Compaction and Query Performance After Evolution

After the evolution, the table has two file populations: spec-v1 files (prunable on order_date only) and spec-v2 files (prunable on both fields). Queries filtering on customer_region get the full benefit only for new-spec data. For historical spec-v1 files, the engine still applies customer_region as a row-level predicate.

Whether to compact old files into the new spec depends entirely on your query patterns. For workloads that primarily scan recent data, historical partitions are rarely touched and compaction isn’t worth the cost. For analytical queries that regularly look back across a wide time range with region filters, rewriting old files pays off. A rough break-even point: if a query scans historical data more than twice a week and the region filter discards more than half of what it reads, compaction will recoup its cost within weeks.

Iceberg’s rewrite_data_files handles this without disrupting concurrent reads or writes. Scope it to a specific time range and the procedure uses the current spec automatically:

CALL catalog.system.rewrite_data_files(
  table => 'cdc.orders',
  where => 'order_date < DATE ''2026-04-01'''
);

Run this incrementally, one or two months at a time. A single all-at-once rewrite of six months of data competes with active ingest for S3 bandwidth and compute. Incremental runs keep peak cost predictable.

Iceberg’s optimistic concurrency makes the compaction safe to run alongside CDC writes. The procedure writes new files and atomically swaps them in via a snapshot commit. Concurrent writes don’t conflict unless they modify the same file groups, which is unlikely for time-range scoped compaction.

Monitoring and Rollback Posture

Watching for partition skew

The most common problem after evolution is skew: one partition accumulates far more files than the rest. This happens when the new partition field’s value distribution is uneven. After a few days of new-spec writes, check the distribution:

SELECT partition, count(*) AS file_count
FROM catalog.cdc.orders.files
GROUP BY partition
ORDER BY file_count DESC
LIMIT 20;

A healthy distribution shows file counts within a few times of each other across partitions in the same time window. If one value dominates, that partition will drive disproportionate compaction time and can become a read hotspot. The fix: evolve the spec again, switching from identity to bucket(N) on the problematic column.

Scan efficiency after evolution

Compare EXPLAIN ANALYZE output for your representative queries before and after evolution. The ratio of files scanned to total files should improve for queries that filter on the new partition field. If it doesn’t improve for new-spec data, the query engine may not be reading the partition spec correctly from the manifest. Verify the engine’s Iceberg library version before assuming a bug in the spec change.

Rollback options

Partition spec evolution isn’t reversible by rolling back a snapshot. The partition spec is a metadata-layer concern separate from the snapshot history; you can roll back a snapshot and still be on the new partition spec.

If the new scheme underperforms, running another ALTER removes or replaces the added field. New writes after that change use the reverted spec. Files written during the evolution window stay under spec v2 and remain queryable.

For a harder failure where you need to fully undo the evolution, the cleanest path is cloning the table at a pre-evolution snapshot and cutting over reads to the clone. That’s only possible while snapshots from the pre-evolution period still exist. Set snapshot retention to cover your incident detection window before touching the partition spec in production, and hold off on expireSnapshots until the new scheme is validated. Once those snapshots expire, that recovery path is gone.

Where to next?

Related resources

Architecture & Patterns June 8, 2026

Bursty Workloads and SLA-Compliant Autoscaling: SaaS vs BYOC Trade-Offs

Navigate autoscaling strategy for bursty CDC workloads with tight SLAs. Compare SaaS elasticity against reserved BYOC capacity, validate latency at production volume, and right-size destination throughput.

Architecture & Patterns March 17, 2026

Do AI Agents Need Kafka? When Managed Streaming Makes More Sense

AI agents need real-time event streams, but that doesn't mean you need to run Kafka yourself. Learn when self-managed Kafka makes sense for agent workloads and when a managed streaming platform is the better choice.

Architecture & Patterns March 12, 2026

Database Replication Patterns: Active-Active, CDC, and Beyond

A practical guide to database replication patterns — active-passive, active-active, CDC-based, snapshot, and multi-region. When to use each and common pitfalls.

Tell us where you're headed

Two quick details and we'll get you set up.

Loading…

Trusted by data teams at SpotOn, ShipMonk, Fleetio and more.