
Engineering

February 25, 2026

12 min read

CDC to Star Schema: Building Dimensional Models from Change Streams

How to transform CDC event streams into star schema dimensional models in real time. Covers fact table loading, dimension handling, and SCD patterns with Flink.

TL;DR:

  • CDC streams carry the raw changes needed to populate both fact and dimension tables in a star schema.
  • Fact table loading maps naturally to CDC inserts and updates on transactional tables.
  • Dimension tables require SCD logic: Type 1 for overwrites, Type 2 for history preservation.
  • Flink SQL can transform normalized CDC events into denormalized star schema structures in real time.

If you’ve spent time in the data warehouse world, the star schema needs no introduction. It is still the foundation of most analytical workloads: a central fact table surrounded by dimension tables, optimized for the kinds of aggregation and filtering that BI tools do all day long.

What has changed is how data gets into these schemas. The traditional pattern of nightly batch ETL jobs that extract from source systems, transform rows through staging tables, and load into the warehouse is being replaced by something faster. Change Data Capture (CDC) gives you a continuous stream of every insert, update, and delete from your transactional databases. Pair that with a stream processor like Apache Flink, and you can populate a star schema in real time instead of waiting for the next scheduled batch window.

This article walks through the mechanics of that transformation: how CDC events map to facts and dimensions, how to handle slowly changing dimensions in a streaming context, and what the Flink SQL looks like in practice.

Quick Refresher: Star Schema Structure

A star schema has two types of tables:

  • Fact tables store measurable events or transactions. An orders_fact table might contain order_id, customer_key, product_key, order_date_key, quantity, and total_amount. The numeric measures are what analysts aggregate. The keys reference dimensions.
  • Dimension tables store the descriptive context around those events. A customer_dim table holds customer_key, customer_name, email, city, state, and signup_date. Analysts filter and group by these attributes.

The “star” shape comes from the fact table sitting at the center with foreign keys radiating outward to each dimension. This denormalized layout trades storage efficiency for query speed, and every major warehouse engine is optimized for it.

The challenge with batch ETL has always been the T in ETL. Source databases are normalized, often to third normal form. The transformation step must join, denormalize, generate surrogate keys, and apply business logic before loading. When this runs once a day, the warehouse is always behind. When it runs continuously from CDC streams, the warehouse stays current.

How CDC Events Map to Facts

A CDC stream from a transactional orders table produces events that look roughly like this (using Debezium’s format):

{
  "op": "c",
  "before": null,
  "after": {
    "order_id": 50421,
    "customer_id": 1782,
    "product_id": 304,
    "quantity": 2,
    "total_amount": 79.98,
    "order_date": "2026-02-25T14:32:00Z"
  },
  "ts_ms": 1772029920000
}

The op field tells you the operation: c for create (insert), u for update, d for delete. For fact table loading, the mapping is straightforward:

  • Inserts (op: c) become new rows in the fact table. You look up the surrogate keys for each dimension (customer, product, date) and write the fact row.
  • Updates (op: u) either update the existing fact row or insert a new corrected row, depending on your warehouse’s merge strategy.
  • Deletes (op: d) are less common for transactional facts but might soft-delete (set a flag) or hard-delete the row.
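The op-to-action mapping above can be sketched as a small routing function. This is a minimal illustration of the pattern, not a Flink operator; the action names (`insert`, `upsert`, `soft_delete`) are hypothetical labels for whatever your sink actually does.

```python
def route_fact_event(event):
    """Return (action, row) for a Debezium change event on a fact source table."""
    op = event["op"]
    if op == "c":
        return ("insert", event["after"])        # new fact row
    if op == "u":
        return ("upsert", event["after"])        # correct the existing row
    if op == "d":
        return ("soft_delete", event["before"])  # flag rather than remove
    raise ValueError(f"unknown op: {op}")

# An insert event in the Debezium envelope shape shown earlier
insert_event = {"op": "c", "before": None,
                "after": {"order_id": 50421, "quantity": 2}}
action, row = route_fact_event(insert_event)
```

Note that deletes carry their payload in `before` (the `after` field is null), which is why the delete branch reads the other side of the envelope.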

Most e-commerce and SaaS systems generate far more inserts than updates on their transactional tables, so the fact loading path is dominated by append operations. This is good news for streaming because appends are the simplest and fastest warehouse operation.

How CDC Events Map to Dimensions

Dimensions are trickier. A CDC event on your customers table might signal that a customer changed their address. The question is: do you overwrite the old address or keep both versions? This is the slowly changing dimension (SCD) problem, and the answer determines your entire dimension loading strategy.

SCD Type 1: Overwrite

Type 1 simply replaces the old value with the new one. When a CDC update arrives for a customer’s city, you update the existing row in customer_dim. History is lost, but the dimension stays compact and queries always reflect the current state.

This works well for attributes that are corrections (fixing a misspelled name) or where historical accuracy of the dimension attribute does not affect your analysis.

In a streaming pipeline, Type 1 is the simplest path. Each CDC update event triggers an UPDATE or MERGE on the dimension table, using the natural key (e.g., customer_id) to find the target row.
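As an illustration of those overwrite semantics, here is a minimal in-memory sketch of Type 1, with the dimension modeled as a dict keyed by the natural key (a stand-in for the UPDATE/MERGE your warehouse would run):

```python
customer_dim = {}  # natural key -> current attributes only

def apply_type1(event):
    """Apply a Debezium event to the dimension with overwrite semantics."""
    key = (event["after"] or event["before"])["customer_id"]
    if event["op"] == "d":
        customer_dim.pop(key, None)
    else:  # "c" or "u": latest values win, history is discarded
        customer_dim[key] = event["after"]

apply_type1({"op": "c", "before": None,
             "after": {"customer_id": 1782, "city": "Austin"}})
apply_type1({"op": "u", "before": {"customer_id": 1782, "city": "Austin"},
             "after": {"customer_id": 1782, "city": "Denver"}})
```

After the update, the dimension holds exactly one row for the customer, with Denver as the city and no trace of Austin.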

SCD Type 2: Versioned History

Type 2 preserves history by creating a new row for each change. When a customer moves from Austin to Denver, the Austin row gets an effective_end_date set to the current timestamp, and a new row is inserted with Denver as the city and an open-ended effective_end_date (often 9999-12-31). A new surrogate key is generated for the Denver row.

This means your fact table can correctly associate old orders with the Austin address and new orders with the Denver address, because each version of the customer has its own surrogate key.

In streaming, Type 2 requires more work:

  1. Receive the CDC update event (which contains both before and after values).
  2. Close out the current dimension row by setting effective_end_date and marking is_current = false.
  3. Insert a new dimension row with the after values, a new surrogate key, effective_start_date set to the event timestamp, and is_current = true.
  4. All subsequent fact events for that customer must resolve to the new surrogate key.

The before and after fields in a CDC event are exactly what you need for this. The before object gives you the values to match the existing row, and the after object gives you the values for the new row.
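The four steps above can be sketched in Python. This is an in-memory model for illustration only; the incrementing surrogate-key counter is a placeholder for a real key strategy, and the field names mirror the customer_dim example.

```python
from datetime import datetime

dim_rows = []    # all versions of all customers, newest last
_next_key = [1]  # placeholder surrogate-key counter

def apply_type2(event, event_ts):
    key = event["after"]["customer_id"]
    # Step 2: close out the current row for this natural key, if any
    for existing in dim_rows:
        if existing["customer_id"] == key and existing["is_current"]:
            existing["effective_end_date"] = event_ts
            existing["is_current"] = False
    # Step 3: insert the new version with a new surrogate key
    dim_rows.append({**event["after"],
                     "customer_key": _next_key[0],
                     "effective_start_date": event_ts,
                     "effective_end_date": datetime(9999, 12, 31),
                     "is_current": True})
    _next_key[0] += 1

apply_type2({"op": "c", "after": {"customer_id": 1782, "city": "Austin"}},
            datetime(2026, 1, 10))
apply_type2({"op": "u", "before": {"customer_id": 1782, "city": "Austin"},
             "after": {"customer_id": 1782, "city": "Denver"}},
            datetime(2026, 2, 25))
```

After the update, the Austin row is closed out (its effective_end_date is the update timestamp) and the Denver row is current with its own surrogate key, which is exactly what lets facts resolve to the right version.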

Surrogate Key Generation in Streaming

Batch ETL systems typically use the warehouse’s auto-increment sequence or a lookup table to assign surrogate keys. In a streaming pipeline, you need a strategy that works without round-tripping to the warehouse for each event.

Several approaches work:

  • Deterministic hashing: Compute a hash of the natural key plus a version identifier (e.g., MD5(customer_id || effective_start_date)). This is reproducible and does not require coordination, but collisions are theoretically possible (and vanishingly rare with modern hash functions).
  • Snowflake-style IDs: Use a distributed ID generator that produces globally unique, time-ordered 64-bit integers. Flink can integrate with libraries that implement this pattern.
  • Warehouse-assigned on load: Let the warehouse assign the surrogate key during the merge operation. This offloads the problem but means you cannot reference the key until the row lands.

For most streaming star schema pipelines, deterministic hashing strikes the best balance. It requires no external state, works across parallel Flink subtasks, and produces stable keys that are the same if you reprocess the same event.
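A minimal sketch of the deterministic-hashing approach: hash the natural key plus a version identifier and truncate to 63 bits, which yields a positive BIGINT-compatible key. The separator and truncation choices here are assumptions, not a fixed convention.

```python
import hashlib

def surrogate_key(natural_key, version_id):
    """Deterministic 63-bit surrogate key from natural key + version id."""
    digest = hashlib.md5(f"{natural_key}|{version_id}".encode()).digest()
    # Take the first 8 bytes and clear the sign bit -> fits a signed BIGINT
    return int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF

k1 = surrogate_key(1782, "2026-02-25T14:00:00Z")
k2 = surrogate_key(1782, "2026-02-25T14:00:00Z")  # reprocessed event: same key
k3 = surrogate_key(1782, "2026-03-01T09:15:00Z")  # new version: new key
```

Because the key is a pure function of its inputs, reprocessing the same event produces the same key, which is what makes this strategy safe under replays and across parallel subtasks.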

Expressing the Transformation in Flink SQL

Flink SQL lets you express these transformations declaratively. Here is a simplified example of joining CDC streams from orders and customers to produce enriched fact rows.

First, define your CDC source tables. If you are using Streamkap to ingest CDC into Kafka, these map to the Kafka topics Streamkap produces:

CREATE TABLE orders_cdc (
  order_id INT,
  customer_id INT,
  product_id INT,
  quantity INT,
  total_amount DECIMAL(10, 2),
  order_date TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'kafka',
  'topic' = 'streamkap.public.orders',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'debezium-json'
);

CREATE TABLE customers_cdc (
  customer_id INT,
  customer_name STRING,
  email STRING,
  city STRING,
  state STRING,
  signup_date DATE,
  PRIMARY KEY (customer_id) NOT ENFORCED
) WITH (
  'connector' = 'kafka',
  'topic' = 'streamkap.public.customers',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'debezium-json'
);

Now you can join the two streams. The simplified example below uses a regular join, which keeps each fact row in sync with the latest customer values; a production pipeline that needs point-in-time accuracy would instead use a temporal join (FOR SYSTEM_TIME AS OF), which requires event-time attributes and watermarks on the source tables, to match each order with the customer version as it existed at the time of the order:

CREATE TABLE orders_fact_sink (
  order_id INT,
  customer_key BIGINT,
  product_key INT,
  order_date_key INT,
  quantity INT,
  total_amount DECIMAL(10, 2),
  customer_name STRING,
  customer_city STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:postgresql://warehouse:5432/analytics',
  'table-name' = 'orders_fact'
);

INSERT INTO orders_fact_sink
SELECT
  o.order_id,
  HASH_CODE(CONCAT(CAST(c.customer_id AS STRING), '-', CAST(c.signup_date AS STRING))) AS customer_key,
  o.product_id AS product_key,
  CAST(DATE_FORMAT(o.order_date, 'yyyyMMdd') AS INT) AS order_date_key,
  o.quantity,
  o.total_amount,
  c.customer_name,
  c.city AS customer_city
FROM orders_cdc AS o
JOIN customers_cdc AS c
  ON o.customer_id = c.customer_id;

This is a simplified example. A production pipeline would include error handling, schema evolution logic, and likely a separate job for maintaining the dimension tables themselves. But the core pattern holds: CDC events flow through Flink, get joined and reshaped, and land in the warehouse already in star schema form.

Handling Late-Arriving Facts

Late-arriving facts are events that reference a dimension value that has already changed by the time the fact arrives. Suppose a customer moved from Austin to Denver at 2:00 PM, and an order placed at 1:55 PM only arrives in the stream at 2:05 PM. If you naively look up the current customer dimension, you would associate that order with Denver, not Austin.

There are several strategies to handle this:

  • Event-time processing: Flink’s event-time semantics let you process events based on their order_date rather than when they arrive. Combined with a temporal join that uses the customer’s version history, you can match the order to the correct dimension version.
  • Watermarks and allowed lateness: Configure Flink watermarks to tolerate a certain amount of lateness (say, 10 minutes). Events that arrive within this window are processed correctly. Events beyond the watermark are either dropped or routed to a dead-letter topic for manual correction.
  • Correction jobs: Run a periodic reconciliation job that compares the fact table against the dimension history and fixes any mismatched keys. This is a safety net, not the primary mechanism.

In practice, most CDC pipelines have very low latency (seconds, not minutes), so late arrivals are rare. But the edge case matters for financial or regulatory workloads where historical accuracy is mandatory.
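The event-time lookup against dimension version history can be sketched as a range scan over the Type 2 rows for one customer. The version data below is illustrative, reusing the Austin-to-Denver move from the example above.

```python
from datetime import datetime

customer_versions = [  # Type 2 history for one customer
    {"customer_key": 101, "city": "Austin",
     "start": datetime(2025, 6, 1), "end": datetime(2026, 2, 25, 14, 0)},
    {"customer_key": 102, "city": "Denver",
     "start": datetime(2026, 2, 25, 14, 0), "end": datetime(9999, 12, 31)},
]

def version_at(versions, event_time):
    """Return the dimension version effective at event_time, or None."""
    for v in versions:
        if v["start"] <= event_time < v["end"]:
            return v
    return None

# The order placed at 1:55 PM but arriving at 2:05 PM still resolves to Austin,
# because the lookup uses the order's event time, not its arrival time.
late_order = version_at(customer_versions, datetime(2026, 2, 25, 13, 55))
```

This is the logic a temporal join performs internally: the effective-date ranges, not the arrival order of events, decide which surrogate key a fact receives.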

Practical Loading Patterns

How you actually land the transformed rows in your warehouse depends on the destination. Here are common patterns:

Snowflake: Use Snowpipe Streaming to load fact rows as they arrive. Streamkap supports Snowflake as a destination, writing directly through the Snowpipe Streaming API with sub-minute latency. For SCD Type 2 dimensions, you can use Snowflake’s MERGE statement triggered by a stream on the staging table.

BigQuery: BigQuery’s Storage Write API supports streaming inserts. Fact rows can land in real time. Dimension updates typically go through a merge query on a micro-batch cadence (every few minutes) since BigQuery does not support single-row upserts as efficiently.

ClickHouse: ClickHouse’s ReplacingMergeTree engine handles SCD Type 1 natively by keeping only the latest version of each row (based on a version column). Fact tables use MergeTree with append-only inserts, which matches the streaming insert pattern well.

Databricks / Delta Lake: Delta Lake’s MERGE INTO supports both SCD Type 1 and Type 2 patterns. Flink can write to Delta Lake using the Delta connector, and the merge logic runs as part of the table maintenance.

Streamkap connects to all of these destinations and handles the CDC ingestion from source databases into Kafka. The Flink transformation layer sits between Streamkap’s Kafka topics and the warehouse sink, or if your transformation needs are simpler (e.g., column filtering, renaming, basic joins), Streamkap’s built-in transforms can handle them without a separate Flink deployment.

What This Looks Like End to End

Putting the pieces together, a streaming star schema pipeline looks like this:

  1. Source databases (PostgreSQL, MySQL, MongoDB) have CDC enabled.
  2. Streamkap captures changes and writes them to Kafka topics, one per source table, in Debezium format.
  3. Flink SQL jobs consume from those topics, join and reshape events, generate surrogate keys, and apply SCD logic.
  4. Warehouse sink connectors write the transformed fact and dimension rows to Snowflake, BigQuery, ClickHouse, or another analytical store.
  5. BI tools query the star schema and see data that is minutes old instead of hours or days old.

The batch ETL version of this same pipeline would have a scheduler (Airflow, dbt Cloud) orchestrating extract queries, staging table loads, transformation SQL, and final inserts, with the warehouse lagging behind the sources by the schedule interval plus the job’s runtime. The streaming version runs continuously: no schedule, no orchestrator, and no gap between when data changes and when it is queryable.

Trade-offs Worth Knowing

Streaming star schema loading is not free. A few things to watch for:

  • Ordering guarantees: CDC events for the same row arrive in order within a single Kafka partition, but events across different tables may interleave. Your Flink job needs to handle the case where a fact event arrives before the corresponding dimension event. Temporal joins with buffering handle this, but they consume memory.
  • Schema evolution: When a column is added to a source table, the CDC format changes. Your Flink jobs and sink schemas need to adapt. Streamkap handles schema evolution on the CDC side, but the Flink SQL and warehouse DDL need updating too.
  • Idempotency: Network failures and restarts mean events can be delivered more than once. Your sink must handle duplicate writes gracefully, typically through upserts on the primary key rather than blind inserts.
  • Complexity: A streaming pipeline has more moving parts than a batch SQL script. Kafka, Flink, connectors, and monitoring all need care and feeding. The payoff is freshness, but the operational cost is real.
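The idempotency point above reduces to upserting on the primary key instead of blindly inserting. A minimal sketch, with a dict standing in for the warehouse table:

```python
orders_fact = {}  # primary key -> row, standing in for the warehouse table

def upsert(row):
    """Write a fact row keyed by primary key; redelivery overwrites in place."""
    orders_fact[row["order_id"]] = row

event = {"order_id": 50421, "quantity": 2, "total_amount": 79.98}
upsert(event)
upsert(event)  # redelivered after a restart: no duplicate row
```

Because the write is keyed, at-least-once delivery from Kafka or a Flink restart produces the same table state as exactly-once delivery would.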

For teams already running CDC pipelines for replication or event-driven architectures, adding a star schema transformation layer is an incremental step rather than a wholesale infrastructure change. If you are using Streamkap for CDC ingestion, the Kafka topics are already there. The work is in writing the Flink SQL that reshapes them.

The data warehouse is not going away, and neither is the star schema. What is changing is the expectation of freshness. Analysts no longer accept “yesterday’s data.” Building dimensional models from CDC streams gives them what they actually want: a well-modeled, query-friendly schema that reflects what is happening right now.