
Engineering

February 25, 2026


Real-Time Data Preparation: Getting Raw Data Analytics-Ready as It Flows

Learn how to normalize, clean, and transform raw CDC and streaming data into analytics-ready datasets - schema handling, timestamps, NULLs, and star schemas.

TL;DR:

  • Raw CDC data requires normalization before it is useful for analytics - schema drift, NULL values, and inconsistent timestamps are the main culprits.
  • Event time and processing time are fundamentally different; choosing the wrong one corrupts your analytics.
  • Building a star schema from change streams is achievable with the right partitioning and upsert strategy.
  • Data preparation should happen as close to the source as possible to reduce downstream complexity.

Introduction

A CDC event lands in your Kafka topic. It contains a timestamp, a handful of columns, some NULLs, a string that might be a date, and an operation type (c, u, or d). Congratulations - you have data. Now try running a business query against it.

Raw streaming data is not analytics-ready data. The gap between “data arrives” and “data is useful” is where most real-time data projects stumble. Teams invest heavily in getting data moving - Kafka, connectors, replication - and then discover that the hard work is actually in what happens next: normalizing schemas, resolving timestamps, handling missing values, enforcing types, and shaping streams into dimensional models that analysts can actually query.

This guide covers the full data preparation layer for streaming and CDC pipelines: what to fix, how to fix it, and where to fix it in your architecture.


Why Raw CDC Data Is Not Query-Ready

Change Data Capture tools replicate the exact shape of source database tables. That sounds useful until you realize that operational databases are optimized for writes, not reads. The schema decisions made five years ago by a backend team under deadline pressure are now your problem.

Common issues in raw CDC payloads:

  • Inconsistent date formats - Some columns use Unix epoch milliseconds, others use ISO 8601 strings, others use database-native types that serialize differently per connector.
  • String columns holding structured data - JSON blobs stored as TEXT, comma-separated lists in a single column, pipe-delimited identifiers.
  • Implicit NULL semantics - An empty string "" and a NULL often mean the same thing in operational systems but behave differently in analytics.
  • Denormalized operational tables - Columns that repeat data from parent tables because JOINs were too slow on the write path.
  • Missing business context - A status column with values 1, 2, 3 that are only meaningful if you look up a separate reference table.
  • Before/after payloads - CDC tools emit both the pre-change and post-change state of a row. Analytical queries typically want only the current state.
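These issues rarely arrive one at a time. A single raw payload often exhibits several at once - the sketch below is a hypothetical example (field names and values invented for illustration) of what lands in a raw topic:

```python
# A hypothetical raw CDC row payload illustrating the issues above.
# All field names and values are invented for illustration.
raw_event = {
    "order_id": "10293",                      # numeric ID serialized as a string
    "created_at": 1719427200000,              # Unix epoch milliseconds
    "shipped_at": "2024-06-27T14:02:11Z",     # ISO 8601 string - a different format
    "customer_email": "",                     # empty string standing in for NULL
    "tags": "priority,gift,repeat-customer",  # comma-separated list in one column
    "metadata": '{"source": "mobile_app"}',   # JSON blob stored as TEXT
    "status": 2,                              # code meaningful only via a lookup table
    "__op": "u",                              # CDC operation type: update
}

# Each quirk needs an explicit rule before this row is analytics-ready:
# parse both timestamp formats, map "" to NULL, split or extract the
# structured strings, and join status codes to their reference table.
```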

The goal of data preparation is to convert this raw operational snapshot into a clean, typed, semantically enriched dataset that analysts can query with confidence.


Schema Normalization

Enforcing Types Early

Type coercion should happen as early as possible in the pipeline - ideally in a stream processor that reads from raw topics and writes to curated topics. Allowing untyped strings to flow all the way to the data warehouse means every downstream consumer must independently solve the same coercion problem.

A practical approach using SQL in a stream processor:

-- Normalize raw order events from CDC
SELECT
  CAST(order_id AS BIGINT)                           AS order_id,
  CAST(customer_id AS BIGINT)                        AS customer_id,
  TRIM(UPPER(status))                                AS status,
  CAST(total_amount AS DECIMAL(18,2))                AS total_amount,
  TO_TIMESTAMP(created_at, 'yyyy-MM-dd HH:mm:ss')   AS created_at,
  CURRENT_TIMESTAMP                                  AS _processed_at,
  __op                                               AS _cdc_operation
FROM raw_orders_cdc
WHERE __op IN ('c', 'u', 'r');

Key decisions here:

  • Explicit CAST for every column - no implicit coercion that might silently truncate data
  • String normalization (TRIM, UPPER) applied consistently
  • CDC metadata columns (__op, __ts_ms) preserved with a _ prefix so consumers can distinguish them from business columns

Handling Schema Drift

Source schemas change. A column gets added, renamed, or dropped. If your pipeline treats the source schema as immutable, any change will break production.

Strategies for resilient schema handling:

Strategy             | How It Works                                                                   | Best For
Schema Registry      | Producers register schemas; consumers validate compatibility before consuming  | High-throughput Kafka pipelines
Schema-on-read       | Store raw JSON/Avro; apply schema at query time                                | Exploratory analytics, low-volume
Schema versioning    | Route schema versions to separate topics/tables                                | Controlled migration environments
Additive-only policy | Only allow new nullable columns; block drops/renames                           | Teams with mixed maturity

For most production CDC pipelines, a schema registry with backward compatibility enforced is the right default. New nullable columns are allowed; everything else requires a migration plan.
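An additive-only policy can be enforced with a small compatibility check before a new schema version is accepted. A minimal sketch, assuming schemas are represented as column-to-type dicts and a trailing "?" marks a nullable type (both conventions invented here; a real pipeline would use its schema registry's compatibility API):

```python
def is_additive_change(old_schema: dict, new_schema: dict) -> bool:
    """Allow only new nullable columns; reject drops, renames, and type changes."""
    for col, col_type in old_schema.items():
        # Every existing column must survive with the same type.
        if new_schema.get(col) != col_type:
            return False
    # Any brand-new column must be nullable so existing rows stay valid.
    added = set(new_schema) - set(old_schema)
    return all(new_schema[col].endswith("?") for col in added)  # "?" = nullable

old     = {"order_id": "BIGINT", "status": "VARCHAR"}
ok      = {"order_id": "BIGINT", "status": "VARCHAR", "coupon_code": "VARCHAR?"}
dropped = {"order_id": "BIGINT"}

print(is_additive_change(old, ok))       # True: one new nullable column
print(is_additive_change(old, dropped))  # False: 'status' was dropped
```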


Timestamp Handling

Timestamps are the most error-prone element in streaming data preparation. Getting them wrong produces dashboards that show sales in the future, latency metrics that are negative, and aggregations that double-count events.

The Four Timestamps in a CDC Pipeline

Every CDC event carries multiple timestamps, each measuring something different:

  1. Database commit time (__ts_ms in Debezium) - When the transaction was committed to the source database. This is the closest proxy to event time.
  2. Source row timestamp (e.g., created_at, updated_at) - A timestamp stored in the source table. This is application-defined and may be set by application code before the write, making it potentially earlier than the commit time.
  3. Connector ingest time - When the CDC connector read the event from the transaction log. Usually milliseconds after commit time, but can lag significantly if the connector is behind.
  4. Processing time - When your stream processor or warehouse received and processed the event. This is the least meaningful for business analytics.
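In a Debezium-style envelope, the first three timestamps can be extracted and compared to measure lag at each hop. A sketch with a simplified event (values invented; field placement follows Debezium's convention of `source.ts_ms` for commit time and top-level `ts_ms` for connector read time, but verify against your connector version):

```python
import time

# Simplified Debezium-style change event (values invented for illustration).
event = {
    "payload": {
        "after": {"order_id": 10293, "updated_at": "2024-06-27T14:02:09Z"},
        "source": {"ts_ms": 1719495731000},  # 1. database commit time
        "ts_ms": 1719495731450,              # 3. connector read time
        "op": "u",
    }
}

commit_ms     = event["payload"]["source"]["ts_ms"]     # closest proxy to event time
row_ts        = event["payload"]["after"]["updated_at"] # 2. application-defined
connector_ms  = event["payload"]["ts_ms"]               # connector ingest time
processing_ms = int(time.time() * 1000)                 # 4. when we handle it

connector_lag_ms = connector_ms - commit_ms
print(connector_lag_ms)  # 450 - the connector read ~half a second after commit
```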

Event Time vs Processing Time

This distinction matters enormously for aggregate queries.

Event time answers: “How many orders were placed on Tuesday?” Processing time answers: “How many order events did we process on Tuesday?”

Under normal conditions these are nearly identical. But when:

  • Your pipeline has a backlog and is catching up
  • A source system was offline for hours
  • You are replaying historical data during a migration

…processing time will be wildly wrong for business metrics. Always use event time (database commit time or source row timestamp) for business aggregations.

-- Wrong: uses processing time
SELECT DATE(_processed_at) AS date, COUNT(*) AS orders
FROM order_events
GROUP BY 1;

-- Right: uses event time from source
SELECT DATE(event_time) AS date, COUNT(*) AS orders
FROM (
  SELECT TO_TIMESTAMP(created_at) AS event_time
  FROM order_events
)
GROUP BY 1;

Timezone Normalization

Normalize all timestamps to UTC at ingestion time. Never store timezone-aware timestamps in mixed-offset formats. Apply timezone conversion only at the presentation layer (BI tool or report).

-- Convert application-local timestamp to UTC
CONVERT_TZ(created_at, 'America/New_York', 'UTC') AS created_at_utc
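The same normalization in application code, using the Python standard library (zoneinfo requires Python 3.9+; the source timezone must be known, not guessed):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A timestamp stored in application-local time, known to be America/New_York.
local = datetime(2024, 6, 27, 10, 2, 11, tzinfo=ZoneInfo("America/New_York"))

# Normalize to UTC once, at ingestion; convert back only at the presentation layer.
utc = local.astimezone(timezone.utc)
print(utc.isoformat())  # 2024-06-27T14:02:11+00:00  (EDT is UTC-4 in June)
```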

NULL Handling Strategies

A NULL in an analytics dataset is ambiguous. Does it mean the data was never collected? That the field does not apply? That the source system allows nulls indiscriminately?

Classify NULLs Before Deciding

Before choosing a handling strategy, classify the NULL semantics for each column:

NULL Type      | Meaning                                 | Example                             | Recommended Handling
Missing        | Data was not collected                  | phone_number before mobile era      | Coalesce to empty string or flag column
Unknown        | Value exists but is not known           | age of anonymous user               | Preserve NULL, add _is_known boolean
Not applicable | Field does not apply to this record     | shipping_address for digital goods  | Coalesce to 'N/A' or filter from dimension
System NULL    | Source system bug or migration artifact | Orphaned foreign keys               | Route to dead-letter queue for investigation

Practical Patterns

Coalesce with a sentinel value - Safe when NULLs mean “missing” and the sentinel will not appear naturally:

COALESCE(region, 'UNKNOWN') AS region

Add a boolean presence flag - Useful for nullable metrics where the NULL itself is meaningful:

discount_amount,
(discount_amount IS NOT NULL) AS has_discount

Dead-letter routing - When a NULL indicates a data quality problem that should not silently propagate:

-- Route to DLQ if required business key is NULL
CASE WHEN customer_id IS NULL
  THEN 'dead_letter'
  ELSE 'curated'
END AS output_topic

String Cleaning

Operational databases accumulate decades of inconsistent string data. Before strings reach analytics, apply a standard cleaning pipeline:

SELECT
  -- Whitespace normalization
  TRIM(company_name)                          AS company_name,

  -- Case normalization for lookups
  UPPER(TRIM(country_code))                   AS country_code,

  -- Remove non-printable characters
  REGEXP_REPLACE(notes, '[\\x00-\\x1F]', '') AS notes,

  -- Extract structured data from embedded JSON
  JSON_VALUE(metadata, '$.source')            AS acquisition_source,

  -- Standardize boolean-as-string patterns
  CASE LOWER(TRIM(is_active))
    WHEN 'true' THEN true
    WHEN '1'    THEN true
    WHEN 'yes'  THEN true
    ELSE false
  END AS is_active
FROM raw_customers;

Building Star Schemas from Change Streams

A star schema organizes data into fact tables (events, transactions) and dimension tables (customers, products, locations). Maintaining a star schema from a CDC stream requires handling upserts correctly.

The Upsert Problem

CDC streams emit three operation types: insert (c), update (u), and delete (d). Analytical stores generally do not support row-level updates natively (columnar formats are optimized for appends). The solutions:

Merge/Upsert targets - Apache Iceberg, Delta Lake, and Apache Hudi all support MERGE operations. A streaming job applies CDC events as upserts, maintaining a current-state table:

-- Iceberg MERGE for dimension upserts
MERGE INTO dim_customers AS target
USING (
  SELECT customer_id, name, email, region, event_time
  FROM customer_cdc_stream
  WHERE _cdc_operation IN ('c', 'u')
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET name = source.name, email = source.email,
             region = source.region, updated_at = source.event_time
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, region, created_at, updated_at)
  VALUES (source.customer_id, source.name, source.email,
          source.region, source.event_time, source.event_time);

Append-only with deduplication - Append all CDC events including operation type, then use a ROW_NUMBER() window function to select the latest event per key, dropping keys whose latest event is a delete:

SELECT * FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY event_time DESC
    ) AS rn
  FROM customer_events
) WHERE rn = 1
  AND _cdc_operation != 'd';

Note the ordering: deduplicate first, then filter deletes. Filtering deletes before deduplication would resurrect deleted rows, because the key's previous state would become its "latest" event.
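The pattern can be verified end to end in an embedded database. A sketch using SQLite via Python's sqlite3 module (window functions need SQLite 3.25+; table and values invented). It deduplicates first and filters deletes afterward, so a key whose latest event is a delete disappears from the result:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_events (
        customer_id INTEGER, email TEXT, _cdc_operation TEXT, event_time TEXT);
    INSERT INTO customer_events VALUES
        (1, 'old@example.com',  'c', '2024-06-01T00:00:00'),
        (1, 'new@example.com',  'u', '2024-06-02T00:00:00'),  -- latest for key 1
        (2, 'gone@example.com', 'c', '2024-06-01T00:00:00'),
        (2, NULL,               'd', '2024-06-03T00:00:00');  -- latest for key 2
""")

rows = conn.execute("""
    SELECT customer_id, email FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY customer_id ORDER BY event_time DESC) AS rn
        FROM customer_events
    ) WHERE rn = 1 AND _cdc_operation != 'd'
""").fetchall()

# Key 1 resolves to its latest update; key 2 is gone because its
# latest event was a delete.
print(rows)  # [(1, 'new@example.com')]
```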

Partitioning for Query Performance

Partition your fact tables by date (using event time, not processing time). This enables partition pruning - queries that filter by date range read only relevant partitions instead of the full table.

-- Partition fact table by event date
CREATE TABLE fact_orders (
  order_id      BIGINT,
  customer_id   BIGINT,
  product_id    BIGINT,
  amount        DECIMAL(18,2),
  event_time    TIMESTAMP,
  event_date    DATE  -- partition key, derived from event_time
)
PARTITIONED BY (event_date);

For dimension tables, partitioning is less critical because dimensions are small. Focus instead on clustering by the primary key so lookups are fast.


Where to Prepare Data in Your Architecture

Data preparation can happen at multiple points in the pipeline. The right answer depends on your latency requirements and team structure:

Layer            | Tool Examples                           | Latency       | Best For
At the connector | Kafka Connect SMTs, Debezium transforms | Milliseconds  | Simple field renaming, routing
Stream processor | Apache Flink, Kafka Streams             | Seconds       | Joins, aggregations, complex transforms
ELT in warehouse | dbt, Spark                              | Minutes–hours | Heavy historical backfills
BI/query layer   | Looker LookML, Metabase                 | Query time    | Presentation-only transforms

The general principle: apply corrections as early as possible, deferring only decisions that require warehouse-scale compute (complex deduplication across billions of rows) or business logic that changes frequently (better handled in dbt where analysts own it).

Platforms like Streamkap apply transformations at the connector layer before data lands in the destination, which reduces the volume of cleanup work needed downstream and keeps raw and curated data separate from the start.


A Practical Preparation Checklist

Before declaring a streaming dataset analytics-ready, verify:

  • All columns have explicit types - no implicit coercion at query time
  • All timestamps are in UTC and use event time, not processing time
  • NULL semantics are documented per column with a defined handling strategy
  • String columns are trimmed and have consistent casing for categorical values
  • CDC operation type is preserved as metadata but filtered from analytical queries
  • Schema changes are handled via a registry or versioning policy
  • Fact tables are partitioned by event date
  • Dimension tables support upserts (merge or deduplication view)
  • A data quality check runs on each batch (row counts, NULL rates, type errors)
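The last checklist item can start very small. A minimal per-batch quality gate (thresholds, column names, and the dict-per-row batch shape are all placeholders for your own pipeline):

```python
def check_batch(rows, required_keys=("order_id",), max_null_rate=0.05):
    """Basic per-batch quality gate: row count plus NULL rates on required keys."""
    if not rows:
        raise ValueError("empty batch: upstream may be stalled")
    failures = []
    for key in required_keys:
        null_rate = sum(1 for r in rows if r.get(key) is None) / len(rows)
        if null_rate > max_null_rate:
            failures.append(f"{key}: NULL rate {null_rate:.1%} exceeds threshold")
    return failures  # empty list means the batch passed

batch = [{"order_id": 1}, {"order_id": None}, {"order_id": 3}, {"order_id": 4}]
print(check_batch(batch))  # ['order_id: NULL rate 25.0% exceeds threshold']
```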

Summary

The distance between raw CDC events and a reliable analytics dataset is not primarily a technology problem - it is a data modeling and discipline problem. The tools exist (stream processors, schema registries, lakehouse formats). The work is in applying them consistently: normalizing schemas before problems propagate, picking event time over processing time, documenting NULL semantics, and building upsert-capable dimension tables.

Teams that invest in this preparation layer early find that their downstream analytics are faster to build, easier to trust, and cheaper to maintain. Those that skip it spend most of their analytical capacity debugging data quality issues instead of answering business questions.

For related reading, see the guide on Change Data Capture fundamentals and the overview of streaming architectures for analytics.