
Engineering

February 25, 2026


Real-Time Data Preparation: Getting Raw Data Analytics-Ready as It Flows

Learn how to normalize, clean, and transform raw CDC and streaming data into analytics-ready datasets - schema handling, timestamps, NULLs, and star schemas.

TL;DR:

  • Raw CDC data requires normalization before it is useful for analytics - schema drift, NULL values, and inconsistent timestamps are the main culprits.
  • Event time and processing time are fundamentally different; choosing the wrong one corrupts your analytics.
  • Building a star schema from change streams is achievable with the right partitioning and upsert strategy.
  • Data preparation should happen as close to the source as possible to reduce downstream complexity.

Introduction

A CDC event lands in your Kafka topic. It contains a timestamp, a handful of columns, some NULLs, a string that might be a date, and an operation type (c, u, or d). Congratulations - you have data. Now try running a business query against it.

Raw streaming data is not analytics-ready data. The gap between “data arrives” and “data is useful” is where most real-time data projects stumble. Teams invest heavily in getting data moving - Kafka, connectors, replication - and then discover that the hard work is actually in what happens next: normalizing schemas, resolving timestamps, handling missing values, enforcing types, and shaping streams into dimensional models that analysts can actually query.

This guide covers the full data preparation layer for streaming and CDC pipelines: what to fix, how to fix it, and where to fix it in your architecture.


Why Raw CDC Data Is Not Query-Ready

Change Data Capture tools replicate the exact shape of source database tables. That sounds useful until you realize that operational databases are optimized for writes, not reads. The schema decisions made five years ago by a backend team under deadline pressure are now your problem.

Common issues in raw CDC payloads:

  • Inconsistent date formats - Some columns use Unix epoch milliseconds, others use ISO 8601 strings, others use database-native types that serialize differently per connector.
  • String columns holding structured data - JSON blobs stored as TEXT, comma-separated lists in a single column, pipe-delimited identifiers.
  • Implicit NULL semantics - An empty string "" and a NULL often mean the same thing in operational systems but behave differently in analytics.
  • Denormalized operational tables - Columns that repeat data from parent tables because JOINs were too slow on the write path.
  • Missing business context - A status column with values 1, 2, 3 that are only meaningful if you look up a separate reference table.
  • Before/after payloads - CDC tools emit both the pre-change and post-change state of a row. Analytical queries typically want only the current state.
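These issues rarely arrive one at a time. A single raw payload often exhibits several at once - the sketch below is a hypothetical example (field names and values invented for illustration) of what lands in a raw topic:

```python
# A hypothetical raw CDC row payload illustrating the issues above.
# All field names and values are invented for illustration.
raw_event = {
    "order_id": "10293",                      # numeric ID serialized as a string
    "created_at": 1719427200000,              # Unix epoch milliseconds
    "shipped_at": "2024-06-27T14:02:11Z",     # ISO 8601 string - a different format
    "customer_email": "",                     # empty string standing in for NULL
    "tags": "priority,gift,repeat-customer",  # comma-separated list in one column
    "metadata": '{"source": "mobile_app"}',   # JSON blob stored as TEXT
    "status": 2,                              # code meaningful only via a lookup table
    "__op": "u",                              # CDC operation type: update
}

# Each quirk needs an explicit rule before this row is analytics-ready:
# parse both timestamp formats, map "" to NULL, split or extract the
# structured strings, and join status codes to their reference table.
```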

The goal of data preparation is to convert this raw operational snapshot into a clean, typed, semantically enriched dataset that analysts can query with confidence.


Schema Normalization

Enforcing Types Early

Type coercion should happen as early as possible in the pipeline - ideally in a stream processor that reads from raw topics and writes to curated topics. Allowing untyped strings to flow all the way to the data warehouse means every downstream consumer must independently solve the same coercion problem.

A practical approach using SQL in a stream processor:

-- Normalize raw order events from CDC
SELECT
  CAST(order_id AS BIGINT)                           AS order_id,
  CAST(customer_id AS BIGINT)                        AS customer_id,
  TRIM(UPPER(status))                                AS status,
  CAST(total_amount AS DECIMAL(18,2))                AS total_amount,
  TO_TIMESTAMP(created_at, 'yyyy-MM-dd HH:mm:ss')   AS created_at,
  CURRENT_TIMESTAMP                                  AS _processed_at,
  __op                                               AS _cdc_operation
FROM raw_orders_cdc
WHERE __op IN ('c', 'u', 'r');

Key decisions here:

  • Explicit CAST for every column - no implicit coercion that might silently truncate data
  • String normalization (TRIM, UPPER) applied consistently
  • CDC metadata columns (__op, __ts_ms) preserved with a _ prefix so consumers can distinguish them from business columns

Handling Schema Drift

Source schemas change. A column gets added, renamed, or dropped. If your pipeline treats the source schema as immutable, any change will break production.

Strategies for resilient schema handling:

Strategy             | How It Works                                                                   | Best For
Schema Registry      | Producers register schemas; consumers validate compatibility before consuming  | High-throughput Kafka pipelines
Schema-on-read       | Store raw JSON/Avro; apply schema at query time                                | Exploratory analytics, low-volume
Schema versioning    | Route schema versions to separate topics/tables                                | Controlled migration environments
Additive-only policy | Only allow new nullable columns; block drops/renames                           | Teams with mixed maturity

For most production CDC pipelines, a schema registry with backward compatibility enforced is the right default. New nullable columns are allowed; everything else requires a migration plan.
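An additive-only policy can be enforced with a small compatibility check before a new schema version is accepted. A minimal sketch, assuming schemas are represented as column-to-type dicts and a trailing "?" marks a nullable type (both conventions invented here; a real pipeline would use its schema registry's compatibility API):

```python
def is_additive_change(old_schema: dict, new_schema: dict) -> bool:
    """Allow only new nullable columns; reject drops, renames, and type changes."""
    for col, col_type in old_schema.items():
        # Every existing column must survive with the same type.
        if new_schema.get(col) != col_type:
            return False
    # Any brand-new column must be nullable so existing rows stay valid.
    added = set(new_schema) - set(old_schema)
    return all(new_schema[col].endswith("?") for col in added)  # "?" = nullable

old     = {"order_id": "BIGINT", "status": "VARCHAR"}
ok      = {"order_id": "BIGINT", "status": "VARCHAR", "coupon_code": "VARCHAR?"}
dropped = {"order_id": "BIGINT"}

print(is_additive_change(old, ok))       # True: one new nullable column
print(is_additive_change(old, dropped))  # False: 'status' was dropped
```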


Timestamp Handling

Timestamps are the most error-prone element in streaming data preparation. Getting them wrong produces dashboards that show sales in the future, latency metrics that are negative, and aggregations that double-count events.

The Four Timestamps in a CDC Pipeline

Every CDC event carries multiple timestamps, each measuring something different:

  1. Database commit time (__ts_ms in Debezium) - When the transaction was committed to the source database. This is the closest proxy to event time.
  2. Source row timestamp (e.g., created_at, updated_at) - A timestamp stored in the source table. This is application-defined and may be set by application code before the write, making it potentially earlier than the commit time.
  3. Connector ingest time - When the CDC connector read the event from the transaction log. Usually milliseconds after commit time, but can lag significantly if the connector is behind.
  4. Processing time - When your stream processor or warehouse received and processed the event. This is the least meaningful for business analytics.
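In a Debezium-style envelope, the first three timestamps can be extracted and compared to measure lag at each hop. A sketch with a simplified event (values invented; field placement follows Debezium's convention of `source.ts_ms` for commit time and top-level `ts_ms` for connector read time, but verify against your connector version):

```python
import time

# Simplified Debezium-style change event (values invented for illustration).
event = {
    "payload": {
        "after": {"order_id": 10293, "updated_at": "2024-06-27T14:02:09Z"},
        "source": {"ts_ms": 1719495731000},  # 1. database commit time
        "ts_ms": 1719495731450,              # 3. connector read time
        "op": "u",
    }
}

commit_ms     = event["payload"]["source"]["ts_ms"]     # closest proxy to event time
row_ts        = event["payload"]["after"]["updated_at"] # 2. application-defined
connector_ms  = event["payload"]["ts_ms"]               # connector ingest time
processing_ms = int(time.time() * 1000)                 # 4. when we handle it

connector_lag_ms = connector_ms - commit_ms
print(connector_lag_ms)  # 450 - the connector read ~half a second after commit
```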

Event Time vs Processing Time

This distinction matters enormously for aggregate queries.

Event time answers: “How many orders were placed on Tuesday?” Processing time answers: “How many order events did we process on Tuesday?”

Under normal conditions these are nearly identical. But when:

  • Your pipeline has a backlog and is catching up
  • A source system was offline for hours
  • You are replaying historical data during a migration

…processing time will be wildly wrong for business metrics. Always use event time (database commit time or source row timestamp) for business aggregations.

-- Wrong: uses processing time
SELECT DATE(_processed_at) AS date, COUNT(*) AS orders
FROM order_events
GROUP BY 1;

-- Right: uses event time from source
SELECT DATE(event_time) AS date, COUNT(*) AS orders
FROM (
  SELECT TO_TIMESTAMP(created_at) AS event_time
  FROM order_events
)
GROUP BY 1;

Timezone Normalization

Normalize all timestamps to UTC at ingestion time. Never store timezone-aware timestamps in mixed-offset formats. Apply timezone conversion only at the presentation layer (BI tool or report).

-- Convert application-local timestamp to UTC
CONVERT_TZ(created_at, 'America/New_York', 'UTC') AS created_at_utc
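The same normalization in application code, using the Python standard library (zoneinfo requires Python 3.9+; the source timezone must be known, not guessed):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A timestamp stored in application-local time, known to be America/New_York.
local = datetime(2024, 6, 27, 10, 2, 11, tzinfo=ZoneInfo("America/New_York"))

# Normalize to UTC once, at ingestion; convert back only at the presentation layer.
utc = local.astimezone(timezone.utc)
print(utc.isoformat())  # 2024-06-27T14:02:11+00:00  (EDT is UTC-4 in June)
```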

NULL Handling Strategies

A NULL in an analytics dataset is ambiguous. Does it mean the data was never collected? That the field does not apply? That the source system allows nulls indiscriminately?

Classify NULLs Before Deciding

Before choosing a handling strategy, classify the NULL semantics for each column:

NULL Type      | Meaning                                 | Example                             | Recommended Handling
Missing        | Data was not collected                  | phone_number before mobile era      | Coalesce to empty string or flag column
Unknown        | Value exists but is not known           | age of anonymous user               | Preserve NULL, add _is_known boolean
Not applicable | Field does not apply to this record     | shipping_address for digital goods  | Coalesce to 'N/A' or filter from dimension
System NULL    | Source system bug or migration artifact | Orphaned foreign keys               | Route to dead-letter queue for investigation

Practical Patterns

Coalesce with a sentinel value - Safe when NULLs mean “missing” and the sentinel will not appear naturally:

COALESCE(region, 'UNKNOWN') AS region

Add a boolean presence flag - Useful for nullable metrics where the NULL itself is meaningful:

discount_amount,
(discount_amount IS NOT NULL) AS has_discount

Dead-letter routing - When a NULL indicates a data quality problem that should not silently propagate:

-- Route to DLQ if required business key is NULL
CASE WHEN customer_id IS NULL
  THEN 'dead_letter'
  ELSE 'curated'
END AS output_topic

String Cleaning

Operational databases accumulate decades of inconsistent string data. Before strings reach analytics, apply a standard cleaning pipeline:

SELECT
  -- Whitespace normalization
  TRIM(company_name)                          AS company_name,

  -- Case normalization for lookups
  UPPER(TRIM(country_code))                   AS country_code,

  -- Remove non-printable characters
  REGEXP_REPLACE(notes, '[\\x00-\\x1F]', '') AS notes,

  -- Extract structured data from embedded JSON
  JSON_VALUE(metadata, '$.source')            AS acquisition_source,

  -- Standardize boolean-as-string patterns
  CASE LOWER(TRIM(is_active))
    WHEN 'true' THEN true
    WHEN '1'    THEN true
    WHEN 'yes'  THEN true
    ELSE false
  END AS is_active
FROM raw_customers;

Building Star Schemas from Change Streams

A star schema organizes data into fact tables (events, transactions) and dimension tables (customers, products, locations). Maintaining a star schema from a CDC stream requires handling upserts correctly.

The Upsert Problem

CDC streams emit three operation types: insert (c), update (u), and delete (d). Analytical stores generally do not support row-level updates natively (columnar formats are optimized for appends). The solutions:

Merge/Upsert targets - Apache Iceberg, Delta Lake, and Apache Hudi all support MERGE operations. A streaming job applies CDC events as upserts, maintaining a current-state table:

-- Iceberg MERGE for dimension upserts
MERGE INTO dim_customers AS target
USING (
  SELECT customer_id, name, email, region, event_time
  FROM customer_cdc_stream
  WHERE _cdc_operation IN ('c', 'u')
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET name = source.name, email = source.email,
             region = source.region, updated_at = source.event_time
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, email, region, created_at, updated_at)
  VALUES (source.customer_id, source.name, source.email,
          source.region, source.event_time, source.event_time);

Append-only with deduplication - Append all CDC events including operation type, then use a ROW_NUMBER() window function to select the latest event per key, dropping keys whose latest event is a delete:

SELECT * FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY event_time DESC
    ) AS rn
  FROM customer_events
) WHERE rn = 1
  AND _cdc_operation != 'd';

Note the ordering: deduplicate first, then filter deletes. Filtering deletes before deduplication would resurrect deleted rows, because the key's previous state would become its "latest" event.
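The pattern can be verified end to end in an embedded database. A sketch using SQLite via Python's sqlite3 module (window functions need SQLite 3.25+; table and values invented). It deduplicates first and filters deletes afterward, so a key whose latest event is a delete disappears from the result:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_events (
        customer_id INTEGER, email TEXT, _cdc_operation TEXT, event_time TEXT);
    INSERT INTO customer_events VALUES
        (1, 'old@example.com',  'c', '2024-06-01T00:00:00'),
        (1, 'new@example.com',  'u', '2024-06-02T00:00:00'),  -- latest for key 1
        (2, 'gone@example.com', 'c', '2024-06-01T00:00:00'),
        (2, NULL,               'd', '2024-06-03T00:00:00');  -- latest for key 2
""")

rows = conn.execute("""
    SELECT customer_id, email FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY customer_id ORDER BY event_time DESC) AS rn
        FROM customer_events
    ) WHERE rn = 1 AND _cdc_operation != 'd'
""").fetchall()

# Key 1 resolves to its latest update; key 2 is gone because its
# latest event was a delete.
print(rows)  # [(1, 'new@example.com')]
```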

Partitioning for Query Performance

Partition your fact tables by date (using event time, not processing time). This enables partition pruning - queries that filter by date range read only relevant partitions instead of the full table.

-- Partition fact table by event date
CREATE TABLE fact_orders (
  order_id      BIGINT,
  customer_id   BIGINT,
  product_id    BIGINT,
  amount        DECIMAL(18,2),
  event_time    TIMESTAMP,
  event_date    DATE  -- partition key, derived from event_time
)
PARTITIONED BY (event_date);

For dimension tables, partitioning is less critical because dimensions are small. Focus instead on clustering by the primary key so lookups are fast.


Where to Prepare Data in Your Architecture

Data preparation can happen at multiple points in the pipeline. The right answer depends on your latency requirements and team structure:

Layer            | Tool Examples                           | Latency       | Best For
At the connector | Kafka Connect SMTs, Debezium transforms | Milliseconds  | Simple field renaming, routing
Stream processor | Apache Flink, Kafka Streams             | Seconds       | Joins, aggregations, complex transforms
ELT in warehouse | dbt, Spark                              | Minutes–hours | Heavy historical backfills
BI/query layer   | Looker LookML, Metabase                 | Query time    | Presentation-only transforms

The general principle: apply corrections as early as possible, deferring only decisions that require warehouse-scale compute (complex deduplication across billions of rows) or business logic that changes frequently (better handled in dbt where analysts own it).

Platforms like Streamkap apply transformations at the connector layer before data lands in the destination, which reduces the volume of cleanup work needed downstream and keeps raw and curated data separate from the start.


A Practical Preparation Checklist

Before declaring a streaming dataset analytics-ready, verify:

  • All columns have explicit types - no implicit coercion at query time
  • All timestamps are in UTC and use event time, not processing time
  • NULL semantics are documented per column with a defined handling strategy
  • String columns are trimmed and have consistent casing for categorical values
  • CDC operation type is preserved as metadata but filtered from analytical queries
  • Schema changes are handled via a registry or versioning policy
  • Fact tables are partitioned by event date
  • Dimension tables support upserts (merge or deduplication view)
  • A data quality check runs on each batch (row counts, NULL rates, type errors)
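The last checklist item can start very small. A minimal per-batch quality gate (thresholds, column names, and the dict-per-row batch shape are all placeholders for your own pipeline):

```python
def check_batch(rows, required_keys=("order_id",), max_null_rate=0.05):
    """Basic per-batch quality gate: row count plus NULL rates on required keys."""
    if not rows:
        raise ValueError("empty batch: upstream may be stalled")
    failures = []
    for key in required_keys:
        null_rate = sum(1 for r in rows if r.get(key) is None) / len(rows)
        if null_rate > max_null_rate:
            failures.append(f"{key}: NULL rate {null_rate:.1%} exceeds threshold")
    return failures  # empty list means the batch passed

batch = [{"order_id": 1}, {"order_id": None}, {"order_id": 3}, {"order_id": 4}]
print(check_batch(batch))  # ['order_id: NULL rate 25.0% exceeds threshold']
```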

Summary

The distance between raw CDC events and a reliable analytics dataset is not primarily a technology problem - it is a data modeling and discipline problem. The tools exist (stream processors, schema registries, lakehouse formats). The work is in applying them consistently: normalizing schemas before problems propagate, picking event time over processing time, documenting NULL semantics, and building upsert-capable dimension tables.

Teams that invest in this preparation layer early find that their downstream analytics are faster to build, easier to trust, and cheaper to maintain. Those that skip it spend most of their analytical capacity debugging data quality issues instead of answering business questions.

For related reading, see the guide on Change Data Capture fundamentals and the overview of streaming architectures for analytics.