Handling NULLs in Streaming Data: Strategies and Pitfalls
How to deal with NULL values in real-time streaming pipelines. Covers NULL semantics in Flink SQL, common bugs, default value strategies, and NULL-safe join patterns.
Every data engineer has a NULL horror story. Maybe it was the dashboard that showed zero revenue for an entire region because a join key was missing. Maybe it was the analytics query that returned NULL instead of a total, silently omitting thousands of rows from a business report. In batch processing, you eventually catch these problems during validation or QA. In streaming, NULLs arrive continuously, and by the time you notice the damage, they have already corrupted hours of downstream state.
This article is about treating NULLs as a first-class engineering concern in streaming pipelines. We will cover where NULLs come from, how they behave in Flink SQL, and what patterns you can use to handle them safely.
Where NULLs Come From in Streaming Data
NULLs in a streaming pipeline do not just appear because someone forgot to fill in a field. They show up for structural reasons that are built into how streaming systems work. Understanding the sources is the first step toward handling them.
Optional and Nullable Columns
The most obvious source. Your application schema declares a column as nullable, and some percentage of records arrive without a value. This is expected behavior, but it becomes a problem when downstream consumers assume the field is always populated. A last_login timestamp that is NULL for new users who have never logged in, for example, will break any query that computes time-since-last-login without a NULL check.
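A NULL-safe version of that computation might look like the following sketch (the users table and column names are illustrative, not from a real schema):

```sql
-- Guard the date arithmetic so never-logged-in users get a sentinel
-- instead of propagating NULL through the calculation.
SELECT
  user_id,
  CASE
    WHEN last_login IS NULL THEN -1                          -- sentinel: never logged in
    ELSE TIMESTAMPDIFF(DAY, last_login, CURRENT_TIMESTAMP)   -- days since last login
  END AS days_since_last_login
FROM users;
```

Whether -1 or some other sentinel is appropriate depends on what the downstream consumer does with the value.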
Schema Mismatches Across Pipeline Stages
When your source schema evolves and a new column gets added, records that were produced before the change will not have a value for that column. If your pipeline backfills or replays historical data, those older records arrive with NULLs in the new field. This is especially common in Kafka-based pipelines where the topic retains data across schema versions.
Failed Enrichments and Lookups
A common streaming pattern is to enrich an event by joining it against a reference table or an external API. When the lookup fails (the key does not exist, the service times out, the cache is cold), the enrichment fields come back as NULL. In batch, you can retry the entire dataset. In streaming, that record has already moved downstream.
CDC Delete Events
Change data capture (CDC) pipelines produce a specific kind of NULL that trips up many engineers. When a row is deleted from the source database, the CDC connector emits a tombstone or delete event. In many CDC formats (Debezium, for instance), the after payload of a delete event is NULL because the row no longer exists. If your pipeline does not explicitly handle delete operations, it will try to process a record where every field except the key is NULL.
Late-Arriving or Out-of-Order Data
In event-time processing, a record might arrive after its window has closed. Depending on your watermark strategy, the record may be dropped or processed with incomplete context. Fields that depend on windowed state (running counts, session identifiers) can end up NULL for these late arrivals.
Flink SQL NULL Semantics: What You Need to Know
If you are writing streaming transformations in Flink SQL, you need to internalize one fact: Flink follows standard SQL three-valued logic. Every comparison involving NULL returns UNKNOWN, not TRUE or FALSE. This single rule is the root cause of most NULL-related bugs in streaming pipelines.
Three-Valued Logic in Practice
Consider a filter:
SELECT * FROM orders WHERE discount > 0
If discount is NULL for a given row, the expression NULL > 0 evaluates to UNKNOWN. The WHERE clause only passes rows where the condition is TRUE. So the row is silently dropped. It is not included in the result, and there is no error or warning.
This also applies to equality checks:
SELECT * FROM orders WHERE region = 'US'
If region is NULL, NULL = 'US' is UNKNOWN, and the row is excluded. More dangerously:
SELECT * FROM orders WHERE region <> 'US'
This also excludes the NULL row, because NULL <> 'US' is UNKNOWN too. The row is neither equal to ‘US’ nor not equal to ‘US’. It is in a third state that most application developers do not expect.
NULL in Boolean Expressions
The three-valued logic extends to AND, OR, and NOT:
- TRUE AND NULL = UNKNOWN
- FALSE OR NULL = UNKNOWN
- NOT NULL = UNKNOWN
This means that compound filters can produce surprising results. A query like WHERE status = 'active' AND region = 'US' will drop any row where either field is NULL, even if the other field matches perfectly.
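One way to make the intent explicit is to decide what each NULL should mean before comparing. A sketch (the 'missing' defaults are illustrative):

```sql
-- NULL status and NULL region now compare as deterministic non-matches
-- instead of silently evaluating to UNKNOWN.
SELECT *
FROM orders
WHERE COALESCE(status, 'missing') = 'active'
  AND COALESCE(region, 'missing') = 'US';
```

The rows are still excluded, but the query now states that decision rather than inheriting it from three-valued logic.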
IS NULL and IS NOT NULL
The only safe way to test for NULLs in Flink SQL is with the IS NULL and IS NOT NULL predicates. These return TRUE or FALSE, never UNKNOWN:
SELECT * FROM orders WHERE region IS NOT NULL AND region <> 'US'
This pattern explicitly handles the NULL case instead of leaving it to three-valued logic.
NULL-Safe Patterns for Common Operations
Knowing the semantics is half the battle. The other half is building patterns that handle NULLs correctly in every stage of your pipeline.
COALESCE for Default Values
COALESCE returns the first non-NULL argument from its list. It is the workhorse of NULL handling:
SELECT
order_id,
COALESCE(discount, 0) AS discount,
COALESCE(region, 'UNKNOWN') AS region,
COALESCE(customer_name, 'Anonymous') AS customer_name
FROM orders
Use COALESCE when you have a sensible default value and the downstream consumer cannot handle NULLs. But be deliberate about your defaults. Replacing a NULL discount with 0 is semantically different from replacing a NULL region with ‘UNKNOWN’. The first says “no discount was applied.” The second says “we do not know the region.” Make sure your default values carry the right meaning.
IFNULL as a Shorthand
Flink SQL also supports IFNULL(expr, default), which is equivalent to COALESCE with exactly two arguments. It reads more clearly when you have a single fallback:
SELECT order_id, IFNULL(shipping_cost, 0.00) AS shipping_cost FROM orders
IS NOT NULL Filters
When records with NULLs in certain fields are genuinely invalid and should be discarded, filter them explicitly:
SELECT * FROM orders WHERE customer_id IS NOT NULL
This is better than relying on a downstream join to implicitly drop them, because it makes the intent clear and you can add monitoring on the discard rate.
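A sketch of that split, routing invalid records to a hypothetical dead-letter table so the discard rate stays observable:

```sql
-- Valid records continue down the pipeline.
INSERT INTO orders_clean
SELECT * FROM orders WHERE customer_id IS NOT NULL;

-- Invalid records go to a dead-letter table instead of vanishing.
INSERT INTO orders_dead_letter
SELECT * FROM orders WHERE customer_id IS NULL;
```

The orders_clean and orders_dead_letter sinks are assumptions; the point is that both branches of the filter are written down and countable.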
CASE Expressions for Complex Logic
When the handling logic depends on context, use CASE:
SELECT
order_id,
CASE
WHEN discount IS NULL AND coupon_code IS NOT NULL THEN 0.10
WHEN discount IS NULL THEN 0
ELSE discount
END AS effective_discount
FROM orders
This lets you distinguish between “discount is missing because we have not computed it yet” and “discount is missing because there is no discount.”
Joins with NULLs: The Silent Data Killer
Joins are where NULLs do the most damage in streaming pipelines, and they do it quietly.
The Core Problem
In an inner join, both sides must match on the join key. Since NULL = NULL evaluates to UNKNOWN (not TRUE), any row with a NULL join key is dropped from the result. No error. No log message. The row just vanishes.
SELECT o.*, c.name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
If o.customer_id is NULL, this order disappears from the output. In a batch job, you might catch this in a row count reconciliation. In a streaming pipeline, the row is gone before you know it existed.
Left Outer Joins
A left outer join preserves the left-side row when there is no match, but the right-side columns are filled with NULLs. This is often the safer choice in streaming because it prevents silent data loss:
SELECT o.*, c.name
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
Now orders with NULL customer_id are preserved in the output, and c.name is NULL. You can then handle these rows downstream with COALESCE or a filter.
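Downstream, the unmatched rows can then be given explicit defaults (the 'UNKNOWN_CUSTOMER' sentinel is illustrative):

```sql
SELECT
  o.order_id,
  o.customer_id,
  COALESCE(c.name, 'UNKNOWN_CUSTOMER') AS customer_name  -- flag unmatched rows explicitly
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id;
```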
NULL-Safe Equality
Some SQL dialects offer a NULL-safe equality operator (MySQL’s <=>, for example), but Flink SQL does not have one. You can simulate it with a CASE expression or by using IS NOT DISTINCT FROM, which Flink does support in recent versions:
SELECT *
FROM table_a a
JOIN table_b b ON a.key IS NOT DISTINCT FROM b.key
Be cautious with this pattern. If both keys are NULL, the join will match, which may or may not be what you want. Usually, it is better to filter out NULL keys before the join and handle them separately.
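The filter-then-route approach might look like this sketch (table and sink names are illustrative):

```sql
-- Enrich only the rows that have a usable key...
INSERT INTO enriched
SELECT a.id, a.key, b.payload
FROM table_a a
JOIN table_b b ON a.key = b.key;

-- ...and route keyless rows to a separate table for inspection,
-- rather than letting the inner join drop them silently.
INSERT INTO unkeyed_events
SELECT a.id, a.key
FROM table_a a
WHERE a.key IS NULL;
```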
Aggregation Behavior with NULLs
Aggregation functions in Flink SQL follow standard SQL rules, but those rules have sharp edges when NULLs are involved.
How Each Function Handles NULLs
- COUNT(*) counts all rows, including those with NULL values in every column.
- COUNT(column) counts only non-NULL values in the specified column. This distinction trips people up constantly.
- SUM(column) skips NULLs. If some values are 10, 20, and NULL, the SUM is 30, not NULL. But if all values in the group are NULL, SUM returns NULL, not 0.
- AVG(column) skips NULLs in both the numerator and the denominator. If you have values 10, NULL, 30, the AVG is 20 (not 13.33).
- MIN/MAX skip NULLs. If all values are NULL, they return NULL.
The All-NULLs Trap
The most common aggregation bug is a group where every value is NULL. For example, if you are computing revenue by region and an entire region has NULL revenue values (perhaps due to a failed enrichment), the SUM returns NULL for that region. If a downstream report uses this value in a calculation, the NULL propagates and corrupts the entire computation.
Defend against this with COALESCE at the aggregation level:
SELECT
region,
COALESCE(SUM(revenue), 0) AS total_revenue,
COUNT(*) AS order_count,
COUNT(revenue) AS orders_with_revenue
FROM orders
GROUP BY region
The difference between COUNT(*) and COUNT(revenue) gives you an immediate signal of how many NULLs exist in each group.
CDC-Specific NULL Patterns
If your streaming pipeline ingests data via CDC, you face a specific set of NULL challenges that do not exist in other streaming patterns.
Delete Events
As mentioned earlier, a delete event in CDC typically has a NULL after payload. If your transformation logic blindly accesses after.column_name, it will fail or produce NULLs. The correct pattern is to check the operation type first:
SELECT
CASE
WHEN op = 'd' THEN before.customer_id
ELSE after.customer_id
END AS customer_id,
op
FROM cdc_events
In Streamkap, CDC events are normalized so that the operation type is always accessible, and the pipeline can route deletes to the appropriate handling logic rather than letting them silently produce NULLs downstream.
Optional Columns and Schema Evolution
When a source database adds a new nullable column, existing rows have NULL for that column until they are updated. In a CDC pipeline, those rows will not emit a change event just because a column was added. You will only see values for the new column on rows that are inserted or updated after the schema change. This means your streaming pipeline will have a mix of records with and without the new field for an extended period.
Streamkap’s automatic schema evolution handles the destination-side propagation of new columns, but your transformation logic still needs to account for NULLs in the new field until the source data catches up.
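Until the source catches up, the transformation can give the new field an explicit placeholder. In this sketch, loyalty_tier stands in for a hypothetical newly added nullable column:

```sql
-- Older records predate the column and arrive with NULL;
-- mark them rather than letting NULL mean two different things.
SELECT
  order_id,
  COALESCE(loyalty_tier, 'PRE_MIGRATION') AS loyalty_tier
FROM orders;
```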
Soft Deletes vs. Hard Deletes
Some applications use soft deletes (setting a deleted_at timestamp) rather than actually removing rows. In this case, the CDC event is an UPDATE, not a DELETE, and the after payload is fully populated. But now the deleted_at field transitions from NULL to a timestamp. Your downstream logic needs to distinguish between “this field is NULL because the record is active” and “this field has a value because the record was deleted.” The NULL itself carries meaning.
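A sketch that makes that meaning explicit (table and column names are illustrative):

```sql
SELECT
  customer_id,
  CASE
    WHEN deleted_at IS NULL THEN 'active'   -- NULL here means the record is live
    ELSE 'deleted'                          -- a timestamp means a soft delete
  END AS record_status
FROM customers;
```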
Monitoring NULL Rates
Detection is just as important as handling. You should track NULL rates as a data quality metric in your streaming pipeline. Here is a practical approach.
Compute NULL Percentages Per Field
SELECT
COUNT(*) AS total_records,
SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
SUM(CASE WHEN revenue IS NULL THEN 1 ELSE 0 END) AS null_revenue,
SUM(CASE WHEN region IS NULL THEN 1 ELSE 0 END) AS null_region
FROM orders
Run this as a continuous query in a tumbling window (every 5 minutes, for example) and sink the results to a monitoring dashboard. When the NULL rate for a field spikes, it is an early warning that something has changed upstream.
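With Flink's windowing table-valued functions, that continuous query could be sketched as follows (table and column names are illustrative, and event_time is assumed to be a declared time attribute):

```sql
SELECT
  window_start,
  window_end,
  COUNT(*) AS total_records,
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
  SUM(CASE WHEN revenue IS NULL THEN 1 ELSE 0 END) AS null_revenue
FROM TABLE(
  TUMBLE(TABLE orders, DESCRIPTOR(event_time), INTERVAL '5' MINUTES)
)
GROUP BY window_start, window_end;
```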
Set Alerts on Thresholds
Define acceptable NULL rates for each field based on business context. A coupon_code field might legitimately be NULL 80% of the time. A customer_id field should never be NULL. When the rate crosses the threshold, fire an alert. This catches problems like failed enrichments, schema changes, or upstream bugs before they propagate to business-critical reports.
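A HAVING clause can turn such a threshold into a query that only emits rows worth alerting on (the 0.1% threshold is illustrative):

```sql
-- Emit a result only when the customer_id NULL rate crosses the threshold.
SELECT
  COUNT(*) AS total_records,
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id
FROM orders
HAVING SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) > 0.001 * COUNT(*);
```

In practice this would be combined with the windowed version above so each window produces at most one alert row.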
In Streamkap, you can monitor pipeline health metrics and set up alerts that catch data quality regressions in real time, rather than discovering them in a weekly data audit.
Putting It All Together: A NULL Handling Policy
The patterns above are most effective when they are codified into a policy that your team agrees on before building the pipeline, not after a production incident.
A practical NULL handling policy covers four areas:
1. Classification. For every field in your schema, decide whether NULL is a valid state or a data quality defect. Document this decision.
2. Transformation rules. For fields where NULL is valid, define the default value (COALESCE) or the handling logic (CASE). For fields where NULL is a defect, define whether the record should be filtered, dead-lettered, or flagged.
3. Monitoring. Track NULL rates per field and set alerts on anomalous changes. Use the difference between COUNT(*) and COUNT(column) as a cheap, always-available signal.
4. Testing. Include NULL records in your test data. Every transformation should be tested with NULLs in every nullable field, NULLs in join keys, and all-NULL groups in aggregations.
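For ad-hoc testing, an inline VALUES source makes it easy to exercise those cases without real data (column names are illustrative):

```sql
-- Rows covering: fully populated, NULL join key, and an all-NULL measure group.
SELECT *
FROM (VALUES
  (1, 'US', 10.0),
  (2, NULL, NULL),   -- NULL join key and NULL measure
  (3, 'EU', NULL)    -- contributes to an all-NULL SUM for 'EU'
) AS t(order_id, region, revenue);
```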
NULLs are not a special case. They are a permanent feature of any data system that deals with the real world. Optional fields exist. Lookups fail. Schemas evolve. Records get deleted. The question is not whether your streaming pipeline will encounter NULLs, but whether it will handle them with intention or by accident. Build the policy early, enforce it in your transformations, monitor it in production, and NULLs become just another data type you know how to work with.