Handling NULLs in Streaming Data: Strategies and Pitfalls
How to deal with NULL values in real-time streaming pipelines. Covers NULL semantics in Flink SQL, common bugs, default value strategies, and NULL-safe join patterns.
Every data engineer has a NULL horror story. Maybe it was the dashboard that showed zero revenue for an entire region because a join key was missing. Maybe it was the analytics query that returned NULL instead of a total, silently omitting thousands of rows from a business report. In batch processing, you eventually catch these problems during validation or QA. In streaming, NULLs arrive continuously, and by the time you notice the damage, they have already corrupted hours of downstream state.
This article is about treating NULLs as a first-class engineering concern in streaming pipelines. We will cover where NULLs come from, how they behave in Flink SQL, and what patterns you can use to handle them safely.
Where NULLs Come From in Streaming Data
NULLs in a streaming pipeline do not just appear because someone forgot to fill in a field. They show up for structural reasons that are built into how streaming systems work. Understanding the sources is the first step toward handling them.
Optional and Nullable Columns
The most obvious source. Your application schema declares a column as nullable, and some percentage of records arrive without a value. This is expected behavior, but it becomes a problem when downstream consumers assume the field is always populated. A last_login timestamp that is NULL for new users who have never logged in, for example, will break any query that computes time-since-last-login without a NULL check.
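A NULL-safe version of that computation might look like the following sketch (the users table and column names are illustrative, not from a real schema):

```sql
-- Guard the date arithmetic so never-logged-in users get a sentinel
-- instead of propagating NULL through the calculation.
SELECT
  user_id,
  CASE
    WHEN last_login IS NULL THEN -1                          -- sentinel: never logged in
    ELSE TIMESTAMPDIFF(DAY, last_login, CURRENT_TIMESTAMP)   -- days since last login
  END AS days_since_last_login
FROM users;
```

Whether -1 or some other sentinel is appropriate depends on what the downstream consumer does with the value.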
Schema Mismatches Across Pipeline Stages
When your source schema evolves and a new column gets added, records that were produced before the change will not have a value for that column. If your pipeline backfills or replays historical data, those older records arrive with NULLs in the new field. This is especially common in Kafka-based pipelines where the topic retains data across schema versions.
Failed Enrichments and Lookups
A common streaming pattern is to enrich an event by joining it against a reference table or an external API. When the lookup fails (the key does not exist, the service times out, the cache is cold), the enrichment fields come back as NULL. In batch, you can retry the entire dataset. In streaming, that record has already moved downstream.
CDC Delete Events
Change data capture (CDC) pipelines produce a specific kind of NULL that trips up many engineers. When a row is deleted from the source database, the CDC connector emits a tombstone or delete event. In many CDC formats (Debezium, for instance), the after payload of a delete event is NULL because the row no longer exists. If your pipeline does not explicitly handle delete operations, it will try to process a record where every field except the key is NULL.
Late-Arriving or Out-of-Order Data
In event-time processing, a record might arrive after its window has closed. Depending on your watermark strategy, the record may be dropped or processed with incomplete context. Fields that depend on windowed state (running counts, session identifiers) can end up NULL for these late arrivals.
Flink SQL NULL Semantics: What You Need to Know
If you are writing streaming transformations in Flink SQL, you need to internalize one fact: Flink follows standard SQL three-valued logic. Every comparison involving NULL returns UNKNOWN, not TRUE or FALSE. This single rule is the root cause of most NULL-related bugs in streaming pipelines.
Three-Valued Logic in Practice
Consider a filter:
SELECT * FROM orders WHERE discount > 0
If discount is NULL for a given row, the expression NULL > 0 evaluates to UNKNOWN. The WHERE clause only passes rows where the condition is TRUE. So the row is silently dropped. It is not included in the result, and there is no error or warning.
This also applies to equality checks:
SELECT * FROM orders WHERE region = 'US'
If region is NULL, NULL = 'US' is UNKNOWN, and the row is excluded. More dangerously:
SELECT * FROM orders WHERE region <> 'US'
This also excludes the NULL row, because NULL <> 'US' is UNKNOWN too. The row is neither equal to ‘US’ nor not equal to ‘US’. It is in a third state that most application developers do not expect.
NULL in Boolean Expressions
The three-valued logic extends to AND, OR, and NOT:
- TRUE AND NULL = UNKNOWN
- FALSE OR NULL = UNKNOWN
- NOT NULL = UNKNOWN
This means that compound filters can produce surprising results. A query like WHERE status = 'active' AND region = 'US' will drop any row where either field is NULL, even if the other field matches perfectly.
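One way to make the intent explicit is to decide what each NULL should mean before comparing. A sketch (the 'missing' defaults are illustrative):

```sql
-- NULL status and NULL region now compare as deterministic non-matches
-- instead of silently evaluating to UNKNOWN.
SELECT *
FROM orders
WHERE COALESCE(status, 'missing') = 'active'
  AND COALESCE(region, 'missing') = 'US';
```

The rows are still excluded, but the query now states that decision rather than inheriting it from three-valued logic.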
IS NULL and IS NOT NULL
The only safe way to test for NULLs in Flink SQL is with the IS NULL and IS NOT NULL predicates. These return TRUE or FALSE, never UNKNOWN:
SELECT * FROM orders WHERE region IS NOT NULL AND region <> 'US'
This pattern explicitly handles the NULL case instead of leaving it to three-valued logic.
NULL-Safe Patterns for Common Operations
Knowing the semantics is half the battle. The other half is building patterns that handle NULLs correctly in every stage of your pipeline.
COALESCE for Default Values
COALESCE returns the first non-NULL argument from its list. It is the workhorse of NULL handling:
SELECT
order_id,
COALESCE(discount, 0) AS discount,
COALESCE(region, 'UNKNOWN') AS region,
COALESCE(customer_name, 'Anonymous') AS customer_name
FROM orders
Use COALESCE when you have a sensible default value and the downstream consumer cannot handle NULLs. But be deliberate about your defaults. Replacing a NULL discount with 0 is semantically different from replacing a NULL region with ‘UNKNOWN’. The first says “no discount was applied.” The second says “we do not know the region.” Make sure your default values carry the right meaning.
IFNULL as a Shorthand
Flink SQL also supports IFNULL(expr, default), which is equivalent to COALESCE with exactly two arguments. It reads more clearly when you have a single fallback:
SELECT order_id, IFNULL(shipping_cost, 0.00) AS shipping_cost FROM orders
IS NOT NULL Filters
When records with NULLs in certain fields are genuinely invalid and should be discarded, filter them explicitly:
SELECT * FROM orders WHERE customer_id IS NOT NULL
This is better than relying on a downstream join to implicitly drop them, because it makes the intent clear and you can add monitoring on the discard rate.
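A sketch of that split, routing invalid records to a hypothetical dead-letter table so the discard rate stays observable:

```sql
-- Valid records continue down the pipeline.
INSERT INTO orders_clean
SELECT * FROM orders WHERE customer_id IS NOT NULL;

-- Invalid records go to a dead-letter table instead of vanishing.
INSERT INTO orders_dead_letter
SELECT * FROM orders WHERE customer_id IS NULL;
```

The orders_clean and orders_dead_letter sinks are assumptions; the point is that both branches of the filter are written down and countable.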
CASE Expressions for Complex Logic
When the handling logic depends on context, use CASE:
SELECT
order_id,
CASE
WHEN discount IS NULL AND coupon_code IS NOT NULL THEN 0.10
WHEN discount IS NULL THEN 0
ELSE discount
END AS effective_discount
FROM orders
This lets you distinguish between “discount is missing because we have not computed it yet” and “discount is missing because there is no discount.”
Joins with NULLs: The Silent Data Killer
Joins are where NULLs do the most damage in streaming pipelines, and they do it quietly.
The Core Problem
In an inner join, both sides must match on the join key. Since NULL = NULL evaluates to UNKNOWN (not TRUE), any row with a NULL join key is dropped from the result. No error. No log message. The row just vanishes.
SELECT o.*, c.name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
If o.customer_id is NULL, this order disappears from the output. In a batch job, you might catch this in a row count reconciliation. In a streaming pipeline, the row is gone before you know it existed.
Left Outer Joins
A left outer join preserves the left-side row when there is no match, but the right-side columns are filled with NULLs. This is often the safer choice in streaming because it prevents silent data loss:
SELECT o.*, c.name
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
Now orders with NULL customer_id are preserved in the output, and c.name is NULL. You can then handle these rows downstream with COALESCE or a filter.
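Downstream, the unmatched rows can then be given explicit defaults (the 'UNKNOWN_CUSTOMER' sentinel is illustrative):

```sql
SELECT
  o.order_id,
  o.customer_id,
  COALESCE(c.name, 'UNKNOWN_CUSTOMER') AS customer_name  -- flag unmatched rows explicitly
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id;
```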
NULL-Safe Equality
Some SQL dialects offer a NULL-safe equality operator (MySQL’s <=>, for example), but Flink SQL does not have one. You can simulate it with a CASE expression or by using IS NOT DISTINCT FROM, which Flink does support in recent versions:
SELECT *
FROM table_a a
JOIN table_b b ON a.key IS NOT DISTINCT FROM b.key
Be cautious with this pattern. If both keys are NULL, the join will match, which may or may not be what you want. Usually, it is better to filter out NULL keys before the join and handle them separately.
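The filter-then-route approach might look like this sketch (table and sink names are illustrative):

```sql
-- Enrich only the rows that have a usable key...
INSERT INTO enriched
SELECT a.id, a.key, b.payload
FROM table_a a
JOIN table_b b ON a.key = b.key;

-- ...and route keyless rows to a separate table for inspection,
-- rather than letting the inner join drop them silently.
INSERT INTO unkeyed_events
SELECT a.id, a.key
FROM table_a a
WHERE a.key IS NULL;
```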
Aggregation Behavior with NULLs
Aggregation functions in Flink SQL follow standard SQL rules, but those rules have sharp edges when NULLs are involved.
How Each Function Handles NULLs
- COUNT(*) counts all rows, including those with NULL values in every column.
- COUNT(column) counts only non-NULL values in the specified column. This distinction trips people up constantly.
- SUM(column) skips NULLs. If some values are 10, 20, and NULL, the SUM is 30, not NULL. But if all values in the group are NULL, SUM returns NULL, not 0.
- AVG(column) skips NULLs in both the numerator and the denominator. If you have values 10, NULL, 30, the AVG is 20 (not 13.33).
- MIN/MAX skip NULLs. If all values are NULL, they return NULL.
The All-NULLs Trap
The most common aggregation bug is a group where every value is NULL. For example, if you are computing revenue by region and an entire region has NULL revenue values (perhaps due to a failed enrichment), the SUM returns NULL for that region. If a downstream report uses this value in a calculation, the NULL propagates and corrupts the entire computation.
Defend against this with COALESCE at the aggregation level:
SELECT
region,
COALESCE(SUM(revenue), 0) AS total_revenue,
COUNT(*) AS order_count,
COUNT(revenue) AS orders_with_revenue
FROM orders
GROUP BY region
The difference between COUNT(*) and COUNT(revenue) gives you an immediate signal of how many NULLs exist in each group.
CDC-Specific NULL Patterns
If your streaming pipeline ingests data via CDC, you face a specific set of NULL challenges that do not exist in other streaming patterns.
Delete Events
As mentioned earlier, a delete event in CDC typically has a NULL after payload. If your transformation logic blindly accesses after.column_name, it will fail or produce NULLs. The correct pattern is to check the operation type first:
SELECT
CASE
WHEN op = 'd' THEN before.customer_id
ELSE after.customer_id
END AS customer_id,
op
FROM cdc_events
In Streamkap, CDC events are normalized so that the operation type is always accessible, and the pipeline can route deletes to the appropriate handling logic rather than letting them silently produce NULLs downstream.
Optional Columns and Schema Evolution
When a source database adds a new nullable column, existing rows have NULL for that column until they are updated. In a CDC pipeline, those rows will not emit a change event just because a column was added. You will only see values for the new column on rows that are inserted or updated after the schema change. This means your streaming pipeline will have a mix of records with and without the new field for an extended period.
Streamkap’s automatic schema evolution handles the destination-side propagation of new columns, but your transformation logic still needs to account for NULLs in the new field until the source data catches up.
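Until the source catches up, the transformation can give the new field an explicit placeholder. In this sketch, loyalty_tier stands in for a hypothetical newly added nullable column:

```sql
-- Older records predate the column and arrive with NULL;
-- mark them rather than letting NULL mean two different things.
SELECT
  order_id,
  COALESCE(loyalty_tier, 'PRE_MIGRATION') AS loyalty_tier
FROM orders;
```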
Soft Deletes vs. Hard Deletes
Some applications use soft deletes (setting a deleted_at timestamp) rather than actually removing rows. In this case, the CDC event is an UPDATE, not a DELETE, and the after payload is fully populated. But now the deleted_at field transitions from NULL to a timestamp. Your downstream logic needs to distinguish between “this field is NULL because the record is active” and “this field has a value because the record was deleted.” The NULL itself carries meaning.
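A sketch that makes that meaning explicit (table and column names are illustrative):

```sql
SELECT
  customer_id,
  CASE
    WHEN deleted_at IS NULL THEN 'active'   -- NULL here means the record is live
    ELSE 'deleted'                          -- a timestamp means a soft delete
  END AS record_status
FROM customers;
```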
Monitoring NULL Rates
Detection is just as important as handling. You should track NULL rates as a data quality metric in your streaming pipeline. Here is a practical approach.
Compute NULL Percentages Per Field
SELECT
COUNT(*) AS total_records,
SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
SUM(CASE WHEN revenue IS NULL THEN 1 ELSE 0 END) AS null_revenue,
SUM(CASE WHEN region IS NULL THEN 1 ELSE 0 END) AS null_region
FROM orders
Run this as a continuous query in a tumbling window (every 5 minutes, for example) and sink the results to a monitoring dashboard. When the NULL rate for a field spikes, it is an early warning that something has changed upstream.
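With Flink's windowing table-valued functions, that continuous query could be sketched as follows (table and column names are illustrative, and event_time is assumed to be a declared time attribute):

```sql
SELECT
  window_start,
  window_end,
  COUNT(*) AS total_records,
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
  SUM(CASE WHEN revenue IS NULL THEN 1 ELSE 0 END) AS null_revenue
FROM TABLE(
  TUMBLE(TABLE orders, DESCRIPTOR(event_time), INTERVAL '5' MINUTES)
)
GROUP BY window_start, window_end;
```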
Set Alerts on Thresholds
Define acceptable NULL rates for each field based on business context. A coupon_code field might legitimately be NULL 80% of the time. A customer_id field should never be NULL. When the rate crosses the threshold, fire an alert. This catches problems like failed enrichments, schema changes, or upstream bugs before they propagate to business-critical reports.
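A HAVING clause can turn such a threshold into a query that only emits rows worth alerting on (the 0.1% threshold is illustrative):

```sql
-- Emit a result only when the customer_id NULL rate crosses the threshold.
SELECT
  COUNT(*) AS total_records,
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id
FROM orders
HAVING SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) > 0.001 * COUNT(*);
```

In practice this would be combined with the windowed version above so each window produces at most one alert row.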
In Streamkap, you can monitor pipeline health metrics and set up alerts that catch data quality regressions in real time, rather than discovering them in a weekly data audit.
Putting It All Together: A NULL Handling Policy
The patterns above are most effective when they are codified into a policy that your team agrees on before building the pipeline, not after a production incident.
A practical NULL handling policy covers four areas:
1. Classification. For every field in your schema, decide whether NULL is a valid state or a data quality defect. Document this decision.
2. Transformation rules. For fields where NULL is valid, define the default value (COALESCE) or the handling logic (CASE). For fields where NULL is a defect, define whether the record should be filtered, dead-lettered, or flagged.
3. Monitoring. Track NULL rates per field and set alerts on anomalous changes. Use the difference between COUNT(*) and COUNT(column) as a cheap, always-available signal.
4. Testing. Include NULL records in your test data. Every transformation should be tested with NULLs in every nullable field, NULLs in join keys, and all-NULL groups in aggregations.
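For ad-hoc testing, an inline VALUES source makes it easy to exercise those cases without real data (column names are illustrative):

```sql
-- Rows covering: fully populated, NULL join key, and an all-NULL measure group.
SELECT *
FROM (VALUES
  (1, 'US', 10.0),
  (2, NULL, NULL),   -- NULL join key and NULL measure
  (3, 'EU', NULL)    -- contributes to an all-NULL SUM for 'EU'
) AS t(order_id, region, revenue);
```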
NULLs are not a special case. They are a permanent feature of any data system that deals with the real world. Optional fields exist. Lookups fail. Schemas evolve. Records get deleted. The question is not whether your streaming pipeline will encounter NULLs, but whether it will handle them with intention or by accident. Build the policy early, enforce it in your transformations, monitor it in production, and NULLs become just another data type you know how to work with.