Data Quality in Streaming Pipelines: A Practical Framework
A practical framework for maintaining data quality in real-time streaming pipelines. Covers validation, schema enforcement, anomaly detection, dead letter queues, and monitoring.
Bad data in a streaming pipeline does not sit still waiting to be corrected. It propagates downstream in real time, contaminating dashboards, triggering false alerts, training ML models on incorrect examples, and - in the worst cases - causing automated systems to take wrong actions.
The discipline of data quality in streaming pipelines requires a different mindset than batch data quality. The emphasis shifts from retrospective detection to proactive prevention, continuous monitoring, and graceful degradation when prevention fails.
This guide provides a practical framework: the six quality dimensions that matter for streams, the strategies for enforcing each one, and the operational infrastructure that holds it all together.
The Six Dimensions of Streaming Data Quality
1. Accuracy
A record is accurate if its values correctly represent the real-world state they are supposed to capture. Accuracy violations include unit mismatches (a price in cents recorded as dollars), sign errors (a withdrawal recorded as a deposit), and stale reference data (a product category that was correct last month but has since changed).
Accuracy is the hardest dimension to validate automatically because it requires knowing what “correct” looks like. Strategies include:
- Cross-system reconciliation: Compare aggregates between the source system and the stream destination. If the sum of orders in PostgreSQL and the sum of order events in your data warehouse diverge, investigate.
- Business rule validation: Encode known constraints - prices must be positive, quantities must be integers, timestamps cannot be in the future - and flag violations.
- Anomaly detection: Flag records whose values fall outside historical statistical ranges.
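The reconciliation strategy above can be sketched as a windowed aggregate on the stream side, compared out-of-band against the same aggregate queried from the source database (the `order_events` table name is illustrative):

```sql
-- Hourly totals from the stream; a reconciliation job compares these
-- against SELECT SUM(amount), COUNT(*) ... GROUP BY hour in PostgreSQL
SELECT
  TUMBLE_START(event_time, INTERVAL '1' HOUR) AS window_start,
  SUM(amount) AS stream_total,
  COUNT(*) AS stream_count
FROM order_events
GROUP BY TUMBLE(event_time, INTERVAL '1' HOUR);
```

Small, persistent divergences usually indicate dropped or duplicated events; large sudden ones usually indicate a broken producer.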
2. Completeness
Completeness measures whether required fields are present. A record missing its user_id cannot be attributed. A transaction without a timestamp cannot be placed in time. An event without an amount is useless for revenue reporting.
Completeness checks are straightforward to implement and should be automated:
-- Flink SQL: route incomplete records to DLQ
INSERT INTO events_dlq
SELECT *, 'INCOMPLETE: missing required fields' AS failure_reason
FROM raw_events
WHERE user_id IS NULL
   OR event_type IS NULL
   OR event_time IS NULL;

-- Route complete records to main flow
INSERT INTO validated_events
SELECT *
FROM raw_events
WHERE user_id IS NOT NULL
  AND event_type IS NOT NULL
  AND event_time IS NOT NULL;
Track null rates per field as a continuous metric. A null rate that sits at 0.01% in steady state and suddenly jumps to 15% is a signal of an upstream producer change, not a normal data pattern.
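Per-field null rates can be computed directly in the pipeline; a sketch over 5-minute windows, using the same field names as the examples above:

```sql
-- Null rate per required field, emitted once per 5-minute window
SELECT
  TUMBLE_START(proc_time, INTERVAL '5' MINUTE) AS window_start,
  CAST(SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS DOUBLE)
    / COUNT(*) AS user_id_null_rate,
  CAST(SUM(CASE WHEN event_type IS NULL THEN 1 ELSE 0 END) AS DOUBLE)
    / COUNT(*) AS event_type_null_rate
FROM raw_events
GROUP BY TUMBLE(proc_time, INTERVAL '5' MINUTE);
```

Export these as gauges to your metrics system and alert on deviation from baseline rather than on absolute values.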
3. Consistency
Consistency means that data values agree across systems and over time. If the same user has account_tier = 'enterprise' in the CRM but account_tier = 'starter' in the event stream, there is a consistency problem. If a CDC stream shows an UPDATE to a record that was never INSERTed, the stream is internally inconsistent.
Consistency issues often arise at integration boundaries - where two systems maintain their own copies of the same information and can drift out of sync.
Referential integrity checks in Flink SQL:
-- Flag events that reference users not present in the user dimension
SELECT
  e.*,
  'CONSISTENCY: unknown user_id' AS failure_reason
FROM events e
LEFT JOIN user_dimension u ON e.user_id = u.user_id
WHERE u.user_id IS NULL;
Note that in a streaming context, this check can produce false positives: if an event arrives slightly before the corresponding user record, it will appear to reference an unknown user. Use a brief delay or a lookup cache with a short TTL to reduce false positives without masking real consistency problems.
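One way to implement the cached lookup is Flink SQL's lookup join, which resolves the dimension row at processing time through the connector's cache. This sketch assumes `user_dimension` is backed by a connector that supports lookup joins (such as JDBC) and that `proc_time` is a processing-time attribute on the events table:

```sql
-- Lookup join: resolve each event's user at processing time;
-- cache behavior (size, TTL) is configured on the connector
SELECT e.*
FROM events AS e
LEFT JOIN user_dimension FOR SYSTEM_TIME AS OF e.proc_time AS u
  ON e.user_id = u.user_id;
```

A short cache TTL keeps the dimension fresh while absorbing the brief window in which an event can beat its user record into the pipeline.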
4. Timeliness
A record is timely if it arrives within an acceptable window of when the underlying event occurred. In batch processing, timeliness is rarely a first-class concern - data arrives daily and everyone accepts the T+1 lag. In streaming, timeliness is often the core value proposition.
Timeliness has two components:
End-to-end latency: How long does it take from event occurrence to the record being available in the destination? Monitor this continuously. A sudden jump in latency often precedes a full pipeline stall.
Out-of-order arrival: Events do not always arrive in the order they occurred. A mobile app might buffer events during offline periods and flush them when connectivity is restored, producing a burst of late events. Your pipeline needs a defined policy for late events:
- Accept with watermark lag: Flink’s watermarks tolerate a configured amount of out-of-orderness, so events arriving within that delay are still assigned to their correct window.
- Route to correction queue: Events beyond the watermark can be routed to a late-events topic, which a separate job reconciles periodically.
- Discard with logging: In cases where late events truly cannot be used, discard them but log the count and the lag distribution for investigation.
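The watermark policy is declared on the source table. A sketch with an illustrative 30-second allowed out-of-orderness (the delay value and connector options are placeholders to tune for your late-event profile):

```sql
-- Events up to 30 seconds out of order are still windowed correctly;
-- anything later is treated as late per the policies above
CREATE TABLE raw_events (
  event_id STRING,
  user_id STRING,
  event_type STRING,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'raw-events',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json'
);
```

The watermark delay trades latency for completeness: a larger delay captures more stragglers but holds window results open longer.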
Measuring event-time lag in Flink SQL:
SELECT
  TUMBLE_START(proc_time, INTERVAL '1' MINUTE) AS window_start,
  AVG(TIMESTAMPDIFF(SECOND, event_time, CAST(proc_time AS TIMESTAMP(3)))) AS avg_lag_seconds,
  MAX(TIMESTAMPDIFF(SECOND, event_time, CAST(proc_time AS TIMESTAMP(3)))) AS max_lag_seconds,
  COUNT(*) AS record_count
FROM raw_events
GROUP BY TUMBLE(proc_time, INTERVAL '1' MINUTE);
5. Validity
Validity means that values conform to defined domain rules: correct format, acceptable range, expected enumeration. An email field containing "N/A" is not valid. A status field containing "REFUNDED_PENDING" when the only valid values are "PENDING", "COMPLETED", and "REFUNDED" is not valid.
Validity checks are highly domain-specific but follow a common structure:
-- Flink SQL validity checks
INSERT INTO validated_orders
SELECT *
FROM raw_orders
WHERE
  -- Range checks
  amount > 0 AND amount < 1000000
  -- Enum checks
  AND status IN ('PENDING', 'PROCESSING', 'COMPLETED', 'CANCELLED', 'REFUNDED')
  -- Format checks
  AND REGEXP(order_id, '^ORD-[0-9]{8}$')
  -- Logical consistency
  AND (completed_at IS NULL OR completed_at >= created_at);

-- Invalid records to DLQ
-- Note: a NULL in any checked field makes both predicates UNKNOWN, so
-- such records match neither statement; run completeness checks first
INSERT INTO orders_dlq
SELECT *, 'VALIDITY: failed domain rules' AS failure_reason
FROM raw_orders
WHERE NOT (
  amount > 0 AND amount < 1000000
  AND status IN ('PENDING', 'PROCESSING', 'COMPLETED', 'CANCELLED', 'REFUNDED')
  AND REGEXP(order_id, '^ORD-[0-9]{8}$')
  AND (completed_at IS NULL OR completed_at >= created_at)
);
6. Uniqueness
Uniqueness ensures that records which should be distinct are not duplicated. Duplicate events are common in streaming systems: at-least-once delivery guarantees mean a record can be delivered more than once. Network retries, application retries on failure, and Kafka consumer restarts all generate duplicates.
Deduplication strategies depend on the acceptable time window and the key:
-- Streaming deduplication with ROW_NUMBER
-- Keeps the first occurrence of each event_id; bound the retained
-- dedup state (e.g. to 24 hours) via table.exec.state.ttl
SELECT event_id, user_id, event_type, event_time
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY event_id
      ORDER BY event_time ASC
    ) AS rn
  FROM raw_events
)
WHERE rn = 1;
For long-time-horizon deduplication (days or weeks), a row-number approach accumulates unbounded state. In those cases, consider using an external key-value store (Redis, DynamoDB) to track seen keys, queried via a Flink UDF.
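A sketch of the external-store approach, assuming a hypothetical scalar UDF `is_first_seen(key, ttl_hours)` that atomically records the key in Redis or DynamoDB with a TTL and returns TRUE only on first sight:

```sql
-- is_first_seen is a hypothetical user-defined function, not a
-- Flink built-in; 168 hours gives a 7-day dedup horizon
SELECT *
FROM raw_events
WHERE is_first_seen(event_id, 168);
```

This moves state out of Flink at the cost of a network round trip per record, so batch the lookups or cache locally where throughput demands it.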
Schema Enforcement: The First Line of Defense
Schema enforcement prevents structurally malformed records from entering the pipeline at all. It is the highest-impact quality control because it operates at the point of ingestion, before any downstream system has seen the data.
Schema Registry Integration
A schema registry stores the canonical schema for each Kafka topic and enforces compatibility rules when producers attempt to register new schemas. The Confluent Schema Registry and AWS Glue Schema Registry both follow this model.
Compatibility modes:
| Mode | What It Allows | What It Prevents |
|---|---|---|
| BACKWARD | New schema can read data written with old schema | Adding required fields (without defaults), changing field types |
| FORWARD | Old schema can read data written with new schema | Removing required fields (without defaults) |
| FULL | Both backward and forward | Most schema changes |
| NONE | Any change | Nothing - dangerous for production |
For most production use cases, BACKWARD compatibility is the right choice: new consumers can read old data, enabling rolling upgrades without coordination.
Registering a schema via the Confluent Schema Registry API:
curl -X POST http://schema-registry:8081/subjects/orders-value/versions \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d '{
"schema": "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[{\"name\":\"order_id\",\"type\":\"string\"},{\"name\":\"amount\",\"type\":\"double\"},{\"name\":\"status\",\"type\":\"string\"}]}"
}'
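Compatibility mode is set per subject through the registry's config endpoint (same Confluent Schema Registry REST API as above; host and subject name match the earlier example):

curl -X PUT http://schema-registry:8081/config/orders-value \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"compatibility": "BACKWARD"}'

Subjects without an explicit setting inherit the registry's global compatibility level, so set it deliberately for production topics.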
Dead Letter Queues
A dead letter queue (DLQ) is the safety valve that makes all other quality checks safe to implement aggressively. Without a DLQ, aggressive validation either:
- Causes the pipeline to halt (unacceptable availability)
- Silently drops bad records (unacceptable for auditability)
With a DLQ, bad records are routed to a separate topic with metadata about why they failed. The main pipeline continues uninterrupted, and the DLQ contents can be investigated, corrected, and replayed.
DLQ record structure:
{
  "original_topic": "orders",
  "original_partition": 3,
  "original_offset": 18472,
  "failure_timestamp": "2026-02-25T14:32:11Z",
  "failure_reason": "VALIDITY: amount=-50.00 fails range check",
  "failure_stage": "validation",
  "original_payload": { ... }
}
Always include enough context in the DLQ record to:
- Identify the original record’s position in the source topic
- Understand why it failed
- Replay it to the original pipeline after correction
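Replay can then be a small job that reads corrected records and writes them back into the head of the pipeline, where they pass through the same validation as fresh data (the `orders_dlq_corrected` table and its columns are illustrative):

```sql
-- Re-inject corrected DLQ records into the main flow
INSERT INTO raw_orders
SELECT order_id, amount, status, created_at, completed_at
FROM orders_dlq_corrected;
```

Replaying through the front of the pipeline, rather than directly into the destination, guarantees corrected records meet the same quality bar as everything else.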
Anomaly Detection on Streams
Rule-based validity checks catch known bad patterns. Anomaly detection catches unknown bad patterns - deviations from the statistical normal that indicate something unexpected has happened.
Statistical Process Control
Track a rolling mean and standard deviation for key metrics, and alert when values exceed N standard deviations from the mean:
-- Detect anomalous transaction amounts using rolling statistics
-- Illustrative sketch: the historical_stats subquery is an unbounded
-- streaming aggregate; in production, maintain the baseline in a
-- separate, periodically refreshed table and join against that
SELECT
  window_start,
  window_end,
  current_avg,
  avg_amount,
  stddev_amount,
  CASE
    WHEN ABS(current_avg - avg_amount) > 3 * stddev_amount THEN 'ANOMALY'
    ELSE 'NORMAL'
  END AS status
FROM (
  SELECT
    TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start,
    TUMBLE_END(event_time, INTERVAL '5' MINUTE) AS window_end,
    AVG(amount) AS current_avg
  FROM transactions
  GROUP BY TUMBLE(event_time, INTERVAL '5' MINUTE)
) current_window
CROSS JOIN (
  SELECT AVG(amount) AS avg_amount, STDDEV(amount) AS stddev_amount
  FROM transactions
  WHERE event_time > NOW() - INTERVAL '7' DAY
) historical_stats;
Volume Anomalies
A sudden drop in event volume is often more alarming than a sudden spike. A drop can indicate a producer failure, a network partition, or a deployment that broke event emission.
Monitor events-per-minute continuously and alert when volume drops below a threshold for more than N consecutive minutes. The threshold should be relative to the historical baseline for that time of day (accounting for daily and weekly seasonality).
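A sketch of the relative-baseline check, assuming a `baseline_volume` table keyed on hour-of-day and day-of-week that a separate job refreshes from history (table and column names are illustrative):

```sql
-- Per-minute counts compared against the seasonal baseline;
-- the 0.5x threshold is a placeholder to tune per stream
SELECT
  w.window_start,
  w.cnt,
  b.expected_cnt,
  CASE WHEN w.cnt < 0.5 * b.expected_cnt THEN 'VOLUME_DROP' ELSE 'OK' END AS status
FROM (
  SELECT
    TUMBLE_START(proc_time, INTERVAL '1' MINUTE) AS window_start,
    COUNT(*) AS cnt
  FROM raw_events
  GROUP BY TUMBLE(proc_time, INTERVAL '1' MINUTE)
) w
JOIN baseline_volume b
  ON HOUR(w.window_start) = b.hour_of_day
  AND DAYOFWEEK(w.window_start) = b.day_of_week;
```

Requiring the condition to hold for several consecutive windows before alerting suppresses noise from momentary dips.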
Monitoring and Alerting
Data quality monitoring for streaming pipelines requires a different set of metrics than infrastructure monitoring.
Key Quality Metrics
| Metric | What It Reveals | Alert Condition |
|---|---|---|
| Null rate per field | Schema drift, producer bugs | >2x baseline in 15 min |
| DLQ ingestion rate | Validation failure rate | Any non-zero rate above baseline |
| Schema violation count | Breaking schema changes | Any non-zero in production |
| Consumer lag | Pipeline throughput issues | Sustained growth >10 min |
| Event-time lag | Late event rate | p99 lag > SLA threshold |
| Record count rate | Source availability | Drop >20% below baseline |
| Deduplication rate | Upstream retry storms | Spike above baseline |
Embedding Quality Metadata
A useful pattern is to attach quality metadata to records as they pass through validation, rather than only routing failures to a DLQ:
INSERT INTO enriched_events
SELECT
  *,
  CASE
    WHEN user_id IS NULL THEN 'WARN:null_user_id'
    WHEN amount < 0 THEN 'WARN:negative_amount'
    WHEN event_time < NOW() - INTERVAL '1' HOUR THEN 'WARN:late_event'
    ELSE 'OK'
  END AS quality_flag
FROM raw_events;
Downstream consumers can then filter on quality_flag = 'OK' for high-confidence analysis, or include WARN records with appropriate caveats for exploratory work.
Contrasting Batch and Streaming Data Quality
| Aspect | Batch | Streaming |
|---|---|---|
| When checks run | After load | Continuously, in real time |
| Impact of bad data | Corrupt a batch that can be rerun | Propagate to live consumers immediately |
| Remediation | Delete bad rows, rerun transformation | DLQ + replay, or compensating events |
| Schema evolution | Coordinated, infrequent | Continuous, must be backward-compatible |
| Latency tolerance | Hours (T+1 batch) | Seconds to minutes |
| Anomaly detection | Threshold on daily totals | Rolling window on recent data |
The core shift is that in streaming, you cannot rely on “fix it before anyone sees it.” Data has already been consumed by the time a problem is detected. This makes prevention - schema enforcement, producer-side validation, DLQ routing - more valuable than detection.
A Production Quality Checklist
Before a streaming pipeline goes live, verify:
- Schema registered in schema registry with appropriate compatibility mode
- Producer validates schema before writing
- Null rate checks configured for all required fields
- Business rule validations encoded in Flink SQL or transform layer
- DLQ topic created, accessible, and monitored
- DLQ records include enough metadata for replay
- Volume alerting configured (drop and spike)
- Consumer lag alerting configured
- End-to-end latency SLA defined and monitored
- Late event policy documented and implemented
- Deduplication strategy in place for at-least-once sources
- Runbook exists for each alert condition
Data quality in streaming pipelines is not a feature you add at the end - it is a property you design in from the beginning. Tools like Streamkap expose pipeline-level quality controls (field validation, null handling, DLQ routing) as first-class configuration, making it practical to enforce quality standards without building custom validation infrastructure.
For guidance on how transformations fit into the quality story, see Stream Data Transformation: Patterns for Shaping Data in Real Time.