Data Quality in Streaming Pipelines: A Practical Framework
A practical framework for maintaining data quality in real-time streaming pipelines. Covers validation, schema enforcement, anomaly detection, dead letter queues, and monitoring.
Bad data in a streaming pipeline does not sit still waiting to be corrected. It propagates downstream in real time, contaminating dashboards, triggering false alerts, training ML models on incorrect examples, and - in the worst cases - causing automated systems to take wrong actions.
The discipline of data quality in streaming pipelines requires a different mindset than batch data quality. The emphasis shifts from retrospective detection to proactive prevention, continuous monitoring, and graceful degradation when prevention fails.
This guide provides a practical framework: the six quality dimensions that matter for streams, the strategies for enforcing each one, and the operational infrastructure that holds it all together.
The Six Dimensions of Streaming Data Quality
1. Accuracy
A record is accurate if its values correctly represent the real-world state they are supposed to capture. Accuracy violations include unit mismatches (a price in cents recorded as dollars), sign errors (a withdrawal recorded as a deposit), and stale reference data (a product category that was correct last month but has since changed).
Accuracy is the hardest dimension to validate automatically because it requires knowing what “correct” looks like. Strategies include:
- Cross-system reconciliation: Compare aggregates between the source system and the stream destination. If the sum of orders in PostgreSQL and the sum of order events in your data warehouse diverge, investigate.
- Business rule validation: Encode known constraints - prices must be positive, quantities must be integers, timestamps cannot be in the future - and flag violations.
- Anomaly detection: Flag records whose values fall outside historical statistical ranges.
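The reconciliation strategy above can be sketched as a windowed aggregate on the stream side, compared out-of-band against the same aggregate queried from the source database (the `order_events` table name is illustrative):

```sql
-- Hourly totals from the stream; a reconciliation job compares these
-- against SELECT SUM(amount), COUNT(*) ... GROUP BY hour in PostgreSQL
SELECT
  TUMBLE_START(event_time, INTERVAL '1' HOUR) AS window_start,
  SUM(amount) AS stream_total,
  COUNT(*) AS stream_count
FROM order_events
GROUP BY TUMBLE(event_time, INTERVAL '1' HOUR);
```

Small, persistent divergences usually indicate dropped or duplicated events; large sudden ones usually indicate a broken producer.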
2. Completeness
Completeness measures whether required fields are present. A record missing its user_id cannot be attributed. A transaction without a timestamp cannot be placed in time. An event without an amount is useless for revenue reporting.
Completeness checks are straightforward to implement and should be automated:
-- Flink SQL: route incomplete records to DLQ
INSERT INTO events_dlq
SELECT *, 'INCOMPLETE: missing required fields' AS failure_reason
FROM raw_events
WHERE user_id IS NULL
   OR event_type IS NULL
   OR event_time IS NULL;

-- Route complete records to main flow
INSERT INTO validated_events
SELECT *
FROM raw_events
WHERE user_id IS NOT NULL
  AND event_type IS NOT NULL
  AND event_time IS NOT NULL;
Track null rates per field as a continuous metric. A null rate that sits at 0.01% in steady state and suddenly jumps to 15% is a signal of an upstream producer change, not a normal data pattern.
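Per-field null rates can be computed directly in the pipeline; a sketch over 5-minute windows, using the same field names as the examples above:

```sql
-- Null rate per required field, emitted once per 5-minute window
SELECT
  TUMBLE_START(proc_time, INTERVAL '5' MINUTE) AS window_start,
  CAST(SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS DOUBLE)
    / COUNT(*) AS user_id_null_rate,
  CAST(SUM(CASE WHEN event_type IS NULL THEN 1 ELSE 0 END) AS DOUBLE)
    / COUNT(*) AS event_type_null_rate
FROM raw_events
GROUP BY TUMBLE(proc_time, INTERVAL '5' MINUTE);
```

Export these as gauges to your metrics system and alert on deviation from baseline rather than on absolute values.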
3. Consistency
Consistency means that data values agree across systems and over time. If the same user has account_tier = 'enterprise' in the CRM but account_tier = 'starter' in the event stream, there is a consistency problem. If a CDC stream shows an UPDATE to a record that was never INSERTed, the stream is internally inconsistent.
Consistency issues often arise at integration boundaries - where two systems maintain their own copies of the same information and can drift out of sync.
Referential integrity checks in Flink SQL:
-- Flag events that reference users not present in the user dimension
SELECT
  e.*,
  'CONSISTENCY: unknown user_id' AS failure_reason
FROM events e
LEFT JOIN user_dimension u ON e.user_id = u.user_id
WHERE u.user_id IS NULL;
Note that in a streaming context, this check can produce false positives: if an event arrives slightly before the corresponding user record, it will appear to reference an unknown user. Use a brief delay or a lookup cache with a short TTL to reduce false positives without masking real consistency problems.
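One way to implement the cached lookup is Flink SQL's lookup join, which resolves the dimension row at processing time through the connector's cache. This sketch assumes `user_dimension` is backed by a connector that supports lookup joins (such as JDBC) and that `proc_time` is a processing-time attribute on the events table:

```sql
-- Lookup join: resolve each event's user at processing time;
-- cache behavior (size, TTL) is configured on the connector
SELECT e.*
FROM events AS e
LEFT JOIN user_dimension FOR SYSTEM_TIME AS OF e.proc_time AS u
  ON e.user_id = u.user_id;
```

A short cache TTL keeps the dimension fresh while absorbing the brief window in which an event can beat its user record into the pipeline.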
4. Timeliness
A record is timely if it arrives within an acceptable window of when the underlying event occurred. In batch processing, timeliness is rarely a first-class concern - data arrives daily and everyone accepts the T+1 lag. In streaming, timeliness is often the core value proposition.
Timeliness has two components:
End-to-end latency: How long does it take from event occurrence to the record being available in the destination? Monitor this continuously. A sudden jump in latency often precedes a full pipeline stall.
Out-of-order arrival: Events do not always arrive in the order they occurred. A mobile app might buffer events during offline periods and flush them when connectivity is restored, producing a burst of late events. Your pipeline needs a defined policy for late events:
- Accept with watermark lag: Flink’s watermarks tolerate a configured amount of out-of-orderness, so events arriving within that delay are still assigned to their correct window.
- Route to correction queue: Events beyond the watermark can be routed to a late-events topic, which a separate job reconciles periodically.
- Discard with logging: In cases where late events truly cannot be used, discard them but log the count and the lag distribution for investigation.
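The watermark policy is declared on the source table. A sketch with an illustrative 30-second allowed out-of-orderness (the delay value and connector options are placeholders to tune for your late-event profile):

```sql
-- Events up to 30 seconds out of order are still windowed correctly;
-- anything later is treated as late per the policies above
CREATE TABLE raw_events (
  event_id STRING,
  user_id STRING,
  event_type STRING,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'raw-events',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json'
);
```

The watermark delay trades latency for completeness: a larger delay captures more stragglers but holds window results open longer.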
Measuring event-time lag in Flink SQL:
SELECT
  TUMBLE_START(proc_time, INTERVAL '1' MINUTE) AS window_start,
  AVG(TIMESTAMPDIFF(SECOND, event_time, CAST(proc_time AS TIMESTAMP(3)))) AS avg_lag_seconds,
  MAX(TIMESTAMPDIFF(SECOND, event_time, CAST(proc_time AS TIMESTAMP(3)))) AS max_lag_seconds,
  COUNT(*) AS record_count
FROM raw_events
GROUP BY TUMBLE(proc_time, INTERVAL '1' MINUTE);
5. Validity
Validity means that values conform to defined domain rules: correct format, acceptable range, expected enumeration. An email field containing "N/A" is not valid. A status field containing "REFUNDED_PENDING" when the only valid values are "PENDING", "COMPLETED", and "REFUNDED" is not valid.
Validity checks are highly domain-specific but follow a common structure:
-- Flink SQL validity checks
INSERT INTO validated_orders
SELECT *
FROM raw_orders
WHERE
  -- Range checks
  amount > 0 AND amount < 1000000
  -- Enum checks
  AND status IN ('PENDING', 'PROCESSING', 'COMPLETED', 'CANCELLED', 'REFUNDED')
  -- Format checks
  AND REGEXP(order_id, '^ORD-[0-9]{8}$')
  -- Logical consistency
  AND (completed_at IS NULL OR completed_at >= created_at);

-- Invalid records to DLQ
-- Note: a NULL in any checked field makes both predicates UNKNOWN, so
-- such records match neither statement; run completeness checks first
INSERT INTO orders_dlq
SELECT *, 'VALIDITY: failed domain rules' AS failure_reason
FROM raw_orders
WHERE NOT (
  amount > 0 AND amount < 1000000
  AND status IN ('PENDING', 'PROCESSING', 'COMPLETED', 'CANCELLED', 'REFUNDED')
  AND REGEXP(order_id, '^ORD-[0-9]{8}$')
  AND (completed_at IS NULL OR completed_at >= created_at)
);
6. Uniqueness
Uniqueness ensures that records which should be distinct are not duplicated. Duplicate events are common in streaming systems: at-least-once delivery guarantees mean a record can be delivered more than once. Network retries, application retries on failure, and Kafka consumer restarts all generate duplicates.
Deduplication strategies depend on the acceptable time window and the key:
-- Streaming deduplication with ROW_NUMBER
-- Keeps the first occurrence of each event_id; bound the retained
-- dedup state (e.g. to 24 hours) via table.exec.state.ttl
SELECT event_id, user_id, event_type, event_time
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY event_id
      ORDER BY event_time ASC
    ) AS rn
  FROM raw_events
)
WHERE rn = 1;
For long-time-horizon deduplication (days or weeks), a row-number approach accumulates unbounded state. In those cases, consider using an external key-value store (Redis, DynamoDB) to track seen keys, queried via a Flink UDF.
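A sketch of the external-store approach, assuming a hypothetical scalar UDF `is_first_seen(key, ttl_hours)` that atomically records the key in Redis or DynamoDB with a TTL and returns TRUE only on first sight:

```sql
-- is_first_seen is a hypothetical user-defined function, not a
-- Flink built-in; 168 hours gives a 7-day dedup horizon
SELECT *
FROM raw_events
WHERE is_first_seen(event_id, 168);
```

This moves state out of Flink at the cost of a network round trip per record, so batch the lookups or cache locally where throughput demands it.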
Schema Enforcement: The First Line of Defense
Schema enforcement prevents structurally malformed records from entering the pipeline at all. It is the highest-impact quality control because it operates at the point of ingestion, before any downstream system has seen the data.
Schema Registry Integration
A schema registry stores the canonical schema for each Kafka topic and enforces compatibility rules when producers attempt to register new schemas. The Confluent Schema Registry and AWS Glue Schema Registry both follow this model.
Compatibility modes:
| Mode | What It Allows | What It Prevents |
|---|---|---|
| BACKWARD | New schema can read data written with old schema | Adding required fields (without defaults), changing field types |
| FORWARD | Old schema can read data written with new schema | Removing required fields (without defaults) |
| FULL | Both backward and forward | Most schema changes |
| NONE | Any change | Nothing - dangerous for production |
For most production use cases, BACKWARD compatibility is the right choice: new consumers can read old data, enabling rolling upgrades without coordination.
Registering a schema via the Confluent Schema Registry API:
curl -X POST http://schema-registry:8081/subjects/orders-value/versions \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d '{
"schema": "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[{\"name\":\"order_id\",\"type\":\"string\"},{\"name\":\"amount\",\"type\":\"double\"},{\"name\":\"status\",\"type\":\"string\"}]}"
}'
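Compatibility mode is set per subject through the registry's config endpoint (same Confluent Schema Registry REST API as above; host and subject name match the earlier example):

curl -X PUT http://schema-registry:8081/config/orders-value \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"compatibility": "BACKWARD"}'

Subjects without an explicit setting inherit the registry's global compatibility level, so set it deliberately for production topics.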
Dead Letter Queues
A dead letter queue (DLQ) is the safety valve that makes all other quality checks safe to implement aggressively. Without a DLQ, aggressive validation either:
- Causes the pipeline to halt (unacceptable availability)
- Silently drops bad records (unacceptable for auditability)
With a DLQ, bad records are routed to a separate topic with metadata about why they failed. The main pipeline continues uninterrupted, and the DLQ contents can be investigated, corrected, and replayed.
DLQ record structure:
{
  "original_topic": "orders",
  "original_partition": 3,
  "original_offset": 18472,
  "failure_timestamp": "2026-02-25T14:32:11Z",
  "failure_reason": "VALIDITY: amount=-50.00 fails range check",
  "failure_stage": "validation",
  "original_payload": { ... }
}
Always include enough context in the DLQ record to:
- Identify the original record’s position in the source topic
- Understand why it failed
- Replay it to the original pipeline after correction
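Replay can then be a small job that reads corrected records and writes them back into the head of the pipeline, where they pass through the same validation as fresh data (the `orders_dlq_corrected` table and its columns are illustrative):

```sql
-- Re-inject corrected DLQ records into the main flow
INSERT INTO raw_orders
SELECT order_id, amount, status, created_at, completed_at
FROM orders_dlq_corrected;
```

Replaying through the front of the pipeline, rather than directly into the destination, guarantees corrected records meet the same quality bar as everything else.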
Anomaly Detection on Streams
Rule-based validity checks catch known bad patterns. Anomaly detection catches unknown bad patterns - deviations from the statistical normal that indicate something unexpected has happened.
Statistical Process Control
Track a rolling mean and standard deviation for key metrics, and alert when values exceed N standard deviations from the mean:
-- Detect anomalous transaction amounts using rolling statistics
-- Illustrative sketch: the historical_stats subquery is an unbounded
-- streaming aggregate; in production, maintain the baseline in a
-- separate, periodically refreshed table and join against that
SELECT
  window_start,
  window_end,
  current_avg,
  avg_amount,
  stddev_amount,
  CASE
    WHEN ABS(current_avg - avg_amount) > 3 * stddev_amount THEN 'ANOMALY'
    ELSE 'NORMAL'
  END AS status
FROM (
  SELECT
    TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start,
    TUMBLE_END(event_time, INTERVAL '5' MINUTE) AS window_end,
    AVG(amount) AS current_avg
  FROM transactions
  GROUP BY TUMBLE(event_time, INTERVAL '5' MINUTE)
) current_window
CROSS JOIN (
  SELECT AVG(amount) AS avg_amount, STDDEV(amount) AS stddev_amount
  FROM transactions
  WHERE event_time > NOW() - INTERVAL '7' DAY
) historical_stats;
Volume Anomalies
A sudden drop in event volume is often more alarming than a sudden spike. A drop can indicate a producer failure, a network partition, or a deployment that broke event emission.
Monitor events-per-minute continuously and alert when volume drops below a threshold for more than N consecutive minutes. The threshold should be relative to the historical baseline for that time of day (accounting for daily and weekly seasonality).
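A sketch of the relative-baseline check, assuming a `baseline_volume` table keyed on hour-of-day and day-of-week that a separate job refreshes from history (table and column names are illustrative):

```sql
-- Per-minute counts compared against the seasonal baseline;
-- the 0.5x threshold is a placeholder to tune per stream
SELECT
  w.window_start,
  w.cnt,
  b.expected_cnt,
  CASE WHEN w.cnt < 0.5 * b.expected_cnt THEN 'VOLUME_DROP' ELSE 'OK' END AS status
FROM (
  SELECT
    TUMBLE_START(proc_time, INTERVAL '1' MINUTE) AS window_start,
    COUNT(*) AS cnt
  FROM raw_events
  GROUP BY TUMBLE(proc_time, INTERVAL '1' MINUTE)
) w
JOIN baseline_volume b
  ON HOUR(w.window_start) = b.hour_of_day
  AND DAYOFWEEK(w.window_start) = b.day_of_week;
```

Requiring the condition to hold for several consecutive windows before alerting suppresses noise from momentary dips.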
Monitoring and Alerting
Data quality monitoring for streaming pipelines requires a different set of metrics than infrastructure monitoring.
Key Quality Metrics
| Metric | What It Reveals | Alert Condition |
|---|---|---|
| Null rate per field | Schema drift, producer bugs | >2x baseline in 15 min |
| DLQ ingestion rate | Validation failure rate | Any non-zero rate above baseline |
| Schema violation count | Breaking schema changes | Any non-zero in production |
| Consumer lag | Pipeline throughput issues | Sustained growth >10 min |
| Event-time lag | Late event rate | p99 lag > SLA threshold |
| Record count rate | Source availability | Drop >20% below baseline |
| Deduplication rate | Upstream retry storms | Spike above baseline |
Embedding Quality Metadata
A useful pattern is to attach quality metadata to records as they pass through validation, rather than only routing failures to a DLQ:
INSERT INTO enriched_events
SELECT
  *,
  CASE
    WHEN user_id IS NULL THEN 'WARN:null_user_id'
    WHEN amount < 0 THEN 'WARN:negative_amount'
    WHEN event_time < NOW() - INTERVAL '1' HOUR THEN 'WARN:late_event'
    ELSE 'OK'
  END AS quality_flag
FROM raw_events;
Downstream consumers can then filter on quality_flag = 'OK' for high-confidence analysis, or include WARN records with appropriate caveats for exploratory work.
Contrasting Batch and Streaming Data Quality
| Aspect | Batch | Streaming |
|---|---|---|
| When checks run | After load | Continuously, in real time |
| Impact of bad data | Corrupt a batch that can be rerun | Propagate to live consumers immediately |
| Remediation | Delete bad rows, rerun transformation | DLQ + replay, or compensating events |
| Schema evolution | Coordinated, infrequent | Continuous, must be backward-compatible |
| Latency tolerance | Hours (T+1 batch) | Seconds to minutes |
| Anomaly detection | Threshold on daily totals | Rolling window on recent data |
The core shift is that in streaming, you cannot rely on “fix it before anyone sees it.” Data has already been consumed by the time a problem is detected. This makes prevention - schema enforcement, producer-side validation, DLQ routing - more valuable than detection.
A Production Quality Checklist
Before a streaming pipeline goes live, verify:
- Schema registered in schema registry with appropriate compatibility mode
- Producer validates schema before writing
- Null rate checks configured for all required fields
- Business rule validations encoded in Flink SQL or transform layer
- DLQ topic created, accessible, and monitored
- DLQ records include enough metadata for replay
- Volume alerting configured (drop and spike)
- Consumer lag alerting configured
- End-to-end latency SLA defined and monitored
- Late event policy documented and implemented
- Deduplication strategy in place for at-least-once sources
- Runbook exists for each alert condition
Data quality in streaming pipelines is not a feature you add at the end - it is a property you design in from the beginning. Tools like Streamkap expose pipeline-level quality controls (field validation, null handling, DLQ routing) as first-class configuration, making it practical to enforce quality standards without building custom validation infrastructure.
For guidance on how transformations fit into the quality story, see Stream Data Transformation: Patterns for Shaping Data in Real Time.