
Engineering

February 25, 2026


Data Quality in Streaming Pipelines: A Practical Framework

A practical framework for maintaining data quality in real-time streaming pipelines. Covers validation, schema enforcement, anomaly detection, dead letter queues, and monitoring.

TL;DR:

  • Data quality in streaming pipelines spans six dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness.
  • Validation must happen continuously and in real time - batch checks catch problems too late to prevent downstream impact.
  • Schema enforcement at the broker layer (via a schema registry) is the first and most scalable line of defense.
  • Dead letter queues are essential infrastructure: every pipeline needs a safe place to route bad records without blocking the main flow.

Bad data in a streaming pipeline does not sit still waiting to be corrected. It propagates downstream in real time, contaminating dashboards, triggering false alerts, training ML models on incorrect examples, and - in the worst cases - causing automated systems to take wrong actions.

The discipline of data quality in streaming pipelines requires a different mindset than batch data quality. The emphasis shifts from retrospective detection to proactive prevention, continuous monitoring, and graceful degradation when prevention fails.

This guide provides a practical framework: the six quality dimensions that matter for streams, the strategies for enforcing each one, and the operational infrastructure that holds it all together.

The Six Dimensions of Streaming Data Quality

1. Accuracy

A record is accurate if its values correctly represent the real-world state they are supposed to capture. Accuracy violations include unit mismatches (a price in cents recorded as dollars), sign errors (a withdrawal recorded as a deposit), and stale reference data (a product category that was correct last month but has since changed).

Accuracy is the hardest dimension to validate automatically because it requires knowing what “correct” looks like. Strategies include:

  • Cross-system reconciliation: Compare aggregates between the source system and the stream destination. If the sum of orders in PostgreSQL and the sum of order events in your data warehouse diverge, investigate.
  • Business rule validation: Encode known constraints - prices must be positive, quantities must be integers, timestamps cannot be in the future - and flag violations.
  • Anomaly detection: Flag records whose values fall outside historical statistical ranges.
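As a sketch of the cross-system reconciliation strategy, the Python below compares per-key aggregates from a source system and a stream destination and reports any divergence beyond a relative tolerance. The keys, totals, and tolerance are illustrative assumptions, not tied to any specific stack:

```python
# Sketch of cross-system reconciliation: compare per-day order totals
# from the source database against totals observed in the destination.
def reconcile(source_totals: dict, destination_totals: dict, tolerance: float = 0.001):
    """Return the keys whose totals diverge by more than `tolerance` (relative)."""
    mismatches = []
    for key in source_totals.keys() | destination_totals.keys():
        src = source_totals.get(key, 0.0)
        dst = destination_totals.get(key, 0.0)
        denom = max(abs(src), abs(dst), 1e-9)  # avoid division by zero
        if abs(src - dst) / denom > tolerance:
            mismatches.append((key, src, dst))
    return sorted(mismatches)

# Example: one day's totals diverge, flagging it for investigation
source = {"2026-02-24": 10500.00, "2026-02-25": 9800.00}
dest = {"2026-02-24": 10500.00, "2026-02-25": 9240.00}
print(reconcile(source, dest))  # → [('2026-02-25', 9800.0, 9240.0)]
```

In practice both input dictionaries would come from scheduled aggregate queries against the source database and the warehouse.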

2. Completeness

Completeness measures whether required fields are present. A record missing its user_id cannot be attributed. A transaction without a timestamp cannot be placed in time. An event without an amount is useless for revenue reporting.

Completeness checks are straightforward to implement and should be automated:

-- Flink SQL: route incomplete records to DLQ.
-- To run this and the INSERT below as a single Flink job, wrap both
-- statements in EXECUTE STATEMENT SET BEGIN ... END;
INSERT INTO events_dlq
SELECT *, 'INCOMPLETE: missing required fields' AS failure_reason
FROM raw_events
WHERE user_id IS NULL
   OR event_type IS NULL
   OR event_time IS NULL;

-- Route complete records to main flow
INSERT INTO validated_events
SELECT *
FROM raw_events
WHERE user_id IS NOT NULL
  AND event_type IS NOT NULL
  AND event_time IS NOT NULL;

Track null rates per field as a continuous metric. A field that is 0.01% null in steady state is alarming when it jumps to 15% - that is a signal of an upstream producer change, not a normal data pattern.
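A minimal sketch of that null-rate metric in Python - field names, batch contents, and the 2x-baseline alert factor are illustrative assumptions:

```python
# Sketch: compute per-field null rates over a micro-batch of records and
# flag fields whose rate jumps far above a known baseline.
from collections import Counter

def null_rates(records: list, fields: list) -> dict:
    """Fraction of records in which each field is null/missing."""
    counts = Counter()
    for rec in records:
        for f in fields:
            if rec.get(f) is None:
                counts[f] += 1
    n = max(len(records), 1)
    return {f: counts[f] / n for f in fields}

def alert_on_drift(rates: dict, baselines: dict, factor: float = 2.0) -> list:
    """Fields whose current null rate exceeds `factor` times their baseline."""
    return [f for f, rate in rates.items() if rate > factor * baselines.get(f, 0.0)]

batch = [
    {"user_id": "u1", "amount": 10},
    {"user_id": None, "amount": 20},
    {"user_id": None, "amount": None},
    {"user_id": "u4", "amount": 5},
]
rates = null_rates(batch, ["user_id", "amount"])
print(rates)                                                     # {'user_id': 0.5, 'amount': 0.25}
print(alert_on_drift(rates, {"user_id": 0.01, "amount": 0.2}))   # ['user_id']
```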

3. Consistency

Consistency means that data values agree across systems and over time. If the same user has account_tier = 'enterprise' in the CRM but account_tier = 'starter' in the event stream, there is a consistency problem. If a CDC stream shows an UPDATE to a record that was never INSERTed, the stream is internally inconsistent.

Consistency issues often arise at integration boundaries - where two systems maintain their own copies of the same information and can drift out of sync.

Referential integrity checks in Flink SQL:

-- Flag events that reference users not present in the user dimension
SELECT
  e.*,
  'CONSISTENCY: unknown user_id' AS failure_reason
FROM events e
LEFT JOIN user_dimension u ON e.user_id = u.user_id
WHERE u.user_id IS NULL;

Note that in a streaming context, this check can produce false positives: if an event arrives slightly before the corresponding user record, it will appear to reference an unknown user. Use a brief delay or a lookup cache with a short TTL to reduce false positives without masking real consistency problems.
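One way to sketch that short-TTL lookup cache in Python - the TTL value and keys are illustrative, and a production job would run this logic inside the stream processor or an async lookup function:

```python
# Sketch of a small TTL cache for dimension lookups. An event that
# references an unknown user can be held briefly and retried before
# being flagged, reducing false positives from arrival-order races.
import time

class TTLCache:
    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable clock makes this testable
        self._store = {}            # key -> (value, expiry)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if self.clock() > expiry:
            del self._store[key]    # expired: evict so a fresh lookup happens
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

cache = TTLCache(ttl_seconds=30)
cache.put("u-123", {"account_tier": "enterprise"})
print(cache.get("u-123"))   # {'account_tier': 'enterprise'}
print(cache.get("u-999"))   # None - unknown: hold and retry before flagging
```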

4. Timeliness

A record is timely if it arrives within an acceptable window of when the underlying event occurred. In batch processing, timeliness is rarely a first-class concern - data arrives daily and everyone accepts the T+1 lag. In streaming, timeliness is often the core value proposition.

Timeliness has two components:

End-to-end latency: How long does it take from event occurrence to the record being available in the destination? Monitor this continuously. A sudden jump in latency often precedes a full pipeline stall.

Out-of-order arrival: Events do not always arrive in the order they occurred. A mobile app might buffer events during offline periods and flush them when connectivity is restored, producing a burst of late events. Your pipeline needs a defined policy for late events:

  • Accept with watermark lag: Flink’s watermark mechanism allows late events up to a configured delay to be included in their correct window.
  • Route to correction queue: Events beyond the watermark can be routed to a late-events topic, which a separate job reconciles periodically.
  • Discard with logging: In cases where late events truly cannot be used, discard them but log the count and the lag distribution for investigation.
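The watermark-based policy above can be sketched in Python: the watermark trails the maximum event time seen by the allowed lateness, and anything behind it is routed to the late-events path. Integer event times and the lateness bound are illustrative assumptions:

```python
# Sketch: classify events against a watermark that trails the maximum
# observed event time by `allowed_lateness`. Events behind the watermark
# would be routed to a late-events topic for separate reconciliation.
def classify_events(events, allowed_lateness):
    """events: (event_id, event_time) pairs in arrival order."""
    max_seen = float("-inf")
    results = []
    for event_id, event_time in events:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - allowed_lateness   # watermark trails max event time
        label = "ACCEPT" if event_time >= watermark else "LATE"
        results.append((event_id, label))
    return results

# e3 arrives out of order but within the lateness bound; e4 is too late
arrivals = [("e1", 100), ("e2", 105), ("e3", 95), ("e4", 60)]
print(classify_events(arrivals, allowed_lateness=10))
# → [('e1', 'ACCEPT'), ('e2', 'ACCEPT'), ('e3', 'ACCEPT'), ('e4', 'LATE')]
```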

Measuring event-time lag in Flink SQL:

SELECT
  TUMBLE_START(proc_time, INTERVAL '1' MINUTE) AS window_start,
  AVG(TIMESTAMPDIFF(SECOND, event_time, CAST(proc_time AS TIMESTAMP(3)))) AS avg_lag_seconds,
  MAX(TIMESTAMPDIFF(SECOND, event_time, CAST(proc_time AS TIMESTAMP(3)))) AS max_lag_seconds,
  COUNT(*) AS record_count
FROM raw_events
GROUP BY TUMBLE(proc_time, INTERVAL '1' MINUTE);

5. Validity

Validity means that values conform to defined domain rules: correct format, acceptable range, expected enumeration. An email field containing "N/A" is not valid. A status field containing "REFUNDED_PENDING" when the only valid values are "PENDING", "COMPLETED", and "REFUNDED" is not valid.

Validity checks are highly domain-specific but follow a common structure:

-- Flink SQL validity checks
INSERT INTO validated_orders
SELECT *
FROM raw_orders
WHERE
  -- Range checks
  amount > 0 AND amount < 1000000
  -- Enum checks
  AND status IN ('PENDING', 'PROCESSING', 'COMPLETED', 'CANCELLED', 'REFUNDED')
  -- Format checks
  AND order_id RLIKE '^ORD-[0-9]{8}$'
  -- Logical consistency
  AND (completed_at IS NULL OR completed_at >= created_at);

-- Invalid records to DLQ. Note IS NOT TRUE rather than WHERE NOT (...):
-- if any field is NULL, the predicate evaluates to UNKNOWN, and NOT (...)
-- would silently drop the record instead of routing it to the DLQ.
INSERT INTO orders_dlq
SELECT *, 'VALIDITY: failed domain rules' AS failure_reason
FROM raw_orders
WHERE (
  amount > 0 AND amount < 1000000
  AND status IN ('PENDING', 'PROCESSING', 'COMPLETED', 'CANCELLED', 'REFUNDED')
  AND order_id RLIKE '^ORD-[0-9]{8}$'
  AND (completed_at IS NULL OR completed_at >= created_at)
) IS NOT TRUE;

6. Uniqueness

Uniqueness ensures that records which should be distinct are not duplicated. Duplicate events are common in streaming systems: at-least-once delivery guarantees mean a record can be delivered more than once. Network retries, application retries on failure, and Kafka consumer restarts all generate duplicates.

Deduplication strategies depend on the acceptable time window and the key:

-- Streaming deduplication with ROW_NUMBER: keeps the first occurrence
-- of each event_id. Flink recognizes this pattern as a deduplication
-- query. The query itself has no retention window; bound its state by
-- setting table.exec.state.ttl (for example, 24 hours).
SELECT event_id, user_id, event_type, event_time
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY event_id
      ORDER BY event_time ASC
    ) AS rn
  FROM raw_events
)
WHERE rn = 1;

For long-time-horizon deduplication (days or weeks), a row-number approach accumulates unbounded state. In those cases, consider using an external key-value store (Redis, DynamoDB) to track seen keys, queried via a Flink UDF.
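A sketch of that approach in Python, with an in-memory stand-in for the external store so the logic is self-contained - a real deployment would use Redis `SET key value NX EX ttl` or a DynamoDB conditional write, and the TTL and key format here are illustrative:

```python
# Sketch of long-horizon deduplication against an external key-value store.
import time

class KeyValueStore:
    """Minimal in-memory stand-in for Redis SET key value NX EX ttl."""
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self._data = {}  # key -> expiry

    def set_if_absent(self, key: str, ttl_seconds: float) -> bool:
        expiry = self._data.get(key)
        if expiry is not None and self.clock() < expiry:
            return False          # key already seen and not yet expired
        self._data[key] = self.clock() + ttl_seconds
        return True               # first time this key has been seen

def dedupe(events, store, ttl_seconds=7 * 24 * 3600):
    """Yield only the first occurrence of each event_id within the TTL."""
    for event in events:
        if store.set_if_absent("seen:" + event["event_id"], ttl_seconds):
            yield event

store = KeyValueStore()
events = [{"event_id": "a"}, {"event_id": "b"}, {"event_id": "a"}]
print([e["event_id"] for e in dedupe(events, store)])  # → ['a', 'b']
```

The store keeps one small entry per key with a TTL, so state stays bounded even over weeks - unlike an in-processor ROW_NUMBER approach.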

Schema Enforcement: The First Line of Defense

Schema enforcement prevents structurally malformed records from entering the pipeline at all. It is the highest-impact quality control because it operates at the point of ingestion, before any downstream system has seen the data.

Schema Registry Integration

A schema registry stores the canonical schema for each Kafka topic and enforces compatibility rules when producers attempt to register new schemas. The Confluent Schema Registry and AWS Glue Schema Registry both follow this model.

Compatibility modes:

Mode     | What It Allows                                    | What It Prevents
BACKWARD | New schema can read data written with old schema  | Adding fields without defaults, incompatible type changes
FORWARD  | Old schema can read data written with new schema  | Removing fields without defaults
FULL     | Both backward and forward                         | Most schema changes
NONE     | Any change                                        | Nothing - dangerous for production

For most production use cases, BACKWARD compatibility is the right choice: new consumers can read old data, enabling rolling upgrades without coordination.

Registering a schema via the Confluent Schema Registry API:

curl -X POST http://schema-registry:8081/subjects/orders-value/versions \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{
    "schema": "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[{\"name\":\"order_id\",\"type\":\"string\"},{\"name\":\"amount\",\"type\":\"double\"},{\"name\":\"status\",\"type\":\"string\"}]}"
  }'
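Before registering, producers can also ask the registry whether a candidate schema is compatible with the latest registered version. The sketch below follows the Confluent Schema Registry REST API's compatibility endpoint; the host, subject, and schema are illustrative assumptions:

```python
# Sketch: check a candidate schema against the latest registered version
# via the Schema Registry compatibility endpoint.
import json
import urllib.request

def compatibility_request(base_url: str, subject: str, schema: dict):
    """Build the URL and JSON payload for a compatibility check (no network here)."""
    url = f"{base_url}/compatibility/subjects/{subject}/versions/latest"
    # The registry expects the Avro schema as an escaped JSON string
    payload = json.dumps({"schema": json.dumps(schema)}).encode()
    return url, payload

def is_compatible(base_url: str, subject: str, schema: dict) -> bool:
    url, payload = compatibility_request(base_url, subject, schema)
    req = urllib.request.Request(
        url, data=payload, method="POST",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("is_compatible", False)

url, payload = compatibility_request(
    "http://schema-registry:8081", "orders-value",
    {"type": "record", "name": "Order", "fields": [
        {"name": "order_id", "type": "string"}]},
)
print(url)  # http://schema-registry:8081/compatibility/subjects/orders-value/versions/latest
```

Wiring this into CI lets a producer deployment fail fast instead of being rejected by the broker at runtime.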

Dead Letter Queues

A dead letter queue (DLQ) is the safety valve that makes all other quality checks safe to implement aggressively. Without a DLQ, aggressive validation either:

  • Causes the pipeline to halt (unacceptable availability)
  • Silently drops bad records (unacceptable for auditability)

With a DLQ, bad records are routed to a separate topic with metadata about why they failed. The main pipeline continues uninterrupted, and the DLQ contents can be investigated, corrected, and replayed.

DLQ record structure:

{
  "original_topic": "orders",
  "original_partition": 3,
  "original_offset": 18472,
  "failure_timestamp": "2026-02-25T14:32:11Z",
  "failure_reason": "VALIDITY: amount=-50.00 fails range check",
  "failure_stage": "validation",
  "original_payload": { ... }
}

Always include enough context in the DLQ record to:

  1. Identify the original record’s position in the source topic
  2. Understand why it failed
  3. Replay it to the original pipeline after correction
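A minimal sketch of building that envelope in Python - field names mirror the structure above, and the Kafka producer call that would publish it to the DLQ topic is omitted:

```python
# Sketch: wrap a failed record in a DLQ envelope carrying enough context
# to locate, diagnose, and replay it.
import json
from datetime import datetime, timezone

def to_dlq_record(payload: dict, topic: str, partition: int, offset: int,
                  reason: str, stage: str, now=None) -> str:
    envelope = {
        "original_topic": topic,
        "original_partition": partition,
        "original_offset": offset,
        "failure_timestamp": (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "failure_reason": reason,
        "failure_stage": stage,
        "original_payload": payload,   # keep the full record so it can be replayed
    }
    return json.dumps(envelope)

record = to_dlq_record({"order_id": "ORD-00000001", "amount": -50.0},
                       topic="orders", partition=3, offset=18472,
                       reason="VALIDITY: amount=-50.00 fails range check",
                       stage="validation")
print(record)
```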

Anomaly Detection on Streams

Rule-based validity checks catch known bad patterns. Anomaly detection catches unknown bad patterns - deviations from the statistical normal that indicate something unexpected has happened.

Statistical Process Control

Track a rolling mean and standard deviation for key metrics, and alert when values exceed N standard deviations from the mean:

-- Detect anomalous transaction amounts using rolling statistics.
-- Illustrative sketch: the historical_stats subquery is an unbounded
-- aggregate; in production, compute the baseline in a separate periodic
-- job and join against it rather than filtering on NOW().
SELECT
  window_start,
  window_end,
  avg_amount,
  stddev_amount,
  CASE
    WHEN ABS(current_avg - avg_amount) > 3 * stddev_amount THEN 'ANOMALY'
    ELSE 'NORMAL'
  END AS status
FROM (
  SELECT
    TUMBLE_START(event_time, INTERVAL '5' MINUTE)  AS window_start,
    TUMBLE_END(event_time, INTERVAL '5' MINUTE)    AS window_end,
    AVG(amount)                                    AS current_avg
  FROM transactions
  GROUP BY TUMBLE(event_time, INTERVAL '5' MINUTE)
) current_window
CROSS JOIN (
  SELECT AVG(amount) AS avg_amount, STDDEV(amount) AS stddev_amount
  FROM transactions
  WHERE event_time > NOW() - INTERVAL '7' DAY
) historical_stats;
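The same 3-sigma check, reduced to plain Python to show the logic in isolation - the baseline values are illustrative, and 3 standard deviations is the conventional control-chart threshold:

```python
# Sketch: flag a window average that falls more than n_sigma standard
# deviations from the historical mean.
import statistics

def spc_status(current_avg: float, history: list, n_sigma: float = 3.0) -> str:
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return "ANOMALY" if abs(current_avg - mean) > n_sigma * stdev else "NORMAL"

history = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0]  # e.g. 7-day baseline of window averages
print(spc_status(100.5, history))  # NORMAL
print(spc_status(140.0, history))  # ANOMALY
```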

Volume Anomalies

A sudden drop in event volume is often more alarming than a sudden spike. A drop can indicate a producer failure, a network partition, or a deployment that broke event emission.

Monitor events-per-minute continuously and alert when volume drops below a threshold for more than N consecutive minutes. The threshold should be relative to the historical baseline for that time of day (accounting for daily and weekly seasonality).
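A sketch of that sustained-drop rule in Python - the 50% floor and 3-minute persistence are illustrative thresholds, and the baseline array stands in for a per-time-of-day historical profile:

```python
# Sketch: alert only when events-per-minute stays below a fraction of the
# time-of-day baseline for N consecutive minutes, avoiding alerts on
# momentary dips.
def volume_alert(counts, baselines, min_fraction=0.5, consecutive=3):
    """counts/baselines: aligned per-minute event counts. True on a sustained drop."""
    below = 0
    for count, baseline in zip(counts, baselines):
        below = below + 1 if count < min_fraction * baseline else 0
        if below >= consecutive:
            return True
    return False

baseline = [1000] * 10                      # expected volume for this time of day
healthy = [980, 1010, 990, 1005, 970, 995, 1000, 1020, 985, 990]
stalled = [980, 1010, 120, 90, 80, 995, 1000, 1020, 985, 990]
print(volume_alert(healthy, baseline))  # False
print(volume_alert(stalled, baseline))  # True - three consecutive low minutes
```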

Monitoring and Alerting

Data quality monitoring for streaming pipelines requires a different set of metrics than infrastructure monitoring.

Key Quality Metrics

Metric                 | What It Reveals              | Alert Condition
Null rate per field    | Schema drift, producer bugs  | >2x baseline in 15 min
DLQ ingestion rate     | Validation failure rate      | Any non-zero rate above baseline
Schema violation count | Breaking schema changes      | Any non-zero in production
Consumer lag           | Pipeline throughput issues   | Sustained growth >10 min
Event-time lag         | Late event rate              | p99 lag > SLA threshold
Record count rate      | Source availability          | Drop >20% below baseline
Deduplication rate     | Upstream retry storms        | Spike above baseline

Embedding Quality Metadata

A useful pattern is to attach quality metadata to records as they pass through validation, rather than only routing failures to a DLQ:

INSERT INTO enriched_events
SELECT
  *,
  CASE
    WHEN user_id IS NULL                    THEN 'WARN:null_user_id'
    WHEN amount < 0                         THEN 'WARN:negative_amount'
    WHEN event_time < NOW() - INTERVAL '1' HOUR THEN 'WARN:late_event'
    ELSE 'OK'
  END AS quality_flag
FROM raw_events;

Downstream consumers can then filter on quality_flag = 'OK' for high-confidence analysis, or include WARN records with appropriate caveats for exploratory work.

Contrasting Batch and Streaming Data Quality

Aspect             | Batch                                  | Streaming
When checks run    | After load                             | Continuously, in real time
Impact of bad data | Corrupts a batch that can be rerun     | Propagates to live consumers immediately
Remediation        | Delete bad rows, rerun transformation  | DLQ + replay, or compensating events
Schema evolution   | Coordinated, infrequent                | Continuous, must be backward-compatible
Latency tolerance  | Hours (T+1 batch)                      | Seconds to minutes
Anomaly detection  | Threshold on daily totals              | Rolling window on recent data

The core shift is that in streaming, you cannot rely on “fix it before anyone sees it.” Data has already been consumed by the time a problem is detected. This makes prevention - schema enforcement, producer-side validation, DLQ routing - more valuable than detection.

A Production Quality Checklist

Before a streaming pipeline goes live, verify:

  • Schema registered in schema registry with appropriate compatibility mode
  • Producer validates schema before writing
  • Null rate checks configured for all required fields
  • Business rule validations encoded in Flink SQL or transform layer
  • DLQ topic created, accessible, and monitored
  • DLQ records include enough metadata for replay
  • Volume alerting configured (drop and spike)
  • Consumer lag alerting configured
  • End-to-end latency SLA defined and monitored
  • Late event policy documented and implemented
  • Deduplication strategy in place for at-least-once sources
  • Runbook exists for each alert condition

Data quality in streaming pipelines is not a feature you add at the end - it is a property you design in from the beginning. Tools like Streamkap expose pipeline-level quality controls (field validation, null handling, DLQ routing) as first-class configuration, making it practical to enforce quality standards without building custom validation infrastructure.

For guidance on how transformations fit into the quality story, see Stream Data Transformation: Patterns for Shaping Data in Real Time.