Dead Letter Queues in Stream Processing: Handling Bad Data Gracefully
Learn how to use dead letter queues (DLQs) in streaming pipelines to handle malformed, invalid, or unprocessable records without stopping the entire pipeline.
Every data engineer has lived through the same nightmare: a single malformed record enters a streaming pipeline, processing stalls, data stops flowing to downstream consumers, and the on-call engineer gets paged at 2 AM. The root cause is often something trivial - an unexpected null field, a JSON document with a trailing comma, a timestamp in the wrong format. But the impact is disproportionate because the pipeline treated every error as fatal.
Dead letter queues solve this problem. Instead of halting the entire pipeline when a record cannot be processed, a DLQ captures the problematic record in a separate destination, allows the pipeline to continue processing the remaining healthy records, and preserves enough context for an engineer to diagnose and fix the issue later.
This guide covers the design, implementation, and operational practices for dead letter queues in stream processing systems.
What Goes to a Dead Letter Queue
Not every failure belongs in a DLQ. The records that end up there share one characteristic: they represent a problem with the data itself, not with the infrastructure. Here are the main categories.
Deserialization Errors
The most common DLQ candidate. A producer writes a record that the consumer cannot parse - malformed JSON, corrupted Avro, or a payload that does not match the expected wire format. The consumer has no way to interpret the bytes, so it cannot process them regardless of how many times it retries.
Schema Validation Failures
The record deserializes successfully, but its structure violates the expected schema. A required field is missing, a numeric field contains a string, or an enum value is not in the allowed set. These are particularly common in CDC pipelines where upstream schema changes propagate before the downstream consumer is updated.
Transformation Errors
The record is structurally valid but causes a runtime error during processing - division by zero in a calculated field, a null pointer when accessing a nested object, or a regex that fails on unexpected input. User-defined functions (UDFs) are a frequent source of these errors.
Destination Write Failures
The record transforms successfully but the destination rejects it. A value exceeds a column length constraint, a foreign key does not exist, or a unique constraint is violated. These are record-specific failures that retrying will not fix.
DLQ vs Retry: Knowing the Difference
The decision of whether to retry a failed record or route it to the DLQ depends on one question: is the error transient or persistent?
Transient errors are infrastructure problems that resolve themselves - network timeouts, temporary unavailability of a downstream service, rate limiting, or brief disk pressure on a broker. These should be retried with exponential backoff. Sending them to a DLQ would discard records that would have succeeded on the next attempt.
Persistent errors are data problems that no amount of retrying will fix. A malformed JSON document will not become valid JSON on the second attempt. A schema violation will not resolve itself. These belong in the DLQ.
A common pattern is to combine both: retry a configurable number of times (typically 3-5 for transient errors), and if the record still fails, route it to the DLQ. This catches the edge case where what appears to be a transient error is actually persistent.
Record fails → Retry (up to N times with backoff)
→ Success? Continue processing
→ Still failing? Route to DLQ
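The flow above can be sketched as a small wrapper. This is a minimal, framework-agnostic sketch: `process`, `is_transient`, and `write_to_dlq` are assumed hooks supplied by the caller, not part of any specific library.

```python
import time

def process_with_retry(record, process, is_transient, write_to_dlq,
                       max_retries=3, base_delay=0.5):
    """Retry transient failures with exponential backoff; route
    persistent (or repeatedly failing) records to the DLQ."""
    for attempt in range(max_retries + 1):
        try:
            return process(record)
        except Exception as e:
            if not is_transient(e):
                # Persistent data error: retrying will not help.
                write_to_dlq(record, error=e, retry_count=attempt)
                return None
            if attempt == max_retries:
                # A "transient" error that never resolved: treat as persistent.
                write_to_dlq(record, error=e, retry_count=attempt)
                return None
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
```

Note that a persistent error short-circuits straight to the DLQ, while a transient one consumes retry attempts first; this is the edge case the text mentions, where an apparently transient error turns out to be persistent.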
DLQ Record Structure
A DLQ record should contain everything an engineer needs to diagnose the problem and replay the record after fixing it. Storing only the error message without the original data is useless. Storing only the original data without the error context forces the engineer to reproduce the failure locally.
Here is a well-structured DLQ record:
{
  "original_record": {
    "key": "user-12345",
    "value": "{\"name\": \"Alice\", \"age\": \"not_a_number\"}",
    "headers": {
      "content-type": "application/json"
    }
  },
  "error": {
    "message": "Schema validation failed: field 'age' expected integer, got string",
    "exception_class": "org.apache.kafka.connect.errors.DataException",
    "stack_trace": "..."
  },
  "source": {
    "topic": "users.cdc.events",
    "partition": 3,
    "offset": 1847293,
    "timestamp": "2026-02-25T14:32:01.445Z"
  },
  "pipeline": {
    "connector_name": "users-to-snowflake",
    "task_id": 2,
    "retry_count": 3
  },
  "dlq_timestamp": "2026-02-25T14:32:05.112Z"
}
Key fields to always include:
- original_record: The raw record exactly as it was received, including key, value, and headers
- error message and class: What went wrong and what type of exception was thrown
- source metadata: Topic, partition, and offset so you can locate the record in the source system
- pipeline context: Which connector or processor failed, which task, and how many retries were attempted
- DLQ timestamp: When the record was routed to the DLQ (distinct from the original record timestamp)
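A builder for this envelope is straightforward. The sketch below mirrors the field names from the example record above; the `source` and `pipeline` dictionaries are illustrative context, not a specific framework's API.

```python
import traceback
from datetime import datetime, timezone

def build_dlq_record(key, value, headers, error, source, pipeline):
    """Wrap a failed record with full error and pipeline context."""
    return {
        "original_record": {"key": key, "value": value, "headers": headers},
        "error": {
            "message": str(error),
            "exception_class": type(error).__name__,
            "stack_trace": "".join(
                traceback.format_exception(type(error), error, error.__traceback__)
            ),
        },
        "source": source,        # topic, partition, offset, timestamp
        "pipeline": pipeline,    # connector_name, task_id, retry_count
        "dlq_timestamp": datetime.now(timezone.utc).isoformat(),
    }
```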
DLQ Destinations: Where to Send Failed Records
The choice of DLQ destination depends on your operational workflow and the volume of failed records.
Kafka Topic
The most natural choice when your pipeline already runs on Kafka. Failed records go to a dedicated topic (e.g., dlq.users-pipeline) and retain Kafka’s ordering, partitioning, and retention semantics. Pros: easy to consume and replay using standard Kafka tooling, supports high throughput. Cons: subject to Kafka retention limits, requires Kafka access for inspection.
Object Storage (S3/GCS)
Good for long-term retention and environments where DLQ records need to be queried ad hoc. Each failed record or batch is written as a JSON file with a structured path like s3://dlq/users-pipeline/2026/02/25/14/record-001.json. Pros: cheap storage, easy to query with Athena or BigQuery, no retention pressure. Cons: higher write latency, harder to replay programmatically.
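The time-partitioned key scheme mentioned above can be generated like this. The bucket and prefix names are illustrative, and the commented `put_object` call assumes boto3 is available; it is a sketch, not a complete writer.

```python
from datetime import datetime, timezone

def dlq_object_key(pipeline, seq, ts=None):
    """Build a time-partitioned object key like
    users-pipeline/2026/02/25/14/record-001.json"""
    ts = ts or datetime.now(timezone.utc)
    return f"{pipeline}/{ts:%Y/%m/%d/%H}/record-{seq:03d}.json"

# Writing the record is then a plain put (assuming boto3):
#   s3.put_object(Bucket="dlq",
#                 Key=dlq_object_key("users-pipeline", 1),
#                 Body=json.dumps(dlq_record))
```

Partitioning keys by hour keeps ad hoc queries cheap: Athena or BigQuery can prune to the time window under investigation instead of scanning the whole bucket.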
Database Table
Useful when your team prefers SQL-based investigation. A DLQ table with columns for the original record, error details, source metadata, and status (new, investigating, replayed, discarded) gives you a built-in workflow. Pros: familiar query interface, easy to build dashboards. Cons: can become a bottleneck at high error volumes, requires schema management.
In practice, many teams combine approaches - writing to a Kafka topic for immediate replay capability and syncing to S3 or a database for long-term analysis.
Implementation Patterns
Try-Catch in Stream Processors
The simplest pattern. Wrap your processing logic in a try-catch block and route caught exceptions to the DLQ.
def process_record(record):
    try:
        validated = validate_schema(record)
        transformed = apply_transformations(validated)
        write_to_destination(transformed)
    except SchemaValidationError as e:
        write_to_dlq(record, error=e, error_type="schema_validation")
    except TransformationError as e:
        write_to_dlq(record, error=e, error_type="transformation")
    except DestinationWriteError as e:
        if e.is_transient:
            raise  # Let the framework retry
        write_to_dlq(record, error=e, error_type="write_failure")
Kafka Connect Error Handling
Kafka Connect has built-in DLQ support via the errors.tolerance and errors.deadletterqueue.topic.name configuration properties.
# Enable error tolerance - skip bad records instead of failing the task
errors.tolerance=all
# Route failed records to a DLQ topic
errors.deadletterqueue.topic.name=dlq.my-connector
errors.deadletterqueue.topic.replication.factor=3
# Include error context in DLQ record headers
errors.deadletterqueue.context.headers.enable=true
# Log errors for monitoring
errors.log.enable=true
errors.log.include.messages=true
This configuration tells Kafka Connect to tolerate all errors, route failed records to the specified DLQ topic, and attach error context as record headers. Streamkap uses this pattern under the hood, providing built-in DLQ handling with full error context and dashboard visibility so you do not need to configure these properties manually.
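With context headers enabled, Kafka Connect attaches the error details as record headers prefixed with `__connect.errors.` (for example, `__connect.errors.exception.message`). A consumer-side sketch that pulls them out of a list of `(key, bytes_value)` header tuples:

```python
def extract_error_context(headers):
    """Collect Kafka Connect DLQ context headers into a plain dict,
    stripping the common prefix for readability."""
    prefix = "__connect.errors."
    return {
        key[len(prefix):]: value.decode("utf-8", errors="replace")
        for key, value in headers
        if key.startswith(prefix)
    }
```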
Flink Side Outputs
Apache Flink provides side outputs as a first-class mechanism for routing records to alternative destinations. This is the idiomatic way to implement DLQs in Flink.
OutputTag<String> dlqTag = new OutputTag<String>("dlq") {};

SingleOutputStreamOperator<ProcessedRecord> mainStream = input
    .process(new ProcessFunction<RawRecord, ProcessedRecord>() {
        @Override
        public void processElement(RawRecord record, Context ctx,
                                   Collector<ProcessedRecord> out) {
            try {
                out.collect(transform(record));
            } catch (Exception e) {
                ctx.output(dlqTag, buildDlqRecord(record, e));
            }
        }
    });

DataStream<String> dlqStream = mainStream.getSideOutput(dlqTag);
dlqStream.sinkTo(dlqSink);
Monitoring and Alerting
A DLQ that nobody monitors is just a graveyard for lost data. Effective DLQ operations require three layers of observability.
Volume metrics: Track the number of records entering the DLQ per minute and per hour. A sudden spike indicates a systemic issue - a schema change, a bad deployment, or a producer bug. A steady trickle is normal for most pipelines.
Error categorization: Group DLQ records by error type. If 95% of your DLQ records are schema validation failures on a single field, that tells you exactly where to look. Without categorization, you are sifting through noise.
Alert thresholds: Set alerts at two levels. A warning when the DLQ rate exceeds a baseline (e.g., more than 100 records per hour). A critical alert when the DLQ rate suggests the pipeline is functionally broken (e.g., more than 10% of records are failing). Adjust thresholds based on your pipeline’s normal error rate.
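All three layers can be combined in one pass over recent DLQ records. The sketch below uses the example thresholds from the text and assumes DLQ envelopes shaped like the record structure shown earlier (`error.exception_class`); the thresholds would be tuned per pipeline.

```python
from collections import Counter

def summarize_dlq(dlq_records, total_records,
                  warn_per_hour=100, critical_ratio=0.10):
    """Categorize DLQ records by error type and pick an alert level:
    'critical' if the failure ratio exceeds critical_ratio,
    'warning' if the absolute count exceeds warn_per_hour."""
    by_type = Counter(r["error"]["exception_class"] for r in dlq_records)
    n = sum(by_type.values())
    level = "ok"
    if total_records and n / total_records > critical_ratio:
        level = "critical"
    elif n > warn_per_hour:
        level = "warning"
    return {"count": n, "by_type": dict(by_type), "alert": level}
```

A summary like this doubles as the categorization view: if one exception class dominates `by_type`, that is where to look first.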
Replay Strategies
The whole point of a DLQ is to recover the data after fixing the root cause. Without a replay strategy, a DLQ is just a more polite way of dropping records.
Manual Replay
Read records from the DLQ, inspect them, apply corrections if needed, and re-inject them into the main pipeline input topic. This works for low-volume DLQs and one-off issues. A simple script that reads from the DLQ topic and produces to the main topic is often sufficient.
Automated Replay Pipelines
For high-volume DLQs or recurring error patterns, build a secondary pipeline that periodically reads the DLQ, applies a known fix (e.g., coerce the field to the correct type, fill in a default for a missing field), and writes to the destination. This requires confidence that the fix is correct and will not introduce new errors.
Partial Replay
Sometimes only a subset of DLQ records are recoverable. Filter the DLQ by error type or time window, replay only the fixable records, and archive or discard the rest. This is common when the root cause is a producer bug that was already fixed - records produced after the fix are fine, but the corrupted records from before the fix may be unrecoverable.
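A partial replay starts with a filter. This sketch selects only the DLQ records matching a known error type inside a time window and separates out the rest for archival; field names follow the DLQ record structure shown earlier.

```python
from datetime import datetime

def select_replayable(dlq_records, error_class, start, end):
    """Split DLQ records into (replayable, rest) by error type
    and DLQ timestamp window [start, end)."""
    selected, rest = [], []
    for rec in dlq_records:
        ts = datetime.fromisoformat(rec["dlq_timestamp"])
        if rec["error"]["exception_class"] == error_class and start <= ts < end:
            selected.append(rec)
        else:
            rest.append(rec)
    return selected, rest
```

The `selected` records would then be re-injected into the main pipeline input, while `rest` is archived or discarded after review.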
Practical Example: CDC Pipeline with DLQ
Consider a CDC pipeline that captures changes from a PostgreSQL database, transforms them, and writes to a data warehouse. Here is how errors flow through the system:
- Source: Debezium captures an INSERT on the orders table. The record includes a JSON column order_metadata that the application writes as a free-form string.
- Deserialization: The consumer reads the CDC event. The order_metadata field contains {status: "pending"} - invalid JSON because the keys are not quoted. The deserialization step fails.
- DLQ routing: The pipeline catches the JsonParseException, wraps the original CDC event with the error context, and writes it to the dlq.orders-pipeline Kafka topic.
- Pipeline continues: The remaining records in the batch process successfully. Downstream dashboards continue to update.
- Investigation: An engineer queries the DLQ, sees a cluster of JsonParseException errors on the order_metadata field, and traces them to a recent application deployment that introduced a logging library writing non-standard JSON.
- Fix and replay: The application bug is fixed. The engineer writes a small script that reads the DLQ records, parses the non-standard JSON with a lenient parser, and re-injects the corrected records into the pipeline.
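The "lenient parser" step in a replay script like this can be as simple as quoting the bare object keys so that non-standard JSON such as {status: "pending"} becomes valid. The regex below handles simple unquoted keys only; it is illustrative, not a general-purpose JSON repair tool.

```python
import json
import re

def repair_bare_keys(text):
    """Quote unquoted object keys: {status: "pending"} -> {"status": "pending"}.
    Matches a key-like identifier after '{' or ',' followed by ':'."""
    return re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', text)
```

After repair, the corrected value parses with the standard library (`json.loads(repair_bare_keys(raw))`) and the record can be re-injected into the pipeline.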
Without the DLQ, this scenario would have halted the entire orders pipeline until the application team deployed a fix - potentially hours of data loss.
Best Practices
Always preserve the original record. Never transform or truncate the data before writing it to the DLQ. You need the exact bytes that caused the failure to reproduce and diagnose the issue.
Include enough context to replay without guessing. Source topic, partition, offset, timestamp, connector name, task ID, and retry count. An engineer looking at a DLQ record six hours after the fact should not need to cross-reference three different systems to understand what happened.
Set retention policies. DLQ data should not live forever. Set a retention period (30-90 days is common) and archive to cold storage if you need longer retention for compliance.
Review DLQ records regularly. Schedule a weekly review of DLQ contents. Patterns in DLQ records reveal upstream data quality issues that should be fixed at the source, not worked around in the pipeline.
Separate DLQs per pipeline. A single shared DLQ for all pipelines makes triage difficult. Each pipeline should have its own DLQ destination with clear naming conventions.
Test your DLQ path. Intentionally send a malformed record through your pipeline in a staging environment and verify that it lands in the DLQ with complete metadata. Platforms like Streamkap make this easier by providing built-in DLQ visibility, but regardless of your tooling, the DLQ path should be tested as rigorously as the happy path.
The question is never whether bad data will arrive in your pipeline. It will. The question is whether your pipeline handles it gracefully or falls over. A well-designed dead letter queue is the difference between a minor operational inconvenience and a full-blown data outage.