Dead Letter Queues in Stream Processing: Handling Bad Data Gracefully
Learn how to use dead letter queues (DLQs) in streaming pipelines to handle malformed, invalid, or unprocessable records without stopping the entire pipeline.
Every data engineer has lived through the same nightmare: a single malformed record enters a streaming pipeline, processing stalls, data stops flowing to downstream consumers, and the on-call engineer gets paged at 2 AM. The root cause is often something trivial - an unexpected null field, a JSON document with a trailing comma, a timestamp in the wrong format. But the impact is disproportionate because the pipeline treated every error as fatal.
Dead letter queues solve this problem. Instead of halting the entire pipeline when a record cannot be processed, a DLQ captures the problematic record in a separate destination, allows the pipeline to continue processing the remaining healthy records, and preserves enough context for an engineer to diagnose and fix the issue later.
This guide covers the design, implementation, and operational practices for dead letter queues in stream processing systems.
What Goes to a Dead Letter Queue
Not every failure belongs in a DLQ. The records that end up there share one characteristic: they represent a problem with the data itself, not with the infrastructure. Here are the main categories.
Deserialization Errors
The most common DLQ candidate. A producer writes a record that the consumer cannot parse - malformed JSON, corrupted Avro, or a payload that does not match the expected wire format. The consumer has no way to interpret the bytes, so it cannot process them regardless of how many times it retries.
Schema Validation Failures
The record deserializes successfully, but its structure violates the expected schema. A required field is missing, a numeric field contains a string, or an enum value is not in the allowed set. These are particularly common in CDC pipelines where upstream schema changes propagate before the downstream consumer is updated.
Transformation Errors
The record is structurally valid but causes a runtime error during processing - division by zero in a calculated field, a null pointer when accessing a nested object, or a regex that fails on unexpected input. User-defined functions (UDFs) are a frequent source of these errors.
Destination Write Failures
The record transforms successfully but the destination rejects it. A value exceeds a column length constraint, a foreign key does not exist, or a unique constraint is violated. These are record-specific failures that retrying will not fix.
DLQ vs Retry: Knowing the Difference
The decision of whether to retry a failed record or route it to the DLQ depends on one question: is the error transient or persistent?
Transient errors are infrastructure problems that resolve themselves - network timeouts, temporary unavailability of a downstream service, rate limiting, or brief disk pressure on a broker. These should be retried with exponential backoff. Sending them to a DLQ would discard records that would have succeeded on the next attempt.
Persistent errors are data problems that no amount of retrying will fix. A malformed JSON document will not become valid JSON on the second attempt. A schema violation will not resolve itself. These belong in the DLQ.
A common pattern is to combine both: retry a configurable number of times (typically 3-5 for transient errors), and if the record still fails, route it to the DLQ. This catches the edge case where what appears to be a transient error is actually persistent.
Record fails → Retry (up to N times with backoff)
→ Success? Continue processing
→ Still failing? Route to DLQ
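The flow above can be sketched as a small wrapper. This is a minimal, framework-agnostic sketch: `process`, `is_transient`, and `write_to_dlq` are assumed hooks supplied by the caller, not part of any specific library.

```python
import time

def process_with_retry(record, process, is_transient, write_to_dlq,
                       max_retries=3, base_delay=0.5):
    """Retry transient failures with exponential backoff; route
    persistent (or repeatedly failing) records to the DLQ."""
    for attempt in range(max_retries + 1):
        try:
            return process(record)
        except Exception as e:
            if not is_transient(e):
                # Persistent data error: retrying will not help.
                write_to_dlq(record, error=e, retry_count=attempt)
                return None
            if attempt == max_retries:
                # A "transient" error that never resolved: treat as persistent.
                write_to_dlq(record, error=e, retry_count=attempt)
                return None
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
```

Note that a persistent error short-circuits straight to the DLQ, while a transient one consumes retry attempts first; this is the edge case the text mentions, where an apparently transient error turns out to be persistent.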
DLQ Record Structure
A DLQ record should contain everything an engineer needs to diagnose the problem and replay the record after fixing it. Storing only the error message without the original data is useless. Storing only the original data without the error context forces the engineer to reproduce the failure locally.
Here is a well-structured DLQ record:
{
  "original_record": {
    "key": "user-12345",
    "value": "{\"name\": \"Alice\", \"age\": \"not_a_number\"}",
    "headers": {
      "content-type": "application/json"
    }
  },
  "error": {
    "message": "Schema validation failed: field 'age' expected integer, got string",
    "exception_class": "org.apache.kafka.connect.errors.DataException",
    "stack_trace": "..."
  },
  "source": {
    "topic": "users.cdc.events",
    "partition": 3,
    "offset": 1847293,
    "timestamp": "2026-02-25T14:32:01.445Z"
  },
  "pipeline": {
    "connector_name": "users-to-snowflake",
    "task_id": 2,
    "retry_count": 3
  },
  "dlq_timestamp": "2026-02-25T14:32:05.112Z"
}
Key fields to always include:
- original_record: The raw record exactly as it was received, including key, value, and headers
- error message and class: What went wrong and what type of exception was thrown
- source metadata: Topic, partition, and offset so you can locate the record in the source system
- pipeline context: Which connector or processor failed, which task, and how many retries were attempted
- DLQ timestamp: When the record was routed to the DLQ (distinct from the original record timestamp)
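A builder for this envelope is straightforward. The sketch below mirrors the field names from the example record above; the `source` and `pipeline` dictionaries are illustrative context, not a specific framework's API.

```python
import traceback
from datetime import datetime, timezone

def build_dlq_record(key, value, headers, error, source, pipeline):
    """Wrap a failed record with full error and pipeline context."""
    return {
        "original_record": {"key": key, "value": value, "headers": headers},
        "error": {
            "message": str(error),
            "exception_class": type(error).__name__,
            "stack_trace": "".join(
                traceback.format_exception(type(error), error, error.__traceback__)
            ),
        },
        "source": source,        # topic, partition, offset, timestamp
        "pipeline": pipeline,    # connector_name, task_id, retry_count
        "dlq_timestamp": datetime.now(timezone.utc).isoformat(),
    }
```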
DLQ Destinations: Where to Send Failed Records
The choice of DLQ destination depends on your operational workflow and the volume of failed records.
Kafka Topic
The most natural choice when your pipeline already runs on Kafka. Failed records go to a dedicated topic (e.g., dlq.users-pipeline) and retain Kafka’s ordering, partitioning, and retention semantics. Pros: easy to consume and replay using standard Kafka tooling, supports high throughput. Cons: subject to Kafka retention limits, requires Kafka access for inspection.
Object Storage (S3/GCS)
Good for long-term retention and environments where DLQ records need to be queried ad hoc. Each failed record or batch is written as a JSON file with a structured path like s3://dlq/users-pipeline/2026/02/25/14/record-001.json. Pros: cheap storage, easy to query with Athena or BigQuery, no retention pressure. Cons: higher write latency, harder to replay programmatically.
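The time-partitioned key scheme mentioned above can be generated like this. The bucket and prefix names are illustrative, and the commented `put_object` call assumes boto3 is available; it is a sketch, not a complete writer.

```python
from datetime import datetime, timezone

def dlq_object_key(pipeline, seq, ts=None):
    """Build a time-partitioned object key like
    users-pipeline/2026/02/25/14/record-001.json"""
    ts = ts or datetime.now(timezone.utc)
    return f"{pipeline}/{ts:%Y/%m/%d/%H}/record-{seq:03d}.json"

# Writing the record is then a plain put (assuming boto3):
#   s3.put_object(Bucket="dlq",
#                 Key=dlq_object_key("users-pipeline", 1),
#                 Body=json.dumps(dlq_record))
```

Partitioning keys by hour keeps ad hoc queries cheap: Athena or BigQuery can prune to the time window under investigation instead of scanning the whole bucket.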
Database Table
Useful when your team prefers SQL-based investigation. A DLQ table with columns for the original record, error details, source metadata, and status (new, investigating, replayed, discarded) gives you a built-in workflow. Pros: familiar query interface, easy to build dashboards. Cons: can become a bottleneck at high error volumes, requires schema management.
In practice, many teams combine approaches - writing to a Kafka topic for immediate replay capability and syncing to S3 or a database for long-term analysis.
Implementation Patterns
Try-Catch in Stream Processors
The simplest pattern. Wrap your processing logic in a try-catch block and route caught exceptions to the DLQ.
def process_record(record):
    try:
        validated = validate_schema(record)
        transformed = apply_transformations(validated)
        write_to_destination(transformed)
    except SchemaValidationError as e:
        write_to_dlq(record, error=e, error_type="schema_validation")
    except TransformationError as e:
        write_to_dlq(record, error=e, error_type="transformation")
    except DestinationWriteError as e:
        if e.is_transient:
            raise  # Let the framework retry
        write_to_dlq(record, error=e, error_type="write_failure")
Kafka Connect Error Handling
Kafka Connect has built-in DLQ support via the errors.tolerance and errors.deadletterqueue.topic.name configuration properties.
# Enable error tolerance - skip bad records instead of failing the task
errors.tolerance=all
# Route failed records to a DLQ topic
errors.deadletterqueue.topic.name=dlq.my-connector
errors.deadletterqueue.topic.replication.factor=3
# Include error context in DLQ record headers
errors.deadletterqueue.context.headers.enable=true
# Log errors for monitoring
errors.log.enable=true
errors.log.include.messages=true
This configuration tells Kafka Connect to tolerate all errors, route failed records to the specified DLQ topic, and attach error context as record headers. Streamkap uses this pattern under the hood, providing built-in DLQ handling with full error context and dashboard visibility so you do not need to configure these properties manually.
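With context headers enabled, Kafka Connect attaches the error details as record headers prefixed with `__connect.errors.` (for example, `__connect.errors.exception.message`). A consumer-side sketch that pulls them out of a list of `(key, bytes_value)` header tuples:

```python
def extract_error_context(headers):
    """Collect Kafka Connect DLQ context headers into a plain dict,
    stripping the common prefix for readability."""
    prefix = "__connect.errors."
    return {
        key[len(prefix):]: value.decode("utf-8", errors="replace")
        for key, value in headers
        if key.startswith(prefix)
    }
```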
Flink Side Outputs
Apache Flink provides side outputs as a first-class mechanism for routing records to alternative destinations. This is the idiomatic way to implement DLQs in Flink.
OutputTag<String> dlqTag = new OutputTag<String>("dlq") {};

SingleOutputStreamOperator<ProcessedRecord> mainStream = input
    .process(new ProcessFunction<RawRecord, ProcessedRecord>() {
        @Override
        public void processElement(RawRecord record, Context ctx,
                                   Collector<ProcessedRecord> out) {
            try {
                out.collect(transform(record));
            } catch (Exception e) {
                ctx.output(dlqTag, buildDlqRecord(record, e));
            }
        }
    });

DataStream<String> dlqStream = mainStream.getSideOutput(dlqTag);
dlqStream.sinkTo(dlqSink);
Monitoring and Alerting
A DLQ that nobody monitors is just a graveyard for lost data. Effective DLQ operations require three layers of observability.
Volume metrics: Track the number of records entering the DLQ per minute and per hour. A sudden spike indicates a systemic issue - a schema change, a bad deployment, or a producer bug. A steady trickle is normal for most pipelines.
Error categorization: Group DLQ records by error type. If 95% of your DLQ records are schema validation failures on a single field, that tells you exactly where to look. Without categorization, you are sifting through noise.
Alert thresholds: Set alerts at two levels. A warning when the DLQ rate exceeds a baseline (e.g., more than 100 records per hour). A critical alert when the DLQ rate suggests the pipeline is functionally broken (e.g., more than 10% of records are failing). Adjust thresholds based on your pipeline’s normal error rate.
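All three layers can be combined in one pass over recent DLQ records. The sketch below uses the example thresholds from the text and assumes DLQ envelopes shaped like the record structure shown earlier (`error.exception_class`); the thresholds would be tuned per pipeline.

```python
from collections import Counter

def summarize_dlq(dlq_records, total_records,
                  warn_per_hour=100, critical_ratio=0.10):
    """Categorize DLQ records by error type and pick an alert level:
    'critical' if the failure ratio exceeds critical_ratio,
    'warning' if the absolute count exceeds warn_per_hour."""
    by_type = Counter(r["error"]["exception_class"] for r in dlq_records)
    n = sum(by_type.values())
    level = "ok"
    if total_records and n / total_records > critical_ratio:
        level = "critical"
    elif n > warn_per_hour:
        level = "warning"
    return {"count": n, "by_type": dict(by_type), "alert": level}
```

A summary like this doubles as the categorization view: if one exception class dominates `by_type`, that is where to look first.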
Replay Strategies
The whole point of a DLQ is to recover the data after fixing the root cause. Without a replay strategy, a DLQ is just a more polite way of dropping records.
Manual Replay
Read records from the DLQ, inspect them, apply corrections if needed, and re-inject them into the main pipeline input topic. This works for low-volume DLQs and one-off issues. A simple script that reads from the DLQ topic and produces to the main topic is often sufficient.
Automated Replay Pipelines
For high-volume DLQs or recurring error patterns, build a secondary pipeline that periodically reads the DLQ, applies a known fix (e.g., coerce the field to the correct type, fill in a default for a missing field), and writes to the destination. This requires confidence that the fix is correct and will not introduce new errors.
Partial Replay
Sometimes only a subset of DLQ records are recoverable. Filter the DLQ by error type or time window, replay only the fixable records, and archive or discard the rest. This is common when the root cause is a producer bug that was already fixed - records produced after the fix are fine, but the corrupted records from before the fix may be unrecoverable.
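A partial replay starts with a filter. This sketch selects only the DLQ records matching a known error type inside a time window and separates out the rest for archival; field names follow the DLQ record structure shown earlier.

```python
from datetime import datetime

def select_replayable(dlq_records, error_class, start, end):
    """Split DLQ records into (replayable, rest) by error type
    and DLQ timestamp window [start, end)."""
    selected, rest = [], []
    for rec in dlq_records:
        ts = datetime.fromisoformat(rec["dlq_timestamp"])
        if rec["error"]["exception_class"] == error_class and start <= ts < end:
            selected.append(rec)
        else:
            rest.append(rec)
    return selected, rest
```

The `selected` records would then be re-injected into the main pipeline input, while `rest` is archived or discarded after review.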
Practical Example: CDC Pipeline with DLQ
Consider a CDC pipeline that captures changes from a PostgreSQL database, transforms them, and writes to a data warehouse. Here is how errors flow through the system:
- Source: Debezium captures an INSERT on the orders table. The record includes a JSON column order_metadata that the application writes as a free-form string.
- Deserialization: The consumer reads the CDC event. The order_metadata field contains {status: "pending"} - invalid JSON because the keys are not quoted. The deserialization step fails.
- DLQ routing: The pipeline catches the JsonParseException, wraps the original CDC event with the error context, and writes it to the dlq.orders-pipeline Kafka topic.
- Pipeline continues: The remaining records in the batch process successfully. Downstream dashboards continue to update.
- Investigation: An engineer queries the DLQ, sees a cluster of JsonParseException errors on the order_metadata field, and traces them to a recent application deployment that introduced a logging library writing non-standard JSON.
- Fix and replay: The application bug is fixed. The engineer writes a small script that reads the DLQ records, parses the non-standard JSON with a lenient parser, and re-injects the corrected records into the pipeline.
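The "lenient parser" step in a replay script like this can be as simple as quoting the bare object keys so that non-standard JSON such as {status: "pending"} becomes valid. The regex below handles simple unquoted keys only; it is illustrative, not a general-purpose JSON repair tool.

```python
import json
import re

def repair_bare_keys(text):
    """Quote unquoted object keys: {status: "pending"} -> {"status": "pending"}.
    Matches a key-like identifier after '{' or ',' followed by ':'."""
    return re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', text)
```

After repair, the corrected value parses with the standard library (`json.loads(repair_bare_keys(raw))`) and the record can be re-injected into the pipeline.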
Without the DLQ, this scenario would have halted the entire orders pipeline until the application team deployed a fix - potentially hours of data loss.
Best Practices
Always preserve the original record. Never transform or truncate the data before writing it to the DLQ. You need the exact bytes that caused the failure to reproduce and diagnose the issue.
Include enough context to replay without guessing. Source topic, partition, offset, timestamp, connector name, task ID, and retry count. An engineer looking at a DLQ record six hours after the fact should not need to cross-reference three different systems to understand what happened.
Set retention policies. DLQ data should not live forever. Set a retention period (30-90 days is common) and archive to cold storage if you need longer retention for compliance.
Review DLQ records regularly. Schedule a weekly review of DLQ contents. Patterns in DLQ records reveal upstream data quality issues that should be fixed at the source, not worked around in the pipeline.
Separate DLQs per pipeline. A single shared DLQ for all pipelines makes triage difficult. Each pipeline should have its own DLQ destination with clear naming conventions.
Test your DLQ path. Intentionally send a malformed record through your pipeline in a staging environment and verify that it lands in the DLQ with complete metadata. Platforms like Streamkap make this easier by providing built-in DLQ visibility, but regardless of your tooling, the DLQ path should be tested as rigorously as the happy path.
The question is never whether bad data will arrive in your pipeline. It will. The question is whether your pipeline handles it gracefully or falls over. A well-designed dead letter queue is the difference between a minor operational inconvenience and a full-blown data outage.