
Engineering

February 25, 2026

10 min read

Real-Time Data Deduplication: Eliminating Duplicates in Streams

Learn how to detect and eliminate duplicate records in real-time streaming pipelines. Implement deduplication with Flink SQL, Kafka, and idempotent sinks.

TL;DR:

  • Duplicates in streaming are inevitable - caused by at-least-once delivery, source retries, CDC snapshot overlap, and producer retries. The question is where and how to deduplicate.
  • Flink SQL's ROW_NUMBER with PARTITION BY and ORDER BY is the standard deduplication pattern - it keeps the latest (or first) record per key within a time window.
  • Idempotent sinks (UPSERT on primary key) provide a complementary last line of defense, ensuring duplicates do not corrupt destination state.

Duplicates are not a bug in streaming pipelines. They are a design consequence. Every system that prioritizes reliability over precision - and in distributed computing, that is nearly every system - will produce duplicate records under certain conditions. A Kafka producer retries a timed-out write that actually succeeded. A CDC connector restarts and replays the tail end of its last checkpoint. A consumer group rebalances and re-processes a batch of messages. The result is always the same: rows that appear more than once in your destination.

The cost of ignoring duplicates ranges from inflated metrics and double-counted revenue to corrupted aggregations that silently poison downstream models. The good news is that deduplication is a well-understood problem with layered solutions. This guide walks through where duplicates come from, how to eliminate them at each stage of a streaming pipeline, and how to combine strategies for defense in depth.

Sources of Duplicates

Understanding where duplicates originate is the first step to eliminating them. There is no single root cause. Instead, duplicates creep in at every boundary in a distributed pipeline.

At-Least-Once Delivery

Kafka’s default delivery guarantee is at-least-once. Producers send a message and wait for an acknowledgment from the broker. If the acknowledgment is lost (network timeout, broker failover), the producer retries. The broker may have already persisted the message, so now two copies exist on the topic. On the consumer side, if a process crashes after processing a message but before committing its offset, the message will be re-delivered when the consumer restarts.
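This failure mode can be illustrated without a real Kafka cluster. The toy simulation below (all names are illustrative, not a Kafka API) processes messages from a committed offset, "crashes" after processing a message but before committing, and then resumes - reproducing the duplicate on restart:

```python
# Toy simulation (no real Kafka): a consumer that crashes after processing
# a message but before committing its offset re-processes that message on
# restart -- the essence of at-least-once delivery.

def consume(messages, committed_offset, crash_before_commit_at=None):
    """Process messages from committed_offset; optionally crash before commit."""
    processed = []
    offset = committed_offset
    for i in range(committed_offset, len(messages)):
        processed.append(messages[i])   # side effect happens first
        if i == crash_before_commit_at:
            return processed, offset    # crash: offset was never advanced
        offset = i + 1                  # commit offset only after processing
    return processed, offset

messages = ["a", "b", "c"]

# First run crashes after processing "b" but before committing its offset.
run1, offset = consume(messages, 0, crash_before_commit_at=1)
# The restart resumes from the last committed offset, so "b" is processed again.
run2, offset = consume(messages, offset)

print(run1 + run2)  # ['a', 'b', 'b', 'c'] -- "b" appears twice
```

Swapping the two steps (commit first, then process) trades the duplicate for potential data loss, which is exactly the at-most-once / at-least-once tradeoff.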

Producer Retries

Even outside Kafka, any system that retries on failure can produce duplicates. An HTTP webhook fires, the receiving service processes it but returns a 503 due to a transient load balancer issue, and the sender retries. The event is now recorded twice.

CDC Snapshot-to-Stream Overlap

Change Data Capture connectors typically start by taking a snapshot of the source table, then switch to reading the change stream (WAL, binlog, oplog). There is an inherent overlap window: rows that were modified during the snapshot may appear in both the snapshot results and the change stream. The same row surfaces twice with slightly different metadata.

Pipeline Restarts and Checkpointing

Stream processors like Flink checkpoint their state periodically. If a job fails between checkpoints, it rolls back to the last successful checkpoint and replays input from that point. Any records processed after the checkpoint but before the failure will be processed again. This is by design - it guarantees no data loss - but it means downstream systems see duplicates unless the pipeline or the sink handles them.

Deduplication Strategies

There is no single silver bullet. Effective deduplication uses multiple strategies at different layers, each reducing the scope of duplicates that the next layer must handle.

Exactly-Once Semantics

Exactly-once processing eliminates duplicates within the pipeline itself. Kafka supports exactly-once semantics (EOS) through idempotent producers and transactional writes. Flink achieves exactly-once through its checkpoint-barrier protocol combined with transactional sinks. However, exactly-once only covers the boundary it controls. It cannot prevent duplicates that originate in the source system or from CDC overlap.

Stateful Deduplication

A stream processor maintains a set of seen keys in state. For each incoming record, it checks whether the key has been seen before. If yes, the record is dropped. If no, the key is added to state and the record is emitted. This is simple in concept but expensive in practice - state grows with the number of unique keys and must be bounded to avoid memory exhaustion.
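A minimal sketch of this idea, with the state bound implemented as LRU eviction (the `StatefulDeduplicator` class and its parameters are illustrative, not a Flink API):

```python
from collections import OrderedDict

class StatefulDeduplicator:
    """Drops records whose key has been seen; bounds state via LRU eviction.

    max_keys caps memory. Once it is exceeded the oldest key is evicted, so a
    very late duplicate of an evicted key slips through -- the tradeoff noted
    above between state size and dedup coverage.
    """

    def __init__(self, max_keys=1_000_000):
        self.seen = OrderedDict()
        self.max_keys = max_keys

    def process(self, key, record):
        if key in self.seen:
            self.seen.move_to_end(key)      # refresh recency
            return None                     # duplicate: drop
        self.seen[key] = True
        if len(self.seen) > self.max_keys:
            self.seen.popitem(last=False)   # evict the oldest key
        return record                       # first sighting: emit

# Tiny bound to make eviction visible: after "c" arrives, "a" is evicted,
# so the final duplicate "a" is wrongly emitted again.
dedup = StatefulDeduplicator(max_keys=2)
out = [dedup.process(k, k) for k in ["a", "a", "b", "c", "a"]]
print(out)  # ['a', None, 'b', 'c', 'a']
```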

Idempotent Sinks

An idempotent sink ensures that writing the same record multiple times produces the same result as writing it once. The most common implementation is UPSERT (INSERT … ON CONFLICT UPDATE) on a primary key. If a duplicate arrives, it overwrites the existing row rather than creating a new one. This is the last line of defense and often the simplest to implement.
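The property is easy to demonstrate with SQLite, which supports the same INSERT … ON CONFLICT syntax (a minimal sketch, not a production sink - batching and transactions are omitted):

```python
import sqlite3

# Minimal idempotent-sink demo: writing the same record twice via UPSERT
# leaves exactly one row in the destination table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT)")

def write(record):
    conn.execute(
        """INSERT INTO users (id, email, updated_at) VALUES (?, ?, ?)
           ON CONFLICT (id) DO UPDATE SET
             email = excluded.email,
             updated_at = excluded.updated_at""",
        (record["id"], record["email"], record["updated_at"]),
    )

# The same record arrives twice, e.g. due to a replayed Kafka message.
write({"id": 1, "email": "a@example.com", "updated_at": "2026-01-01"})
write({"id": 1, "email": "a@example.com", "updated_at": "2026-01-01"})

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)  # 1 -- the duplicate overwrote the existing row
```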

Time-Windowed Deduplication

Rather than tracking every key ever seen, time-windowed dedup only tracks keys within a recent time window (e.g., the last 10 minutes). Records outside the window are assumed to be unique. This bounds state size at the cost of missing very late duplicates.
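A sketch of the windowed variant (the `WindowedDeduplicator` class is illustrative; the linear purge on every record keeps the example short - production code would use timers or an ordered structure):

```python
class WindowedDeduplicator:
    """Tracks keys only within a rolling time window. State stays proportional
    to recent traffic rather than the total key space, at the cost of missing
    duplicates that arrive after the window has expired."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.last_seen = {}  # key -> timestamp of first sighting

    def process(self, key, timestamp):
        # Purge entries that have aged out of the window (linear scan for clarity).
        cutoff = timestamp - self.window
        self.last_seen = {k: t for k, t in self.last_seen.items() if t >= cutoff}
        if key in self.last_seen:
            return False               # duplicate inside the window: drop
        self.last_seen[key] = timestamp
        return True                    # emit

dedup = WindowedDeduplicator(window_seconds=600)   # 10-minute window
r1 = dedup.process("evt-1", timestamp=0)           # first sighting
r2 = dedup.process("evt-1", timestamp=30)          # duplicate inside window
r3 = dedup.process("evt-1", timestamp=1200)        # window expired: treated as new
print(r1, r2, r3)  # True False True
```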

Flink SQL Deduplication

Flink SQL provides a clean, declarative pattern for deduplication using the ROW_NUMBER() window function. This is the standard approach recommended by the Flink documentation.

The ROW_NUMBER Pattern

The core idea is to partition records by their deduplication key, order them by a timestamp or sequence number, and keep only the first (or last) record per key.

-- Keep only the latest record per primary key
SELECT id, name, email, updated_at
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY updated_at DESC
    ) AS row_num
  FROM user_events
)
WHERE row_num = 1;

This query assigns a row number to each record within its partition. ORDER BY updated_at DESC ensures the most recent record gets row_num = 1. The outer WHERE clause filters everything else.

For first-write-wins semantics (keep the earliest record), reverse the sort order:

-- Keep only the first record per primary key
SELECT id, name, email, created_at
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY created_at ASC
    ) AS row_num
  FROM user_events
)
WHERE row_num = 1;

State Management

Flink maintains state for each unique key in the PARTITION BY clause. For high-cardinality keys (e.g., user IDs across millions of users), this state can grow large. Two levers control state size:

  • State TTL: Configure table.exec.state.ttl to automatically expire state entries after a duration. For example, setting TTL to 24 hours means Flink only deduplicates within a 24-hour window.
  • Processing-Time Dedup: Use PROCTIME() instead of event time to bound deduplication to the processing-time window, which is simpler to reason about.

-- Time-bounded dedup with processing time
SELECT id, payload, proc_time
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY proc_time ASC
    ) AS row_num
  FROM (
    SELECT *, PROCTIME() AS proc_time FROM raw_events
  )
)
WHERE row_num = 1;

Kafka-Level Deduplication

Kafka itself provides mechanisms to prevent duplicates before they ever reach a stream processor.

Idempotent Producers

Enabling enable.idempotence=true on a Kafka producer assigns a sequence number to each message. The broker tracks these sequence numbers per partition and silently discards any message whose sequence number has already been committed. This eliminates duplicates caused by producer retries at zero cost to consumers.

# Producer configuration
enable.idempotence=true
acks=all
retries=2147483647
max.in.flight.requests.per.connection=5

Transactional Writes

For consume-transform-produce patterns, Kafka transactions ensure that reading from one topic and writing to another is atomic. Either all records in the transaction are committed, or none are. This prevents the scenario where a consumer processes and produces records, then crashes before committing its consumer offset - which would cause re-processing and duplicate output.

producer.initTransactions();
try {
    producer.beginTransaction();
    // produce transformed records to the output topic
    producer.sendOffsetsToTransaction(offsets, consumerGroupId);
    producer.commitTransaction();
} catch (KafkaException e) {
    // abort on failure: uncommitted records are never visible to read_committed consumers
    producer.abortTransaction();
}

Consumer Isolation

Set isolation.level=read_committed on consumers to ensure they only see records from committed transactions. This pairs with transactional producers to provide end-to-end exactly-once within Kafka.
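Mirroring the producer configuration above, a minimal consumer-side counterpart might look like this (isolation.level is the documented Kafka consumer property; disabling auto-commit is an assumption that fits the consume-transform-produce pattern, where offsets are committed through the producer transaction instead):

```
# Consumer configuration
isolation.level=read_committed
enable.auto.commit=false
```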

Destination-Level Deduplication

Even with pipeline-level deduplication, idempotent sinks provide an essential safety net. Different destinations offer different mechanisms.

UPSERT / MERGE Patterns

Most analytical databases support some form of upsert. The pattern is the same: use the primary key to determine whether to insert or update.

PostgreSQL:

INSERT INTO users (id, name, email, updated_at)
VALUES ($1, $2, $3, $4)
ON CONFLICT (id)
DO UPDATE SET name = EXCLUDED.name,
             email = EXCLUDED.email,
             updated_at = EXCLUDED.updated_at;

Snowflake MERGE:

MERGE INTO target_table t
USING staging_table s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.name = s.name, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (id, name, updated_at) VALUES (s.id, s.name, s.updated_at);

ClickHouse ReplacingMergeTree

ClickHouse takes a different approach. The ReplacingMergeTree engine stores all versions of a row and collapses duplicates during background merges. You specify a version column, and ClickHouse keeps only the row with the highest version.

CREATE TABLE users (
  id UInt64,
  name String,
  updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY id;

Note that deduplication is eventual - until a merge runs, queries may return duplicates. Use FINAL in queries when you need guaranteed deduplication:

SELECT * FROM users FINAL WHERE id = 12345;

Time-Bounded vs. Unbounded Dedup

The choice between time-bounded and unbounded deduplication is a fundamental tradeoff between correctness and resource consumption.

Unbounded dedup tracks every key ever seen. It guarantees that no duplicate will ever pass through, regardless of how late it arrives. The cost is state that grows monotonically with key cardinality. For a table with 100 million unique primary keys, the dedup state must hold 100 million entries.

Time-bounded dedup tracks keys only within a rolling window. State size is proportional to the number of unique keys within the window, not the total key space. A 1-hour window on a stream with 50,000 events per second and 80% key uniqueness requires state for roughly 144 million keys - still large, but bounded and predictable.

The right choice depends on your duplicate profile. If duplicates always arrive within seconds of the original (producer retries, consumer rebalances), a short window of 5-10 minutes is sufficient. If duplicates can arrive hours later (pipeline restarts, backfill jobs), you need a longer window or must rely on idempotent sinks as a fallback.

Practical Example: CDC Pipeline with Snapshot Overlap

Consider a CDC pipeline reading from PostgreSQL into a Kafka topic, processed by Flink, and written to Snowflake. During the initial snapshot, the connector reads 10 million rows. While the snapshot is in progress, 50,000 rows are updated. These 50,000 rows appear in both the snapshot and the WAL stream - creating duplicates on the Kafka topic.

Here is how layered deduplication handles this:

  1. Kafka idempotent producer: The CDC connector uses enable.idempotence=true, eliminating any duplicates from network retries during the snapshot.
  2. Flink SQL ROW_NUMBER: Flink deduplicates by primary key, keeping only the latest version of each row based on the WAL position or timestamp. The 50,000 overlapping rows are reduced to one version each.
  3. Snowflake MERGE: Streamkap writes to Snowflake using MERGE on primary key. Even if Flink’s state TTL causes a late duplicate to slip through, the MERGE statement updates the existing row rather than inserting a new one.

The result is a clean, deduplicated table in Snowflake that exactly mirrors the source PostgreSQL table.

Performance Considerations

Deduplication is not free. Every strategy carries a cost that scales with your data characteristics.

State size is the dominant concern for stateful dedup. Each key in state consumes memory (or disk, with RocksDB state backend). For Flink, this means choosing the right state backend: HashMapStateBackend for low-cardinality keys that fit in memory, RocksDBStateBackend for high-cardinality keys that must spill to disk.

Key cardinality directly determines state size. Deduplicating by a composite key (e.g., user_id + event_type + session_id) creates far more state entries than deduplicating by a simple primary key. Use the narrowest key that correctly identifies duplicates.

Dedup window duration affects both state size and correctness. Longer windows catch more duplicates but consume more resources. Profile your duplicate arrival pattern - if 99% of duplicates arrive within 60 seconds, a 5-minute window gives you ample margin without the cost of a 24-hour window.

Sink throughput is affected by UPSERT operations. An INSERT is generally faster than an UPSERT because an UPSERT must check for key conflicts on every write. For high-volume sinks, batch your upserts and use staging tables with a periodic MERGE rather than row-by-row conflict resolution.

Layered Deduplication: Defense in Depth

No single deduplication strategy covers every case. The most reliable pipelines combine multiple layers, each catching what the previous layer missed.

Layer | Mechanism | Catches
Source / Kafka | Idempotent producers, EOS | Producer retries, network duplicates
Stream processor | ROW_NUMBER, stateful dedup | CDC overlap, rebalance replays, backfill overlap
Sink | UPSERT / MERGE on PK | Late duplicates, state TTL gaps, pipeline restarts

The key insight is that each layer is cheap relative to the duplicates it prevents. Idempotent producers add negligible overhead. Flink’s ROW_NUMBER dedup is a single operator in the execution plan. UPSERT sinks add a primary key check per write. Together, they provide a guarantee that is stronger than any individual layer.

Streamkap implements this layered approach by default. Exactly-once delivery semantics handle pipeline-level duplicates, while primary key-based idempotent writes to destinations like Snowflake, ClickHouse, and PostgreSQL ensure that even edge-case duplicates from source systems or CDC overlap are resolved cleanly. The result is a destination table that faithfully mirrors the source - no missing rows, no extra rows.

Deduplication in streaming is not about achieving theoretical perfection at one layer. It is about accepting that duplicates are inevitable and building practical defenses at every boundary where they can appear.