Exactly-Once vs At-Least-Once: Choosing Delivery Guarantees
A practical comparison of exactly-once and at-least-once processing guarantees in stream processing. When each one matters, how they work, and what they actually cost.
Every stream processing system makes a promise about what happens when things go wrong. That promise is its delivery guarantee, and it determines whether your pipeline loses data, duplicates data, or handles failures cleanly. If you’ve spent time evaluating tools like Apache Kafka, Apache Flink, or managed platforms like Streamkap, you’ve likely run into the terms “at-most-once,” “at-least-once,” and “exactly-once.” These are not just theoretical labels. They shape how you design your pipelines, what failure modes you accept, and how much operational overhead you take on.
This article breaks down all three guarantees, explains how exactly-once actually works under the hood, and gives you a practical framework for deciding which one your pipeline needs.
The Three Delivery Guarantees
Before comparing exactly-once and at-least-once in detail, it helps to understand the full spectrum. There are three delivery semantics in stream processing, and each one makes a different trade-off between data loss, duplication, and system complexity.
At-Most-Once
At-most-once is the simplest guarantee. The system commits the consumer offset before processing the event. If the processor crashes after committing but before finishing its work, that event is gone. It will never be retried.
This is sometimes called “fire and forget.” It is the fastest option because it requires zero coordination between processing and offset management. But it accepts data loss as a normal failure mode.
When does this make sense? Almost never in a data pipeline. But it has a narrow role in high-volume, low-value streams - think application-level metrics or debug logs where losing a few events is acceptable and the volume makes retry overhead impractical.
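The commit-before-process ordering can be sketched in a few lines. This is an illustrative simulation, not tied to any real client library; the function and parameter names are made up for the example.

```python
# Sketch: at-most-once consumption. The offset is committed BEFORE
# processing, so a crash mid-processing drops the in-flight event.

def consume_at_most_once(events, crash_at=None):
    committed = 0          # last committed offset
    processed = []         # what the downstream sink actually received
    for offset, event in enumerate(events):
        committed = offset + 1      # commit first ("fire and forget")
        if offset == crash_at:
            break                   # simulated crash before processing
        processed.append(event)     # process second
    return committed, processed

# Crash while handling event "b": its offset was already committed,
# so a restart from `committed` skips "b" entirely.
committed, processed = consume_at_most_once(["a", "b", "c"], crash_at=1)
print(committed, processed)   # 2 ['a']  -> event "b" is lost forever
```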
At-Least-Once
At-least-once flips the order. The system processes the event first, then commits the offset. If the processor crashes after processing but before committing, the event will be replayed on recovery. The data is never lost, but it may be processed more than once.
This is the default behavior in most Kafka consumer configurations. It is also what Flink provides when checkpointing is enabled but the sink does not support two-phase commits. At-least-once guarantees are simpler to implement, impose less coordination overhead, and work well with a wide range of sinks.
The cost is duplicates. On failure recovery, the system replays events from the last committed offset, and some of those events were already processed successfully. Your sink sees them again.
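The same sketch with the ordering flipped shows where the duplicates come from. Again, this is a hand-rolled simulation with invented names, not a real consumer API.

```python
# Sketch: at-least-once consumption. The offset is committed AFTER
# processing, so a crash between the two steps causes a replay.

def run_until_crash(events, crash_after_processing):
    committed = 0
    sink = []
    for offset, event in enumerate(events):
        sink.append(event)                  # process first
        if offset == crash_after_processing:
            return committed, sink          # crash before committing
        committed = offset + 1              # commit second
    return committed, sink

events = ["a", "b", "c"]
committed, sink = run_until_crash(events, crash_after_processing=1)
# The crash happened after processing "b" but before committing offset 2.
# Recovery resumes from `committed` (= 1), so "b" is replayed.
for event in events[committed:]:
    sink.append(event)

print(sink)   # ['a', 'b', 'b', 'c'] -> "b" is duplicated, never lost
```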
Exactly-Once
Exactly-once means that even when failures occur, the end-to-end effect on the output is as if each event was processed one time and one time only. No data loss, no duplicates in the output.
This sounds like it should be the obvious choice. But achieving it requires tight coordination between the source, the processor, and the sink. That coordination has real costs in terms of latency, throughput, and operational complexity.
How Exactly-Once Actually Works
There is a common misconception that exactly-once means each event is literally processed only once inside the system. That is not what happens. During failure recovery, events are replayed. The processor does see them again. The guarantee is about the observable output, not the internal execution.
The two primary mechanisms that make exactly-once possible in Apache Flink are distributed snapshots (checkpointing) and two-phase commit protocols on the sink.
Flink’s Checkpoint Mechanism
Flink uses a variant of the Chandy-Lamport algorithm to take consistent snapshots of the entire pipeline state. Here is how it works:
- The JobManager periodically injects checkpoint barriers into the source streams.
- These barriers flow through the dataflow graph alongside regular events.
- When an operator receives barriers from all its input channels, it snapshots its local state to durable storage (typically a distributed filesystem or object store like S3).
- Once all operators have completed their snapshots, the checkpoint is considered complete.
If a failure occurs, Flink restores all operators to their most recent completed checkpoint. The source rewinds to the offsets captured in that checkpoint, and processing resumes. Events between the last checkpoint and the failure are replayed, but since the operator state was also restored to the checkpoint, the replayed processing produces the same output as the original run.
This gives you exactly-once within the Flink pipeline. But the output is a different story.
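A drastically simplified simulation makes the snapshot-and-rewind idea concrete. This is not the actual barrier protocol; it just shows why snapshotting state and offset atomically makes replayed processing deterministic, while the raw output stream still repeats events.

```python
# Sketch: operator state and source offset are snapshotted together,
# so replayed events rebuild exactly the same state after a restore.

def running_sum_pipeline(events, checkpoint_every=2, crash_at=5):
    state, offset = 0, 0
    checkpoint = (0, 0)                  # (state, offset) on durable storage
    emitted = []                         # raw output stream, pre-sink
    while offset < len(events):
        if offset == crash_at:           # simulated failure
            state, offset = checkpoint   # restore last completed snapshot
            crash_at = None              # fail only once
            continue
        state += events[offset]
        emitted.append(state)
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (state, offset) # state and offset saved atomically
    return state, emitted

state, emitted = running_sum_pipeline([1, 2, 3, 4, 5, 6])
print(state)     # 21 -> final state is correct despite the crash
print(emitted)   # [1, 3, 6, 10, 15, 15, 21] -> the raw output repeats 15
```

Note the duplicate 15 in the emitted stream: the internal state is exactly-once correct, but a naive sink would still see the replayed result, which is the problem the next mechanism addresses.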
Two-Phase Commits for End-to-End Guarantees
The checkpoint mechanism alone does not prevent duplicates in external sinks. If Flink wrote results to a database before a failure and then replayed those events after recovery, the database would see duplicates.
To solve this, Flink uses a two-phase commit (2PC) protocol with sinks that support transactions:
- Pre-commit phase: During normal processing, Flink writes output to the sink inside an open transaction. The data is written but not yet visible to downstream consumers.
- Commit phase: When a checkpoint completes successfully, Flink commits the transaction. Only then does the output become visible.
- Abort on failure: If a checkpoint fails, the uncommitted transaction is rolled back. No duplicate data reaches the sink.
This is how Flink achieves exactly-once with Kafka as a sink. The Kafka producer uses transactional writes, and consumers configured with isolation.level=read_committed only see committed data. The same pattern works with other transactional sinks.
The catch: your sink must support transactions. Not all do. And even when they do, the two-phase commit adds latency because output is not visible until the next checkpoint completes.
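The pre-commit/commit/abort lifecycle can be sketched as a toy transactional sink. The class and method names here are illustrative, not a real Flink or Kafka API; the point is only that buffered output survives replay without becoming visible twice.

```python
# Sketch of a two-phase-commit sink: output is buffered in an open
# "transaction" and becomes visible only when the checkpoint that
# covers it completes.

class TransactionalSink:
    def __init__(self):
        self.visible = []       # what read_committed consumers see
        self.pending = []       # open transaction (pre-committed data)

    def write(self, record):
        self.pending.append(record)     # phase 1: pre-commit

    def commit(self):                   # phase 2: checkpoint succeeded
        self.visible.extend(self.pending)
        self.pending = []

    def abort(self):                    # checkpoint failed: roll back
        self.pending = []

sink = TransactionalSink()
sink.write("a"); sink.write("b")
sink.abort()                        # failure before checkpoint completed
sink.write("a"); sink.write("b")    # same events replayed after recovery
sink.commit()                       # checkpoint completed: now visible
print(sink.visible)                 # ['a', 'b'] -> no duplicates downstream
```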
The End-to-End Picture: Kafka + Flink
A common production topology is Kafka as the source, Flink as the processor, and Kafka (or a database) as the sink. Here is how delivery guarantees play out across this stack:
Source side (Kafka to Flink): Flink stores Kafka consumer offsets as part of its checkpoint state. On recovery, it rewinds the consumer to the checkpointed offsets. This means the source side naturally supports exactly-once - the offset and the processing state are atomically consistent.
Processing (inside Flink): Checkpointing ensures that operator state and position in the stream are always consistent. Replayed events produce identical results because the state was rolled back to match.
Sink side (Flink to Kafka): Flink’s Kafka producer uses transactions. Events are written to Kafka inside a transaction that is committed only when the checkpoint completes. Downstream consumers using read_committed isolation see each event exactly once.
Sink side (Flink to a database): If the database supports transactions and the Flink sink implements the TwoPhaseCommitSinkFunction, you get end-to-end exactly-once. If the database does not support transactions, you fall back to at-least-once.
This is important: exactly-once is a property of the entire pipeline, not just one component. A single link in the chain that does not support it breaks the guarantee for the whole path.
The Performance Trade-Off
Exactly-once is not free. The coordination required for checkpointing and two-phase commits has measurable effects on latency and throughput.
Checkpointing Overhead
Checkpoints require snapshotting operator state to durable storage. For operators with large state (like windowed aggregations or joins), this can produce periodic latency spikes. The size of these spikes depends on state size, storage backend speed, and whether you use incremental checkpoints.
Typical checkpoint intervals range from 1 second to several minutes. Shorter intervals give faster recovery (less data to replay) but increase the frequency of state snapshots. Longer intervals reduce overhead but mean more data must be replayed after a failure.
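A back-of-envelope calculation shows how the interval bounds worst-case replay volume. The throughput figure is an illustrative assumption, not a benchmark.

```python
# Worst-case replay after a failure is roughly one checkpoint interval's
# worth of events: everything since the last completed snapshot.

throughput = 50_000                      # events/second (assumed)
replay_bound = {s: throughput * s for s in (1, 10, 60)}

for interval_s, events in replay_bound.items():
    print(f"{interval_s:>3}s checkpoint interval -> "
          f"up to {events:,} events replayed on recovery")
```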
Output Latency
With two-phase commits, output is not visible to downstream consumers until the checkpoint completes. If your checkpoint interval is 10 seconds, your output latency has a floor of 10 seconds - even if the actual processing takes milliseconds. This is a hard constraint that many teams overlook when evaluating exactly-once.
With at-least-once, output can be emitted immediately. There is no need to wait for a checkpoint boundary. For latency-sensitive applications, this difference matters.
Throughput Impact
Checkpointing and transactional writes both consume resources - CPU for serialization, network bandwidth for state transfer, disk I/O on the storage backend. In high-throughput pipelines, this overhead can reduce effective throughput by 10-30% compared to at-least-once processing, depending on state size and checkpoint frequency.
Idempotent Sinks: The Practical Alternative
Here is where the conversation gets interesting. Many teams default to exactly-once because it sounds like the “correct” option, without considering whether their sink already handles duplicates naturally.
An idempotent operation produces the same result whether you execute it once or multiple times. If your sink is idempotent, at-least-once delivery gives you the same end result as exactly-once - without the coordination overhead.
Common Idempotent Patterns
- Database upserts with a primary key: Writing UPDATE users SET email = 'x' WHERE id = 123 twice produces the same result as writing it once. The second write simply overwrites with the same value.
- Key-value stores with deterministic keys: Putting the same key-value pair into Redis or DynamoDB twice leaves the store in the same state.
- Overwrites to object storage: Writing a file to S3 with the same key replaces the previous version. If the content is the same, the result is identical.
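A minimal sketch of the first pattern, using a plain dict in place of a database table, shows why a key-based upsert absorbs duplicate deliveries while an append does not.

```python
# Sketch: an upsert keyed on a primary key is idempotent, so applying
# the same change twice leaves the store in the same state.

def upsert(table, row):
    table[row["id"]] = row          # insert-or-overwrite by primary key

users = {}
change = {"id": 123, "email": "x"}
upsert(users, change)
upsert(users, change)               # duplicate delivery on replay
print(users)    # {123: {'id': 123, 'email': 'x'}} -> same as applying once

# Contrast: a non-idempotent append duplicates on replay.
log = []
log.append(change); log.append(change)
print(len(log))                     # 2 -> duplicate row in the log
```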
Non-Idempotent Patterns
- Append-only logs: Inserting a row into a table without deduplication creates a duplicate.
- Counters and accumulators: Incrementing a counter twice when you meant to increment it once gives the wrong total.
- Sending emails or push notifications: You cannot un-send a message. Duplicates here mean duplicate notifications to users.
If your sink falls into the idempotent category, at-least-once with idempotent writes is often the better engineering choice. You get simpler operations, lower latency, higher throughput, and equivalent correctness.
Streamkap uses this pattern in its managed CDC pipelines. By combining at-least-once delivery with idempotent sink writes (like upserts keyed on primary keys), Streamkap delivers correct results without requiring the full coordination overhead of two-phase commits. This approach keeps latency low while still ensuring data accuracy in the destination.
A Decision Framework
When you are choosing a delivery guarantee for a new pipeline, work through these questions:
1. Can your sink tolerate duplicates? If your sink supports upserts or is naturally idempotent, at-least-once is likely sufficient. You get lower latency and simpler operations.
2. Is your output append-only? If you are appending events to a log, a data lake, or a table without natural deduplication, duplicates will accumulate. You need either exactly-once or a deduplication layer between the processor and the sink.
3. What is your latency budget? If you need sub-second output visibility, two-phase commits will be a bottleneck. At-least-once with idempotent sinks gives you faster output.
4. How large is your operator state? Large state means expensive checkpoints. If your pipeline maintains gigabytes of state for windowed joins, checkpoint overhead is significant. Weigh the cost of exactly-once against the cost of handling occasional duplicates downstream.
5. Does your entire sink chain support transactions? Exactly-once requires every sink in the pipeline to participate in the two-phase commit. If even one sink does not support transactions, you have at-least-once to that sink regardless of what the rest of the pipeline does.
6. What is the business cost of a duplicate vs. a delay? For financial transactions, duplicates can mean double charges - exactly-once is worth the latency. For analytics dashboards, a temporary duplicate row that gets corrected on the next upsert is harmless.
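The deduplication layer mentioned in point 2 can be sketched simply: give every event a unique id and have the writer drop ids it has already seen. The names here are invented for the example; a production version would need durable, bounded state (for instance, ids retained only within a replay window) rather than an unbounded in-memory set.

```python
# Sketch: dedup in front of an append-only sink, keyed by event id.

def dedup_append(log, seen, event):
    if event["id"] in seen:
        return False                # duplicate from a replay: drop it
    seen.add(event["id"])
    log.append(event)
    return True

log, seen = [], set()
events = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 2, "v": "b"}]
for e in events:                    # id 2 arrives twice (replayed)
    dedup_append(log, seen, e)

print([e["v"] for e in log])        # ['a', 'b'] -> exactly-once effect
```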
Common Misconceptions
“Exactly-once means each event is processed exactly once.” No. Events may be processed multiple times during recovery. The guarantee is that the effect on the output is as if each event was processed once. Internal retries happen; they are just invisible to the outside world.
“At-least-once always produces duplicates.” Not during normal operation. Duplicates only appear during failure recovery, when the system replays events from the last checkpoint. If your pipeline runs without failures (which is most of the time), at-least-once produces no duplicates at all.
“You should always choose exactly-once to be safe.” This is like saying you should always use serializable isolation in your database. It is the strongest guarantee, but it comes with real performance costs. Many production pipelines run at-least-once with idempotent sinks and produce perfectly correct results.
“Kafka alone provides exactly-once.” Kafka’s transactional producer and consumer give you exactly-once between Kafka topics. But if your pipeline reads from Kafka, processes data, and writes to an external database, you need the processor (like Flink) to coordinate the exactly-once guarantee end to end. Kafka’s built-in guarantee covers the Kafka-to-Kafka path, not the full pipeline.
Putting It Into Practice
The delivery guarantee you choose should match the actual requirements of your use case, not an abstract ideal. Start by understanding your sink. If it is idempotent, at-least-once with proper key design gives you correctness with minimal overhead. If it is not idempotent and duplicates have real business consequences, invest in exactly-once and accept the latency and complexity trade-off.
For teams building CDC pipelines or real-time data integration, platforms like Streamkap handle this complexity internally - choosing the right delivery strategy based on the source and destination characteristics so you do not have to manage checkpoint tuning and transactional sinks yourself.
The best delivery guarantee is the one that gives you correct results at the latency and operational cost your team can sustain. Sometimes that is exactly-once. Often, it is at-least-once with a well-designed sink.