
Architecture & Patterns

May 22, 2025

11 min read

Exactly-Once Delivery: How It Actually Works Under the Hood

A deep technical breakdown of exactly-once delivery in streaming systems — idempotent producers, Kafka transactions, consumer offsets, and when you need it.

“Exactly-once delivery” is one of the most misunderstood concepts in distributed systems. Engineers either dismiss it as impossible (citing the Two Generals’ Problem) or assume it means something magical is happening at the network layer. The truth is more practical: exactly-once is an application-level guarantee built on top of at-least-once delivery, using idempotence and atomic transactions to eliminate the visible effects of duplicates.

This article breaks down how it actually works — from idempotent producers to Kafka transactions to end-to-end exactly-once across a complete streaming pipeline.

Why “Effectively-Once” Is the Right Mental Model

At the network level, you have two options:

  • At-most-once: Send a message and do not retry. The message might be lost.
  • At-least-once: Send a message and retry until acknowledged. The message might be delivered multiple times.

There is no “send exactly once” at the TCP/network layer. A sender cannot know whether the receiver processed a message if the acknowledgment was lost. So the sender retries, and now the receiver has seen the message twice.

Exactly-once systems accept this reality and add a layer on top: they make duplicates invisible. The receiver detects and discards duplicates so that the end state is identical to what it would be if each message arrived exactly once.

This is why the Kafka community sometimes calls it “effectively-once” — the processing outcome is the same as exactly-once, even though the underlying transport may deliver messages more than once.

Building Block 1: Idempotent Producers

The first piece is preventing duplicate writes from the producer to the broker.

The Problem

Without idempotence, here is what can go wrong:

  1. Producer sends record to broker
  2. Broker writes record, sends ACK
  3. ACK is lost (network issue)
  4. Producer retries, sends the same record again
  5. Broker writes a duplicate

The topic now has two copies of the same record.

The Solution: Producer ID + Sequence Numbers

When you enable idempotence, Kafka assigns each producer instance a unique Producer ID (PID). The producer tags every record with its PID and a monotonically increasing sequence number per partition.

The broker tracks the last sequence number it accepted from each PID for each partition. If a record arrives with a sequence number that has already been written, the broker discards it and returns a success response (since it was already persisted).
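The broker-side check can be sketched as a small simulation (a hypothetical `Broker` class, not broker code — real brokers track this state per (PID, partition) alongside the log):

```python
class Broker:
    """Toy model of per-(PID, partition) sequence tracking for idempotent writes."""
    def __init__(self):
        self.last_seq = {}   # (pid, partition) -> last accepted sequence number
        self.log = []

    def append(self, pid, partition, seq, record):
        last = self.last_seq.get((pid, partition), -1)
        if seq <= last:
            return "DUPLICATE"       # already persisted: ack success, skip the write
        if seq != last + 1:
            return "OUT_OF_ORDER"    # gap detected: reject so the producer retries
        self.log.append(record)
        self.last_seq[(pid, partition)] = seq
        return "OK"

broker = Broker()
broker.append(pid=1, partition=0, seq=0, record="a")   # OK
broker.append(pid=1, partition=0, seq=0, record="a")   # DUPLICATE (retry after lost ACK)
broker.append(pid=1, partition=0, seq=1, record="b")   # OK
```

The retried record from the lost-ACK scenario above hits the `DUPLICATE` branch: the broker acknowledges it without writing a second copy.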

# Kafka producer configuration for idempotent writes
producer_config = {
    'bootstrap.servers': 'kafka:9092',
    'enable.idempotence': True,    # Enable idempotent producer
    'acks': 'all',                  # Required for idempotence
    'max.in.flight.requests.per.connection': 5,  # Max 5 with idempotence
    'retries': 2147483647,          # Retry indefinitely
}

With this configuration:

  • enable.idempotence=true activates PID and sequence number tracking
  • acks=all ensures the record is replicated before acknowledgment (required)
  • max.in.flight.requests.per.connection is capped at 5 to maintain ordering guarantees
  • Retries are effectively unlimited — duplicates are handled by the broker

What Idempotent Producers Do NOT Solve

Idempotent producers only prevent duplicates from a single producer to a single partition within a single session. They do not help with:

  • Cross-partition atomicity: Writing to partition A and partition B is not atomic
  • Read-process-write cycles: Reading from one topic, processing, and writing to another is not atomic
  • Producer restarts: A new producer instance gets a new PID, so it cannot detect duplicates from the previous instance (unless transactions are used)

For those, you need transactions.

Building Block 2: Kafka Transactions

Transactions extend idempotence to support atomic writes across multiple partitions and topics, and to coordinate consumer offset commits with produced records.

How the Transaction Protocol Works

Kafka transactions use a two-phase commit protocol coordinated by a Transaction Coordinator — a broker that manages transaction state for a given transactional.id.

Here is the sequence:

Producer                    Transaction Coordinator           Brokers
   │                                  │                          │
   │── InitTransactions ──────────>   │                          │
   │   (register transactional.id)    │                          │
   │<── PID + epoch ──────────────    │                          │
   │                                  │                          │
   │── BeginTransaction ──────────>   │                          │
   │                                  │                          │
   │── Produce(topic-A, partition-0) ─┼──────────────────────>   │
   │── Produce(topic-B, partition-1) ─┼──────────────────────>   │
   │── AddOffsetsToTxn ──────────->   │                          │
   │── TxnOffsetCommit ──────────────┼──────────────────────>   │
   │                                  │                          │
   │── CommitTransaction ─────────>   │                          │
   │                                  │── WriteTxnMarker ────>   │
   │                                  │   (COMMIT to all         │
   │                                  │    partitions)           │
   │<── Success ──────────────────    │                          │

Step by step:

  1. InitTransactions: The producer registers its transactional.id with the coordinator. The coordinator assigns a PID and an epoch number. If a previous producer with the same transactional.id exists, its epoch is fenced (any in-flight transactions from the old producer are aborted).

  2. BeginTransaction: Marks the start of a new transaction locally.

  3. Produce records: The producer writes records to any number of partitions and topics. These records are written to the log but are not yet visible to consumers (depending on isolation level).

  4. AddOffsetsToTxn + TxnOffsetCommit: The producer tells the coordinator which consumer group’s offsets are part of this transaction, then commits those offsets transactionally.

  5. CommitTransaction: The producer asks the coordinator to commit. The coordinator writes a COMMIT marker to all partitions involved in the transaction.

Transactional Producer Pseudocode

from confluent_kafka import Producer

config = {
    'bootstrap.servers': 'kafka:9092',
    'transactional.id': 'my-processor-0',  # Stable across restarts
    'enable.idempotence': True,
}

producer = Producer(config)
producer.init_transactions()

# `consumer` below is an existing confluent_kafka Consumer with
# enable.auto.commit=False; its offsets are committed inside the transaction.

try:
    producer.begin_transaction()

    # Produce output records
    producer.produce('output-topic', key=key, value=transformed_value)
    producer.produce('audit-topic', key=key, value=audit_event)

    # Commit consumer offsets as part of the transaction
    producer.send_offsets_to_transaction(
        consumer.position(consumer.assignment()),
        consumer.consumer_group_metadata()
    )

    producer.commit_transaction()

except Exception as e:
    producer.abort_transaction()
    raise

The transactional.id is important — it must be stable across producer restarts. When a producer with the same transactional.id starts up, Kafka fences the old producer (bumps the epoch) and aborts any incomplete transactions from the previous instance. This prevents zombie producers from committing stale transactions.
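Epoch fencing can be illustrated with a toy coordinator (hypothetical class and method names — the real protocol lives inside the broker):

```python
class TransactionCoordinator:
    """Toy model of epoch-based fencing for a transactional.id."""
    def __init__(self):
        self.epochs = {}  # transactional.id -> current epoch

    def init_transactions(self, txn_id):
        # Each InitTransactions bumps the epoch, fencing any older producer
        # that registered the same transactional.id.
        epoch = self.epochs.get(txn_id, -1) + 1
        self.epochs[txn_id] = epoch
        return epoch

    def commit(self, txn_id, epoch):
        if epoch < self.epochs[txn_id]:
            raise RuntimeError("ProducerFenced: a newer producer holds this id")
        return "COMMITTED"

coord = TransactionCoordinator()
old_epoch = coord.init_transactions("my-processor-0")   # epoch 0
new_epoch = coord.init_transactions("my-processor-0")   # epoch 1 (after restart)
coord.commit("my-processor-0", new_epoch)               # COMMITTED
# coord.commit("my-processor-0", old_epoch) would raise: the zombie is fenced
```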

Consumer Isolation Levels

Transactions only work if consumers respect them. Kafka consumers have two isolation levels:

  • read_uncommitted (default): The consumer sees all records, including those in incomplete transactions. This is the same as having no transaction support.
  • read_committed: The consumer only sees records from committed transactions. Records from aborted transactions are skipped.

consumer_config = {
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'my-consumer-group',
    'isolation.level': 'read_committed',  # Only see committed records
    'enable.auto.commit': False,           # Offsets managed by transactions
}

You must set isolation.level=read_committed on downstream consumers for exactly-once to work end-to-end. If any consumer in the chain uses read_uncommitted, it will see duplicates from aborted-then-retried transactions.

Building Block 3: End-to-End Exactly-Once

The three building blocks — idempotent producers, transactions, and committed-read consumers — combine to create end-to-end exactly-once semantics for a read-process-write pipeline:

Source Topic ──> Consumer ──> Process ──> Producer ──> Sink Topic
                   │                        │
                   └────── Same Transaction ──┘

The atomic unit is: read input records + process them + write output records + commit input offsets. All of this happens in a single Kafka transaction.

If any step fails:

  • The transaction is aborted
  • Output records become invisible (consumers with read_committed skip them)
  • Input offsets are not committed
  • On restart, the consumer re-reads the same input records and reprocesses them

The result: every input record is processed exactly once, because the output write and the offset commit are atomic.

What This Looks Like in Practice

A typical stream processing application:

consumer = Consumer(consumer_config)
producer = Producer(producer_config)
producer.init_transactions()
consumer.subscribe(['input-topic'])

while True:
    # confluent_kafka's poll() returns a single message; consume() returns a batch
    records = consumer.consume(num_messages=100, timeout=1.0)
    if not records:
        continue

    producer.begin_transaction()
    try:
        for record in records:
            result = process(record)
            producer.produce('output-topic', key=result.key, value=result.value)

        producer.send_offsets_to_transaction(
            consumer.position(consumer.assignment()),
            consumer.consumer_group_metadata()
        )
        producer.commit_transaction()

    except Exception:
        producer.abort_transaction()
        # Rewind to the last committed offsets so the batch is re-consumed;
        # the consumer's position has already advanced past it
        for tp in consumer.committed(consumer.assignment()):
            if tp.offset >= 0:   # skip partitions with no committed offset yet
                consumer.seek(tp)

The Source and Sink Problem

Kafka transactions give you exactly-once within Kafka — from topic to topic. But most real pipelines start at a database and end at a warehouse, search index, or cache. The boundaries are where things get tricky.

Source Side: CDC and Exactly-Once

CDC (change data capture) engines read database transaction logs. The log position (LSN in PostgreSQL, binlog offset in MySQL) serves as the source-side cursor. For exactly-once, the CDC engine must atomically commit the log position along with the produced records.

If the CDC engine crashes after producing records but before saving the log position, it will re-read and re-produce those records on restart. Kafka’s idempotent producer handles this for within-session retries, but a full restart means a new producer session.

The practical approach: use transactional producers in the CDC engine, storing the log position as part of the Kafka transaction. On restart, the engine reads the last committed position from Kafka (or from a separate metadata store) and resumes from there.
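A toy model (not real CDC code) of why the atomic cursor-plus-records commit prevents duplicates — `TxnLog` and `cdc_step` are hypothetical names standing in for the Kafka transaction and the log reader:

```python
# The cursor and the records commit as one unit, so a crash between reading
# the database log and committing never produces duplicates on replay.
class TxnLog:
    """Minimal transactional sink: records and cursor update atomically."""
    def __init__(self):
        self.records = []
        self.position = 0    # analogous to the committed LSN / binlog offset

    def commit(self, changes, new_position):
        self.records.extend(changes)     # atomic with the cursor update
        self.position = new_position

def cdc_step(source_log, sink, crash=False):
    changes = source_log[sink.position:]   # resume from the committed cursor
    if crash:
        return                             # abort: cursor unchanged, nothing written
    sink.commit(changes, len(source_log))

source = ["insert:1", "update:1", "delete:1"]
sink = TxnLog()
cdc_step(source, sink, crash=True)   # crash before commit: nothing advances
cdc_step(source, sink)               # retry reads the same changes, applies them once
cdc_step(source, sink)               # nothing new to read: no duplicates
```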

Sink Side: Idempotent Writes

The sink — the final destination — is outside Kafka’s transaction boundary. You cannot do a distributed transaction across Kafka and Snowflake, or Kafka and Elasticsearch. Instead, the standard approach is idempotent sinks:

  • Upsert by primary key: Write records using the source primary key as the destination key. Duplicates overwrite with the same data.
  • Deduplication table: Track processed offsets or record IDs in the destination. Skip records that have already been applied.
  • Transactional outbox in the sink: If the destination supports transactions (like a relational database), commit the data write and the offset update in the same database transaction.

The combination of exactly-once within Kafka and idempotent sinks gives you effectively-once end-to-end.
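The upsert approach can be sketched with SQLite standing in for the destination (hypothetical table and record shape — a real sink would use the warehouse's MERGE or upsert equivalent):

```python
import sqlite3

# Idempotent sink: the source primary key is the destination key, so a
# duplicate delivery overwrites the row with identical data — a no-op.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)')

def apply(record):
    db.execute(
        'INSERT INTO users (id, email) VALUES (?, ?) '
        'ON CONFLICT(id) DO UPDATE SET email = excluded.email',
        (record['id'], record['email']),
    )

apply({'id': 1, 'email': 'a@example.com'})
apply({'id': 1, 'email': 'a@example.com'})   # duplicate delivery: state unchanged
count = db.execute('SELECT COUNT(*) FROM users').fetchone()[0]   # 1 row
```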

When You Actually Need Exactly-Once

Exactly-once comes with overhead. Transactions add latency (the two-phase commit), reduce maximum throughput, and add complexity to your application code. So when is it worth it?

You Need It When:

  • Financial transactions: Double-charging a customer or double-counting revenue is not acceptable
  • Inventory management: Overcounting or undercounting stock causes real-world problems
  • Billing and metering: Usage-based billing must be accurate
  • Stateful aggregations: Running totals, counters, and windowed aggregations will drift if duplicates are counted

You Can Skip It When:

  • Idempotent destinations: Upserting to a database with a natural key handles duplicates automatically
  • Append-only analytics: If your warehouse deduplicates during query time (e.g., with QUALIFY ROW_NUMBER() in Snowflake), at-least-once is fine
  • Notifications and alerts: Sending an alert twice is usually better than not sending it at all
  • Search index updates: Elasticsearch upserts by document _id are naturally idempotent

For a deeper comparison, see the exactly-once vs at-least-once guide.

Performance Implications

Some real numbers to set expectations:

Metric                  At-Least-Once    Exactly-Once (Transactional)
Produce latency (p99)   5-15ms           50-200ms (transaction commit)
Throughput              100% baseline    70-85% of baseline
Broker CPU overhead     Baseline         +10-15% (transaction tracking)

The biggest hit is transaction commit latency. Each commit_transaction() call requires a round trip to the transaction coordinator and COMMIT markers written to all involved partitions. Batching more records per transaction amortizes this cost.

A common pattern: commit transactions every N records or every T milliseconds, whichever comes first. Larger batches = better throughput but higher latency.

BATCH_SIZE = 500
BATCH_TIMEOUT_MS = 1000

# Commit when batch is full or timeout expires
if len(batch) >= BATCH_SIZE or time_since_last_commit() > BATCH_TIMEOUT_MS:
    producer.send_offsets_to_transaction(...)
    producer.commit_transaction()
    producer.begin_transaction()
    batch.clear()

Common Pitfalls

Forgetting read_committed on consumers. If any downstream consumer reads uncommitted records, you lose exactly-once guarantees for that consumer even though the producer side is correct.

Unstable transactional.id. If the transactional ID changes across restarts (e.g., it includes a random UUID), the new producer cannot fence the old one. Use a deterministic ID based on the processing partition assignment.
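One way to derive a stable ID (hypothetical helper; the naming scheme is an assumption — any deterministic function of the processing identity works):

```python
# Derive the transactional.id from the stable processing identity
# (application name + assigned partition), never from a random UUID,
# so a restarted producer can fence its predecessor.
def transactional_id(app_name: str, partition: int) -> str:
    return f"{app_name}-txn-{partition}"

transactional_id("order-processor", 3)   # 'order-processor-txn-3'
```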

Transaction timeout too short. The default transaction.timeout.ms is 60 seconds. If your processing takes longer than this for a batch, the coordinator aborts the transaction. Increase it for heavy processing, but keep it reasonable — long-running transactions hold resources.

Mixing transactional and non-transactional producers. If you write to a topic with both transactional and non-transactional producers, consumers with read_committed will see non-transactional records immediately but must wait for transactional records to be committed. This can cause confusing ordering behavior.

Choosing Your Guarantee Level

The right choice depends on your specific pipeline. Many production streaming architectures use at-least-once processing with idempotent sinks — it is simpler, faster, and sufficient for most workloads. Reserve exactly-once for the pipelines where duplicates would cause real business impact.

Start with at-least-once. Add exactly-once where the cost of duplicates exceeds the cost of the additional complexity and latency.


Ready to build streaming pipelines with strong delivery guarantees? Streamkap handles exactly-once and at-least-once delivery under the hood, so you can focus on your data rather than transaction coordination. Start a free trial or learn more about exactly-once vs at-least-once delivery.