Webhook to Kafka: Reliable Ingestion at Scale
How to build a reliable webhook ingestion layer using Kafka as a durable buffer — covering deduplication, ordering, retries, and dead letter queues.
Webhooks are the default integration pattern for SaaS platforms. Stripe sends payment events, GitHub sends push notifications, Shopify sends order updates. The pattern is simple: the sender fires an HTTP POST to your endpoint whenever something happens.
The problem starts when you need to process these events reliably at scale. A single slow consumer, a brief outage, or a traffic spike can cause dropped events — and most webhook senders have limited retry windows. Once that window closes, the data is gone.
This guide covers how to build a webhook ingestion layer that never drops events, handles duplicates correctly, and scales independently of your processing logic.
The Core Architecture
The key insight is separating receipt from processing. Your webhook endpoint should do exactly three things:
- Validate the request (auth, schema)
- Write it to a durable store
- Return 200 immediately
That durable store is Kafka. Everything downstream — transformation, enrichment, loading into databases — happens asynchronously by consuming from Kafka topics.
```
Webhook Sender → [Your Endpoint] → Kafka Topic → [Consumer A: Analytics DB]
                                               → [Consumer B: Cache Update]
                                               → [Consumer C: Notification Service]
```
This gives you several properties that a direct webhook-to-processing pipeline doesn’t:
- Durability: Events survive consumer downtime
- Replay: You can reprocess historical events by resetting consumer offsets
- Fan-out: Multiple consumers can independently process the same events
- Backpressure: Consumers process at their own rate without affecting receipt
The Webhook Receiver
Here’s a minimal webhook receiver in Python that validates, deduplicates, and produces to Kafka:
```python
from datetime import datetime, timezone

from fastapi import FastAPI, Request, HTTPException
from aiokafka import AIOKafkaProducer
import hashlib
import json
import redis.asyncio as redis

app = FastAPI()
producer: AIOKafkaProducer | None = None
redis_client: redis.Redis | None = None

DEDUP_TTL_SECONDS = 86400  # 24 hours

@app.on_event("startup")
async def startup():
    global producer, redis_client
    producer = AIOKafkaProducer(
        bootstrap_servers="kafka:9092",
        acks="all",               # Wait for all replicas
        enable_idempotence=True,  # Exactly-once producer
        value_serializer=lambda v: json.dumps(v).encode(),
    )
    await producer.start()
    redis_client = redis.Redis(host="redis", port=6379)

@app.post("/webhooks/{source}")
async def receive_webhook(source: str, request: Request):
    body = await request.body()
    payload = json.loads(body)

    # 1. Validate authentication (HMAC, API key, etc.)
    if not validate_signature(request, body, source):
        raise HTTPException(status_code=401, detail="Invalid signature")

    # 2. Extract or generate idempotency key
    idempotency_key = extract_idempotency_key(payload, request.headers)
    if idempotency_key is None:
        idempotency_key = hashlib.sha256(body).hexdigest()

    # 3. Check for duplicates
    if await redis_client.exists(f"webhook:seen:{idempotency_key}"):
        return {"status": "already_processed"}  # Still return 200

    # 4. Produce to Kafka with entity-based key for ordering
    entity_key = extract_entity_key(payload, source)
    await producer.send_and_wait(
        topic=f"webhooks.{source}",
        key=entity_key.encode() if entity_key else None,
        value={
            "source": source,
            "idempotency_key": idempotency_key,
            "received_at": datetime.now(timezone.utc).isoformat(),
            "payload": payload,
            "headers": dict(request.headers),
        },
    )

    # 5. Mark as seen (with TTL so Redis doesn't grow forever)
    await redis_client.setex(
        f"webhook:seen:{idempotency_key}",
        DEDUP_TTL_SECONDS,
        "1",
    )
    return {"status": "accepted"}
```
A few things to note:
- The endpoint always returns 200 (or a non-retryable 4xx for auth failures). Returning 5xx tells the sender to retry, which you want to avoid if you’ve already persisted the event.
- The idempotency check happens before producing to Kafka, so duplicate sends from the webhook source get caught early.
- `acks="all"` and `enable_idempotence=True` on the Kafka producer ensure the event is durably written to all replicas before we ACK the webhook.
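The `validate_signature` helper is left abstract above. For a GitHub-style scheme — the sender computes HMAC-SHA256 over the raw body with a shared secret and sends it as `sha256=<hexdigest>` — the core check might look like this sketch (looking up the per-source secret and header name is assumed to happen in the caller):

```python
import hashlib
import hmac

def signature_is_valid(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a GitHub-style HMAC-SHA256 signature ("sha256=<hexdigest>").

    The receiver's validate_signature would resolve the per-source
    secret and header, then delegate to a check like this one.
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, resisting timing attacks
    return hmac.compare_digest(expected, signature_header)
```

Always compute the HMAC over the raw request bytes, not the re-serialized JSON — re-serialization can reorder keys or change whitespace and break the signature.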
Deduplication with Idempotency Keys
Duplicate webhook deliveries are not an edge case — they’re expected behavior. Senders retry on timeouts, network blips cause double-delivery, and some platforms explicitly document at-least-once delivery.
Most webhook providers include a unique event identifier:
| Provider | Idempotency Key Header/Field |
|---|---|
| Stripe | `id` field in the event payload |
| GitHub | `X-GitHub-Delivery` header |
| Shopify | `X-Shopify-Webhook-Id` header |
| Twilio | `MessageSid` in payload |
If the sender doesn’t provide one, generate a deterministic key by hashing the payload body. This catches exact-duplicate deliveries, though it won’t catch semantically duplicate events with different timestamps.
For more on deduplication patterns in streaming systems, see Data Deduplication in Streaming Pipelines.
Where to Store Seen Keys
Redis with a TTL is the simplest approach. The TTL should be longer than the sender’s maximum retry window (typically 24-72 hours). At scale, a Bloom filter can reduce memory usage at the cost of a small false-positive rate — acceptable if a rare duplicate getting through is tolerable.
For exactly-once guarantees all the way through, you need idempotency at the consumer level too.
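Consumer-level idempotency means tracking applied keys alongside the results, so a redelivered event becomes a no-op. A minimal in-memory sketch for illustration — a real consumer would keep both in the same transactional store so the check and the write commit atomically:

```python
class IdempotentConsumer:
    """Apply each event at most once, keyed by idempotency_key."""

    def __init__(self):
        self.applied: set[str] = set()
        self.results: list[dict] = []

    def process(self, event: dict) -> bool:
        key = event["idempotency_key"]
        if key in self.applied:
            return False  # duplicate delivery: skip
        self.results.append(event["payload"])  # the actual side effect
        self.applied.add(key)
        return True
```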
Ordering: What You Get and What You Don’t
Webhooks have no inherent ordering guarantee. If a sender fires three events for the same order — order.created, order.updated, order.fulfilled — they might arrive in any sequence depending on network conditions.
Kafka gives you per-partition ordering. By choosing a partition key carefully, you can guarantee that events for the same entity are processed in order:
```python
def extract_entity_key(payload: dict, source: str) -> str:
    """Extract entity ID for Kafka partition key.

    Events with the same key go to the same partition,
    guaranteeing ordering within that entity.
    """
    if source == "stripe":
        # All events for the same customer stay ordered
        return payload.get("data", {}).get("object", {}).get("customer", "")
    elif source == "github":
        # All events for the same repo stay ordered
        return payload.get("repository", {}).get("full_name", "")
    elif source == "shopify":
        # All events for the same order stay ordered
        return str(payload.get("id", ""))
    return ""
```
This means order.created and order.fulfilled for order #1234 will always be in the same partition and processed in order. But events for order #1234 and order #5678 might be in different partitions and processed in parallel.
If you need strict global ordering across all entities, you’d need a single partition — which kills parallelism. In practice, per-entity ordering is almost always sufficient.
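Key-to-partition assignment can be modeled as hash-then-modulo. Kafka's default partitioner uses murmur2; a stable stdlib hash stands in here purely to illustrate the property:

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Illustrative model of key-based partitioning: the same key
    always maps to the same partition; different keys spread out."""
    digest = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return digest % num_partitions
```

This is also why changing the partition count of a topic breaks ordering at the boundary: the modulo changes, so the same key may start landing on a different partition. Size partitions generously up front.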
Retry Handling
There are two sides to retries: the sender retrying delivery to your endpoint, and your consumers retrying failed processing.
Sender-Side Retries
You generally don’t control the sender’s retry behavior, but you can influence it:
- Return 200 quickly (under 5 seconds). Most senders have aggressive timeouts.
- Return 200 even for duplicates. A 200 tells the sender “I got it, stop retrying.”
- Never return 5xx for payload validation failures. Use 4xx. A 5xx triggers retries that will never succeed for a malformed payload.
Consumer-Side Retries
When your Kafka consumer fails to process an event (database down, external API error, bug in transformation logic), you have several options:
```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def process_with_retry(event: dict, max_retries: int = 3):
    """Process an event with exponential backoff.

    After max_retries, route to the dead letter queue.
    """
    for attempt in range(max_retries):
        try:
            await process_event(event)
            return
        except TransientError as e:
            wait_time = min(2 ** attempt, 30)  # 1s, 2s, 4s... max 30s
            logger.warning(
                f"Transient error processing {event['idempotency_key']}, "
                f"attempt {attempt + 1}/{max_retries}, "
                f"retrying in {wait_time}s: {e}"
            )
            await asyncio.sleep(wait_time)
        except PermanentError as e:
            logger.error(f"Permanent error, routing to DLQ: {e}")
            break

    # All retries exhausted or permanent error
    await route_to_dlq(event)
```
The key distinction is between transient errors (network timeout, rate limit, temporary unavailability) and permanent errors (malformed data, missing required fields, schema mismatch). Only retry transient errors.
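The `TransientError`/`PermanentError` split used above can be made explicit by classifying raw exceptions at the processing boundary. An illustrative mapping — the right buckets depend on the libraries you call:

```python
class TransientError(Exception):
    """Retryable: the same call may succeed later."""

class PermanentError(Exception):
    """Not retryable: the event itself can never be processed."""

def classify(exc: Exception) -> Exception:
    """Map a raw exception onto retry semantics."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return TransientError(str(exc))
    if isinstance(exc, (KeyError, ValueError, TypeError)):
        return PermanentError(str(exc))
    # Unknown errors fail safe to the DLQ rather than retrying forever
    return PermanentError(str(exc))
```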
Dead Letter Queues
A dead letter queue (DLQ) is a separate Kafka topic where events that fail processing are sent. This prevents a single bad event from blocking the entire pipeline — a pattern sometimes called “poison pill” handling.
```python
DLQ_TOPIC = "webhooks.dlq"

async def route_to_dlq(event: dict, error: str = ""):
    """Route a failed event to the dead letter queue."""
    dlq_event = {
        "original_event": event,
        "error": str(error),
        "failed_at": datetime.now(timezone.utc).isoformat(),
        "original_topic": event.get("_topic", "unknown"),
    }
    await producer.send_and_wait(
        topic=DLQ_TOPIC,
        key=event.get("idempotency_key", "").encode(),
        value=dlq_event,
    )
```
DLQ events should include enough context to debug and reprocess later: the original event, the error message, the timestamp, and which topic/consumer failed. For a deeper look at DLQ patterns, see Dead Letter Queues in Streaming Systems.
Reprocessing from the DLQ
Build tooling to inspect and replay DLQ events. A simple approach:
- A CLI tool or admin UI that lists DLQ events with filters (source, time range, error type)
- A “replay” command that moves selected events back to their original topic
- Alerting when the DLQ grows beyond a threshold
Scaling the Receiver
The webhook receiver itself should be stateless — all state lives in Redis (dedup) and Kafka (events). This makes horizontal scaling straightforward:
- Run multiple instances behind a load balancer
- Each instance has its own Kafka producer (producers are thread-safe but not process-safe)
- Redis handles dedup coordination across instances automatically
For high-throughput sources (>10K events/second), consider:
- Batching Kafka produces: Use `linger_ms` and `batch_size` to batch multiple webhook events into a single Kafka produce request
- Connection pooling: Reuse Redis and Kafka connections across requests
- Async all the way: The example above uses async Python; avoid blocking calls in the request path
Monitoring
Instrument these metrics at minimum:
| Metric | Why |
|---|---|
| `webhook_received_total` (by source) | Volume tracking, anomaly detection |
| `webhook_duplicates_total` (by source) | Sender retry behavior, dedup effectiveness |
| `webhook_produce_latency_ms` | Kafka health from the producer side |
| `webhook_produce_errors_total` | Kafka availability issues |
| `dlq_events_total` (by source, error type) | Processing failure rate |
| `consumer_lag` (by topic, consumer group) | Processing falling behind receipt |
Consumer lag is especially important. For a thorough treatment, see Understanding Kafka Consumer Lag.
When You Don’t Need to Build This
Building and maintaining a webhook ingestion layer is real operational work: the receiver, the dedup store, the Kafka cluster, the DLQ pipeline, monitoring, and alerting.
If you’re already using a managed streaming platform, webhook-to-Kafka ingestion may be available as a managed connector. For example, Streamkap provides a webhook source connector that handles receipt, dedup, and Kafka production without custom code.
Choosing the Right Approach for Your Team
The build-vs-buy decision depends on your scale and requirements:
- Low volume (<1K events/day), single source: A simple webhook handler writing to your database might be enough. Kafka is overkill.
- Medium volume, multiple sources: The architecture described here pays for itself in reliability and debuggability.
- High volume or strict SLAs: Consider a managed solution that handles the infrastructure and gives you observability out of the box.
Regardless of approach, the principles remain: ACK immediately, buffer durably, process asynchronously, handle duplicates, and route failures to a DLQ. Get these right and webhook ingestion stops being a source of data loss.
Ready to stream webhook data into Kafka without building custom infrastructure? Streamkap’s managed webhook source connector handles ingestion, deduplication, and delivery to Kafka topics with zero custom code. Start a free trial or learn more about webhook-to-Kafka streaming.