
Engineering

February 25, 2026

10 min read

Building a Real-Time Notification Engine with Stream Processing

Learn how to build a real-time notification engine using Apache Flink. Covers event-driven notifications, deduplication, rate limiting, multi-channel delivery, and user preference filtering.

TL;DR:

  • A notification engine built on Flink can react to events in milliseconds, deduplicating and rate-limiting before delivery rather than after.
  • Keyed state in Flink tracks per-user notification history, enabling intelligent suppression without external caches.
  • Multi-channel routing (email, push, SMS, in-app) is handled by splitting the stream based on user preferences stored in Flink state.
  • Streamkap feeds the event stream into Kafka via CDC, so database changes trigger notifications automatically without polling.

Every modern application sends notifications. Order confirmations, fraud alerts, price drop warnings, system health alarms, onboarding nudges. Most teams start by triggering notifications from application code: a user does something, the backend fires off an email or push notification inline. That works at small scale. It stops working when you need to coordinate across multiple event sources, deduplicate redundant alerts, respect per-user rate limits, and deliver to multiple channels based on user preferences.

At that point, you need a notification engine. Not a simple queue-and-send system, but something that can process a high-volume event stream, apply rules in real time, and route the right notification to the right user on the right channel at the right time.

Apache Flink is a strong fit for this because the problem is fundamentally a stream processing problem. Events flow in continuously. Each event may or may not trigger a notification depending on the user’s history, preferences, and how recently they were last notified. All of that state needs to be maintained and queried with every event. That is what Flink does well.

The Architecture of a Notification Engine

A notification engine sits between your event sources and your delivery channels. Its job is to answer a series of questions for every incoming event:

  1. Does this event warrant a notification?
  2. Has the user already been notified about this (deduplication)?
  3. Has the user been notified too recently (rate limiting)?
  4. Which channels should this notification go to (user preferences)?
  5. What is the content and priority of the notification?

In a Flink-based architecture, each of these questions maps to a processing step, and the state required to answer them lives in Flink’s managed keyed state.
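Stripped of Flink's operator machinery, the per-event decision flow can be sketched in plain Java. The class and method names here are illustrative only; in the real job each step is a separate keyed operator backed by managed state, and question 5 (content and priority) is handled later by templating:

```java
import java.util.*;

// Illustrative sketch of the per-event decision flow. In the Flink job each
// step is a separate keyed operator whose state lives in managed keyed state.
class NotificationDecision {
    static final int MAX_PER_WINDOW = 3;

    Set<String> seenEventIds = new HashSet<>();                 // deduplication state
    Map<String, Integer> countByUser = new HashMap<>();         // rate-limit state
    Map<String, List<String>> channelsByUser = new HashMap<>(); // preference state

    /** Returns the channels to deliver on, or an empty list if suppressed. */
    List<String> decide(String userId, String eventId, boolean warrantsNotification) {
        if (!warrantsNotification) return List.of();                   // question 1
        if (!seenEventIds.add(eventId)) return List.of();              // question 2: duplicate
        int count = countByUser.merge(userId, 1, Integer::sum);
        if (count > MAX_PER_WINDOW) return List.of();                  // question 3: rate-limited
        return channelsByUser.getOrDefault(userId, List.of("in_app")); // question 4
    }
}
```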

The data flow looks like this:

Event sources (application databases, user activity streams, external systems) produce events to Kafka topics. Streamkap handles the CDC side of this, capturing row-level changes from databases like PostgreSQL and MySQL and delivering them to Kafka. This means a new order row, a changed account status, or an updated subscription record automatically becomes an event in the notification pipeline without any application-level instrumentation.

Flink consumes these events, applies the notification logic, and produces notification requests to output Kafka topics, one per delivery channel.

Channel consumers read from those output topics and handle the actual delivery: calling the email API, sending the push notification through Firebase or APNs, dispatching the SMS via Twilio, or writing to the in-app notification store.

This separation is important. Flink handles the logic. Delivery is handled by dedicated consumers that can retry, batch, and manage connections to external services independently.

Event-Driven Notification Triggers

The first step is defining which events trigger notifications. In a simple case, every event of a certain type produces a notification. A new order always generates an order confirmation. An account lockout always generates a security alert.

In practice, many notifications depend on conditions. A price drop notification should only fire if the new price is below the user’s configured threshold. A delivery update should only fire for certain status transitions (shipped, out for delivery, delivered), not every internal status change.

In Flink, you express this with a ProcessFunction that filters and transforms:

public class NotificationTrigger
    extends KeyedProcessFunction<String, Event, NotificationRequest> {

    @Override
    public void processElement(Event event, Context ctx,
                               Collector<NotificationRequest> out) {
        // Unconditional trigger: every new order produces a confirmation
        if (event.type.equals("ORDER_CREATED")) {
            out.collect(NotificationRequest.builder()
                .userId(event.userId)
                .notificationType("order_confirmation")
                .eventId(event.eventId)
                .payload(event.data)
                .priority(Priority.HIGH)
                .build());

        // Conditional trigger: only fire when the new price crosses
        // the user's configured threshold
        } else if (event.type.equals("PRICE_CHANGE")) {
            double newPrice = event.data.getDouble("price");
            double threshold = event.data.getDouble("userThreshold");
            if (newPrice <= threshold) {
                out.collect(NotificationRequest.builder()
                    .userId(event.userId)
                    .notificationType("price_alert")
                    .eventId(event.eventId)
                    .payload(event.data)
                    .priority(Priority.NORMAL)
                    .build());
            }
        }
    }
}

The stream is keyed by userId, which means all events for a given user are processed by the same Flink task instance. This is what makes the downstream deduplication and rate limiting possible, since the state for each user is local to that task.

Deduplication

Duplicate events are a fact of life in distributed systems. A database change might be captured twice if a CDC connector restarts. An application might emit the same event multiple times due to retries. Without deduplication, the user receives the same notification repeatedly.

Flink handles this with a stateful deduplication operator. The idea is simple: maintain a set of recently seen event IDs in keyed state. If an incoming event’s ID is already in the set, drop it. If not, add it and pass the event through.

public class Deduplicator
    extends KeyedProcessFunction<String, NotificationRequest, NotificationRequest> {

    private MapState<String, Boolean> seenEvents;

    @Override
    public void open(Configuration parameters) {
        MapStateDescriptor<String, Boolean> descriptor =
            new MapStateDescriptor<>("seen-events", String.class, Boolean.class);

        StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.hours(24))
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .cleanupFullSnapshot()
            .build();
        descriptor.enableTimeToLive(ttlConfig);

        seenEvents = getRuntimeContext().getMapState(descriptor);
    }

    @Override
    public void processElement(NotificationRequest request, Context ctx,
                               Collector<NotificationRequest> out) throws Exception {
        if (seenEvents.contains(request.eventId)) {
            return; // duplicate, drop it
        }
        seenEvents.put(request.eventId, true);
        out.collect(request);
    }
}

The TTL configuration is important. Without it, the state would grow indefinitely as new event IDs accumulate. A 24-hour TTL means events are deduplicated within a day. After that, the entry expires and frees the memory. Choose the TTL based on how long duplicates might realistically arrive after the original event.
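The expiry semantics are easy to prototype outside Flink. A hand-rolled version of the same idea (class name illustrative) tracks an insertion timestamp per event ID and treats expired entries as unseen, which is effectively what StateTtlConfig does against keyed state:

```java
import java.util.HashMap;
import java.util.Map;

// Hand-rolled deduplication with TTL; Flink's StateTtlConfig performs the
// equivalent expiry against keyed state automatically.
class TtlDeduplicator {
    private final long ttlMillis;
    private final Map<String, Long> firstSeenAt = new HashMap<>();

    TtlDeduplicator(long ttlMillis) { this.ttlMillis = ttlMillis; }

    /** Returns true if the event is new, or its previous entry has expired. */
    boolean accept(String eventId, long nowMillis) {
        Long seen = firstSeenAt.get(eventId);
        if (seen != null && nowMillis - seen < ttlMillis) {
            return false; // still inside the TTL window: duplicate, drop it
        }
        firstSeenAt.put(eventId, nowMillis); // new entry, or refreshed after expiry
        return true;
    }
}
```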

Rate Limiting

Even after deduplication, you might generate too many notifications for a single user in a short period. If a user has 15 items on a watchlist and prices change rapidly, they could receive 15 price alert notifications in a minute. That is a fast path to the user disabling notifications entirely.

Rate limiting in Flink works by tracking the notification history per user per notification type. Here is an approach using a counter and a timer:

public class RateLimiter
    extends KeyedProcessFunction<String, NotificationRequest, NotificationRequest> {

    private ValueState<Integer> notificationCount;
    private ValueState<Long> windowStart;

    private static final int MAX_PER_WINDOW = 3;
    private static final long WINDOW_DURATION = 60 * 60 * 1000L; // 1 hour

    @Override
    public void open(Configuration parameters) {
        notificationCount = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count", Integer.class, 0));
        windowStart = getRuntimeContext().getState(
            new ValueStateDescriptor<>("windowStart", Long.class, 0L));
    }

    @Override
    public void processElement(NotificationRequest request, Context ctx,
                               Collector<NotificationRequest> out) throws Exception {
        // ctx.timestamp() is null unless event-time timestamps have been
        // assigned, so fall back to processing time to avoid an NPE
        Long eventTime = ctx.timestamp();
        long now = (eventTime != null)
            ? eventTime
            : ctx.timerService().currentProcessingTime();
        long start = windowStart.value();

        // Reset the fixed window once it has expired
        if (now - start >= WINDOW_DURATION) {
            notificationCount.update(0);
            windowStart.update(now);
        }

        int count = notificationCount.value();
        if (count < MAX_PER_WINDOW) {
            notificationCount.update(count + 1);
            out.collect(request);
        }
        // else: suppressed
    }
}

This limits each user to 3 notifications per hour for each notification type, provided the stream is re-keyed by the composite of user ID and notification type before this operator (keying by user ID alone would give one shared counter per user). Suppressed notifications are simply dropped. If you want to track what was suppressed for analytics or debugging, emit the suppressed events to a side output:

static final OutputTag<NotificationRequest> SUPPRESSED =
    new OutputTag<>("suppressed") {};

// Inside processElement, when suppressing:
ctx.output(SUPPRESSED, request);

You can then write the suppressed stream to a separate Kafka topic or logging system.

The rate limits themselves can be more sophisticated than a simple counter. You might use different limits for different notification types (security alerts get a higher limit than marketing nudges) or different limits for different times of day (stricter limits during sleeping hours). All of this is just conditional logic in the process function, backed by Flink’s state.
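As a sketch of that conditional logic, per-type limits can be a simple lookup the process function consults before the counter check. The type names and limit values here are hypothetical, not from the pipeline above:

```java
import java.util.Map;

// Hypothetical per-type hourly limits; in the Flink job this lookup would
// replace the single MAX_PER_WINDOW constant in the RateLimiter.
class RateLimits {
    private static final Map<String, Integer> HOURLY_LIMITS = Map.of(
        "security_alert", 10,  // urgent: let more through
        "price_alert", 3,
        "marketing_nudge", 1); // low priority: strictest limit

    private static final int DEFAULT_LIMIT = 3;

    static int limitFor(String notificationType) {
        return HOURLY_LIMITS.getOrDefault(notificationType, DEFAULT_LIMIT);
    }
}
```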

User Preference Filtering

Users expect control over what notifications they receive and how they receive them. Some users want email only. Others want push notifications for urgent alerts and a daily email digest for everything else. Some have opted out of SMS entirely.

The standard approach is to store user preferences in your application database and look them up when sending a notification. In a Flink pipeline, you can do this more efficiently by streaming the preferences into Flink as a second input.

Streamkap captures changes to the user preferences table via CDC and delivers them to a Kafka topic. Flink consumes this as a broadcast stream and joins it with the notification stream:

// Preference updates from CDC via Streamkap
DataStream<UserPreferences> prefStream = env.fromSource(
    prefSource, WatermarkStrategy.noWatermarks(), "Preferences");

BroadcastStream<UserPreferences> broadcastPrefs =
    prefStream.broadcast(new MapStateDescriptor<>(
        "user-prefs",
        String.class,
        UserPreferences.class));

DataStream<ChannelNotification> routed = notificationStream
    .connect(broadcastPrefs)
    .process(new PreferenceRouter());

The PreferenceRouter stores preferences in broadcast state and uses them to route each notification:

public class PreferenceRouter
    extends BroadcastProcessFunction<NotificationRequest, UserPreferences, ChannelNotification> {

    private final MapStateDescriptor<String, UserPreferences> prefsDescriptor =
        new MapStateDescriptor<>("user-prefs", String.class, UserPreferences.class);

    @Override
    public void processBroadcastElement(UserPreferences prefs, Context ctx,
                                        Collector<ChannelNotification> out) throws Exception {
        ctx.getBroadcastState(prefsDescriptor).put(prefs.userId, prefs);
    }

    @Override
    public void processElement(NotificationRequest request, ReadOnlyContext ctx,
                               Collector<ChannelNotification> out) throws Exception {
        UserPreferences prefs = ctx.getBroadcastState(prefsDescriptor).get(request.userId);

        if (prefs == null) {
            // Default: send to in-app only
            out.collect(new ChannelNotification(request, Channel.IN_APP));
            return;
        }

        for (Channel channel : prefs.enabledChannels(request.notificationType)) {
            out.collect(new ChannelNotification(request, channel));
        }
    }
}

When a user updates their preferences in the application, the change flows through the database, through CDC (via Streamkap), through Kafka, and into Flink’s broadcast state. The next notification for that user will reflect the updated preferences. No restart required, no cache invalidation, no polling loop.

Multi-Channel Delivery

After preference routing, each ChannelNotification carries both the notification content and the target channel. The final step in the Flink job is to write these to per-channel Kafka topics:

routed
    .filter(cn -> cn.channel == Channel.EMAIL)
    .sinkTo(emailKafkaSink);

routed
    .filter(cn -> cn.channel == Channel.PUSH)
    .sinkTo(pushKafkaSink);

routed
    .filter(cn -> cn.channel == Channel.SMS)
    .sinkTo(smsKafkaSink);

routed
    .filter(cn -> cn.channel == Channel.IN_APP)
    .sinkTo(inAppKafkaSink);

Each Kafka topic has a dedicated consumer service that handles delivery for that channel. The email consumer calls SendGrid or SES. The push consumer calls Firebase Cloud Messaging. The SMS consumer calls Twilio. The in-app consumer writes to a database or cache that the application frontend reads.

This separation has a practical benefit: if one channel goes down, the others continue operating. If the SMS provider has an outage, the SMS Kafka consumer falls behind, but the Flink job keeps running and email, push, and in-app notifications all continue to flow. When the SMS provider recovers, the consumer catches up and delivers the queued messages.

It also lets you tune each channel independently. Email delivery might batch messages for throughput. Push notifications need low latency. SMS has strict rate limits imposed by carriers. Each consumer handles these concerns without affecting the others or the Flink job.
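For instance, the batching an email consumer might apply reduces to a small buffer that flushes in groups. This is an illustrative sketch, not a real consumer: buffer size is arbitrary, a production version would also flush on a timer, and the flush would call the email provider's API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative batching buffer for an email consumer: accumulate messages and
// flush in groups, trading a little latency for API throughput.
class EmailBatcher {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> flushed = new ArrayList<>(); // stand-in for the send call

    EmailBatcher(int batchSize) { this.batchSize = batchSize; }

    void add(String message) {
        buffer.add(message);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() {
        if (buffer.isEmpty()) return;
        flushed.add(new ArrayList<>(buffer)); // here: hand the batch to the email API
        buffer.clear();
    }

    int batchesSent() { return flushed.size(); }
}
```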

Content Templating

The Flink job produces structured notification requests with a type, a priority, and a payload of key-value data. The actual rendering of the notification (subject lines, body text, HTML templates) can happen either in Flink or in the channel consumers.

For simpler setups, rendering in the channel consumer works fine. The consumer receives a notification type and payload, looks up the template, fills in the variables, and sends it.
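The fill step is trivial; a minimal placeholder-substitution renderer (the `{{key}}` template syntax is an assumption for illustration, not something the pipeline prescribes) looks like:

```java
import java.util.Map;

// Minimal template fill: replaces {{key}} placeholders with payload values.
// The {{key}} syntax is an illustrative choice, not a prescribed format.
class TemplateFiller {
    static String render(String template, Map<String, String> payload) {
        String result = template;
        for (Map.Entry<String, String> entry : payload.entrySet()) {
            result = result.replace("{{" + entry.getKey() + "}}", entry.getValue());
        }
        return result;
    }
}
```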

For more complex setups where you want centralized control over content, you can add a templating step in Flink before the channel split:

public class ContentRenderer
    extends ProcessFunction<NotificationRequest, RenderedNotification> {

    @Override
    public void processElement(NotificationRequest request, Context ctx,
                               Collector<RenderedNotification> out) {
        Template template = TemplateRegistry.get(
            request.notificationType, request.priority);

        String subject = template.renderSubject(request.payload);
        String body = template.renderBody(request.payload);

        out.collect(new RenderedNotification(
            request.userId, request.notificationType,
            subject, body, request.priority, request.eventId));
    }
}

The TemplateRegistry can load templates from a file bundled with the Flink job, or from a broadcast stream if you want to update templates without redeploying.

Monitoring and Operational Considerations

A notification engine has unique monitoring needs because failures are directly visible to users. A dropped notification means a user does not get an alert they expected. A duplicate means annoyance and potential trust issues.

Key metrics to track:

  • Notifications generated per second by type and channel. A sudden drop could mean the event source stopped producing or the trigger logic has a bug.
  • Suppression rate. What percentage of notifications are being deduplicated or rate-limited? If the suppression rate is very high, either the event source is producing too many duplicates or the rate limits are too aggressive.
  • End-to-end latency. Measure from event timestamp to delivery confirmation. For urgent notifications like security alerts, you should target under 5 seconds end-to-end.
  • Dead letter queue size. Events that fail validation or cannot be processed should go to a dead letter topic. If this topic is growing, investigate immediately.
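The suppression-rate check reduces to simple arithmetic over two counters; a sketch, with the alert threshold as an illustrative choice:

```java
// Suppression rate from two counters; alert when an unusually high share of
// candidate notifications is being dropped by dedup or rate limiting.
class SuppressionMonitor {
    static double suppressionRate(long generated, long suppressed) {
        long total = generated + suppressed;
        return total == 0 ? 0.0 : (double) suppressed / total;
    }

    static boolean shouldAlert(long generated, long suppressed, double threshold) {
        return suppressionRate(generated, suppressed) > threshold;
    }
}
```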

Streamkap’s managed Flink service handles the operational side of running the Flink cluster: checkpointing, scaling, and recovery. Your monitoring focus can be on the application-level metrics above rather than infrastructure metrics like JVM heap usage or TaskManager restarts.

Scaling the Engine

The notification engine scales along two axes: event throughput and user count.

Event throughput scales with Kafka partitions and Flink parallelism. If you partition the notification topic by user ID, Flink distributes users across parallel tasks automatically. Adding more partitions and increasing parallelism is a configuration change, not a code change.

User count affects the size of Flink’s keyed state (deduplication sets, rate limit counters, preferences). Flink stores this state on disk using RocksDB when it exceeds available memory, so state size scales well beyond what fits in RAM. The primary concern is checkpoint size: more state means larger checkpoints and longer recovery times. Incremental checkpoints, which Streamkap’s managed Flink supports, mitigate this by only persisting the state that changed since the last checkpoint.

For a notification engine handling millions of users and thousands of events per second, a Flink job with moderate parallelism (16 to 32 tasks) is typically sufficient. The processing logic per event is lightweight compared to the volume of data.

Why Stream Processing Fits This Problem

You could build a notification engine with a traditional queue-and-worker architecture. Events go into a queue, workers pull events and check a database for deduplication and preferences, apply rate limiting with Redis, and dispatch to channels. Many teams do exactly this.

The stream processing approach has advantages as the system grows. All the state that workers would query from external stores (deduplication history, rate limit counters, user preferences) lives in the processing layer itself. There are no network round-trips to Redis or the database for every event. State access is local and fast.

Flink’s exactly-once guarantees also mean you do not need to build idempotency into every channel consumer. If Flink says a notification should be sent, it has already been deduplicated and rate-limited correctly, even across restarts and failures.

With Streamkap feeding the event stream via CDC and managing the Flink infrastructure, the operational surface area is small. You define the notification logic, deploy it, and monitor the business metrics. The plumbing takes care of itself.