
Engineering

February 25, 2026

11 min read

Anomaly Detection in Streaming Data with Flink

How to detect anomalies in real-time data streams using Apache Flink. Covers statistical methods, windowed baselines, z-score detection, and integration with ML models.

TL;DR:

  • Anomaly detection in streaming requires maintaining running statistics (mean, standard deviation) without storing all historical data.
  • Simple z-score detection over sliding windows catches most anomalies with minimal complexity.
  • Flink's windowed aggregations and stateful operators provide the building blocks for both statistical and ML-based anomaly detection.
  • CDC can detect database-level anomalies like unusual write patterns, sudden schema changes, or unexpected bulk operations.

Most anomaly detection systems run in batch. They collect a day’s worth of data, compute statistics overnight, and surface alerts the next morning. By then, the damage is done. A sudden spike in failed transactions at 2 PM does not help anyone when the alert fires at 6 AM the following day.

Streaming anomaly detection flips this model. Instead of looking backward over stored datasets, you evaluate every data point as it arrives. The challenge is doing this without storing all the history, without blowing up memory, and without generating so many false positives that your on-call engineers start ignoring alerts.

Apache Flink is built for exactly this kind of work. Its stateful stream processing, flexible windowing, and exactly-once guarantees make it a natural fit for anomaly detection pipelines that need to run continuously, at scale, and with low latency.

Why Batch Anomaly Detection Falls Short

Batch systems process bounded datasets. You run a job over yesterday’s logs, compute the average error rate, and flag anything that exceeded three standard deviations. This works for post-hoc analysis but breaks down when you need timely responses.

The problems with batch anomaly detection are well-known:

  • Delayed alerts. By the time the batch job completes, the anomaly may have caused hours of degraded service.
  • Stale baselines. If your traffic pattern shifts at noon and the batch job uses midnight-to-midnight windows, the first half of the day skews the baseline.
  • No incremental state. Each batch run recomputes statistics from scratch. For large datasets, this is expensive and slow.

Streaming anomaly detection addresses all three. Alerts fire within seconds. Baselines update continuously. State is maintained incrementally, so each new data point costs constant time and memory.

Statistical Methods for Streaming Anomaly Detection

Before reaching for a machine learning model, consider whether a statistical method will do the job. For single-metric monitoring (latency, error rate, request throughput, queue depth), statistical approaches are often more interpretable, easier to debug, and faster to deploy.

Z-Score Detection

The z-score measures how many standard deviations a data point is from the mean:

z = (x - mean) / stddev

If |z| > 3, the data point is more than three standard deviations from the mean, which happens less than 0.3% of the time under a normal distribution. That is a reasonable starting threshold for flagging anomalies.

In a streaming context, you compute the mean and standard deviation over a sliding window (the last 1,000 events, or the last hour of data) and apply the z-score formula to each incoming event. This is the simplest method that actually works in production, and it handles most common anomaly patterns.
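To make the mechanics concrete, here is a minimal, Flink-agnostic Python sketch of sliding-window z-score detection. The class name and parameters are illustrative, not part of any Flink API:

```python
from collections import deque
from statistics import fmean, pstdev

class SlidingZScoreDetector:
    """Flags values more than `threshold` standard deviations from the
    mean of the last `window_size` observations."""

    def __init__(self, window_size=1000, threshold=3.0, min_samples=30):
        self.window = deque(maxlen=window_size)  # evicts oldest automatically
        self.threshold = threshold
        self.min_samples = min_samples  # small windows give unstable baselines

    def observe(self, x):
        """Returns the z-score if `x` is anomalous, else None."""
        z = None
        if len(self.window) >= self.min_samples:
            mu = fmean(self.window)
            sigma = pstdev(self.window)
            if sigma > 0:
                score = (x - mu) / sigma
                if abs(score) > self.threshold:
                    z = score
        self.window.append(x)  # fold the value into the baseline after scoring
        return z
```

Scoring before updating the window keeps an extreme value from inflating the very baseline it is judged against. In a real Flink job the same logic lives in a keyed operator, with the deque replaced by managed state.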

Interquartile Range (IQR)

The IQR method is more tolerant of skewed distributions. You maintain the 25th percentile (Q1) and 75th percentile (Q3) over a sliding window, then flag any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.

The downside in streaming is that exact percentile computation over a window requires sorting, which is expensive. Approximate quantile algorithms like t-digest or KLL sketches make this practical. Flink’s DataStream API supports custom aggregation functions where you can plug in these sketches.
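For comparison with the z-score approach, here is the same sliding-window pattern with IQR fences. This sketch computes exact quartiles over a small window for clarity; a production stream would substitute a t-digest or KLL sketch as noted above. Class and parameter names are illustrative:

```python
from collections import deque
import statistics

class SlidingIQRDetector:
    """Flags values outside [Q1 - k*IQR, Q3 + k*IQR] computed over the
    last `window_size` observations."""

    def __init__(self, window_size=500, k=1.5, min_samples=30):
        self.window = deque(maxlen=window_size)
        self.k = k
        self.min_samples = min_samples

    def observe(self, x):
        """Returns True if `x` falls outside the IQR fences."""
        flagged = False
        if len(self.window) >= self.min_samples:
            # Exact quartiles; swap in an approximate sketch at scale.
            q1, _, q3 = statistics.quantiles(self.window, n=4)
            iqr = q3 - q1
            flagged = x < q1 - self.k * iqr or x > q3 + self.k * iqr
        self.window.append(x)
        return flagged
```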

Exponentially Weighted Moving Average (EWMA)

EWMA gives more weight to recent data points and less to older ones, which makes it naturally adaptive to trends. The formula is:

ewma_new = alpha * x + (1 - alpha) * ewma_old

Where alpha is a smoothing factor between 0 and 1. A higher alpha makes the average more responsive to recent changes. You can maintain an EWMA for both the mean and variance, then apply z-score logic against them.

EWMA is attractive for streaming because it requires only a few bytes of state: the current EWMA value and the current EWMA variance. No windows, no stored events. The tradeoff is that you have less control over the exact lookback period, and tuning alpha requires experimentation.
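Here is a minimal sketch of EWMA-based detection; alpha, the warm-up count, and the class name are illustrative choices:

```python
import math

class EWMADetector:
    """Tracks an exponentially weighted mean and variance, then applies
    z-score logic against them. State is three numbers; no events are
    stored."""

    def __init__(self, alpha=0.05, threshold=3.0, warmup=30):
        self.alpha = alpha
        self.threshold = threshold
        self.warmup = warmup  # observations before scoring begins
        self.count = 0
        self.mean = 0.0
        self.var = 0.0

    def observe(self, x):
        """Returns the z-score if `x` is anomalous, else None."""
        z = None
        if self.count >= self.warmup and self.var > 0:
            score = (x - self.mean) / math.sqrt(self.var)
            if abs(score) > self.threshold:
                z = score
        # Update after scoring so an outlier cannot mask itself.
        if self.count == 0:
            self.mean = x
        else:
            diff = x - self.mean
            self.mean += self.alpha * diff
            self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        self.count += 1
        return z
```

The variance recursion is the exponentially weighted analogue of Welford's update: each step decays the old variance and mixes in the squared deviation of the new point.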

Flink’s windowing system provides the structure for maintaining rolling baselines. Here is how the pieces fit together.

Sliding Windows for Rolling Baselines

A sliding window of size 1 hour with a slide of 1 minute gives you a baseline that updates every minute, based on the last hour of data. In Flink SQL:

SELECT
  sensor_id,
  AVG(temperature) AS mean_temp,
  STDDEV_POP(temperature) AS stddev_temp,
  COUNT(*) AS sample_count
FROM sensor_readings
GROUP BY
  sensor_id,
  HOP(event_time, INTERVAL '1' MINUTE, INTERVAL '1' HOUR)

This gives you per-sensor baselines that update every minute. You can then join incoming events against the latest baseline to compute z-scores:

SELECT
  r.sensor_id,
  r.temperature,
  b.mean_temp,
  b.stddev_temp,
  (r.temperature - b.mean_temp) / NULLIF(b.stddev_temp, 0) AS z_score
FROM sensor_readings r
JOIN sensor_baselines b
  ON r.sensor_id = b.sensor_id
WHERE ABS((r.temperature - b.mean_temp) / NULLIF(b.stddev_temp, 0)) > 3.0

Session Windows for Burst Detection

Sometimes anomalies are not individual outliers but clusters of unusual events. A sudden burst of 500 errors within 10 seconds is an anomaly, even if each individual error is unremarkable.

Flink’s session windows group events that arrive close together, with a configurable gap. If no event arrives within the gap duration, the window closes. This naturally detects bursts:

SELECT
  service_name,
  COUNT(*) AS error_count,
  SESSION_START(event_time, INTERVAL '10' SECOND) AS burst_start,
  SESSION_END(event_time, INTERVAL '10' SECOND) AS burst_end
FROM error_events
GROUP BY
  service_name,
  SESSION(event_time, INTERVAL '10' SECOND)
HAVING COUNT(*) > 50

This fires an alert whenever more than 50 errors cluster within a 10-second session window for any service.

Welford’s Algorithm for Incremental Statistics

If you are building a custom Flink operator for anomaly detection, Welford’s online algorithm is the right way to maintain running mean and variance without storing all data points. The algorithm updates three values with each new observation:

// State: count, mean, m2 (all initialized to zero)
count += 1;
double delta = newValue - mean;
mean += delta / count;
double delta2 = newValue - mean;
m2 += delta * delta2;

// Population variance and standard deviation
// (use m2 / (count - 1) for the sample variance)
double variance = m2 / count;
double stddev = Math.sqrt(variance);

This is numerically stable (unlike the naive formula of tracking sum and sum-of-squares separately) and uses O(1) memory regardless of how many data points you have processed.

In Flink, you would implement this as a KeyedProcessFunction or a custom AggregateFunction. The state (count, mean, m2) is stored in Flink’s managed state, so it survives checkpoints and restarts. You can also combine Welford’s algorithm with a decay factor to create an exponentially weighted variant that forgets old data gradually.

Sliding Welford’s

For a strict sliding window (e.g., only the last N data points), you need to both add new values and remove expired ones. This requires storing the values in the window, which costs O(N) memory but still gives you O(1) per-event updates. Flink’s ListState can hold the values in the window, and you subtract the expired value’s contribution using the inverse of Welford’s update.
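A minimal Python sketch of the add-and-retract bookkeeping, with a plain deque standing in for Flink's ListState (names illustrative):

```python
from collections import deque
import math

class SlidingWelford:
    """Welford's running mean/variance over a strict window of the last
    `window_size` values. Expired values are retracted with the inverse
    of Welford's update, so each event costs O(1) work; the stored
    window itself costs O(N) memory."""

    def __init__(self, window_size):
        self.values = deque()
        self.window_size = window_size
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0

    def add(self, x):
        self.values.append(x)
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        if self.count > self.window_size:
            self._remove(self.values.popleft())

    def _remove(self, x):
        # Invert Welford's update: recover the mean without x, then
        # subtract x's contribution to m2.
        self.count -= 1
        old_mean = (self.mean * (self.count + 1) - x) / self.count
        self.m2 -= (x - old_mean) * (x - self.mean)
        self.mean = old_mean

    def stddev(self):
        return math.sqrt(max(self.m2, 0.0) / self.count)
```

The retraction works because mean and m2 are symmetric functions of the observations, so any element can be removed, not just the most recently inserted one.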

Integrating ML Models for Complex Anomalies

Statistical methods work well for univariate metrics with roughly stable distributions. But some anomalies are multivariate: a combination of slightly elevated latency, slightly increased error rate, and slightly reduced throughput that individually look fine but together indicate a problem.

For these cases, ML models can capture correlations between features that statistical methods miss.

Online vs. Offline Model Training

You have two options for ML-based anomaly detection in Flink:

Offline training, online inference. Train a model (isolation forest, autoencoder, or one-class SVM) on historical data in a batch environment. Export the model, load it into a Flink operator, and run inference on each incoming event. This is simpler to implement and easier to validate.

Online training and inference. Update the model with each new data point inside Flink. This is harder to get right but adapts to distribution shifts without manual retraining. Libraries like Apache Flink ML provide building blocks for this, though the ecosystem is still maturing.

For most teams, offline training with periodic retraining (daily or weekly) is the practical choice. Load the model into a KeyedProcessFunction, score each event, and emit an alert when the anomaly score exceeds your threshold.

Feature Engineering in the Stream

ML models need features, and computing features in a streaming context means using Flink’s windowed aggregations as inputs. For example, your feature vector for a web service might include:

  • Mean latency over the last 5 minutes
  • 99th percentile latency over the last 5 minutes
  • Error rate over the last 5 minutes
  • Request count over the last 5 minutes
  • Ratio of current request count to the same 5-minute window yesterday

Each of these is a windowed aggregation in Flink. You compute them in parallel, join them by key and timestamp, and feed the combined vector to the model.
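As a sketch of the assembly step, here is a hypothetical function that turns raw 5-minute window contents into that feature vector; in Flink each field would come from its own windowed aggregation rather than raw lists, and the naive sorted-index p99 would be an approximate quantile:

```python
import statistics

def feature_vector(latencies, errors, requests, requests_yesterday):
    """Builds a per-window feature vector for an ML anomaly model.
    Inputs are the raw values observed in the current 5-minute window,
    plus the request count from the same window yesterday."""
    lat = sorted(latencies)
    p99 = lat[min(len(lat) - 1, int(0.99 * len(lat)))]
    return {
        "mean_latency": statistics.fmean(latencies),
        "p99_latency": p99,
        "error_rate": errors / max(requests, 1),
        "request_count": requests,
        "vs_yesterday": requests / max(requests_yesterday, 1),
    }
```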

Anomaly Detection in CDC Streams

Change data capture (CDC) streams carry every insert, update, and delete from a source database. This makes them a rich signal for detecting database-level anomalies that application metrics might miss.

Here are patterns worth monitoring:

  • Write volume spikes. A sudden 10x increase in inserts to a table usually means either a legitimate bulk load or a runaway process. Either way, someone should know.
  • Unexpected deletes. Bulk deletes on tables that rarely see them can indicate accidental data loss or a misconfigured cleanup job.
  • Schema changes. DDL events in the CDC stream (new columns, dropped tables) can break downstream consumers. Detecting these in real time lets you react before dashboards start showing nulls.
  • Write pattern shifts. If a table that normally receives 100 writes per minute suddenly drops to zero, the source application may have crashed silently.

In Flink, you can monitor these by computing windowed aggregations over the CDC event stream, keyed by table name and operation type. A Streamkap pipeline feeding CDC events from PostgreSQL, MySQL, or MongoDB into Flink gives you this visibility without instrumenting the source application. You are monitoring the database directly, which catches issues that application-level metrics cannot see.

SELECT
  table_name,
  operation_type,
  COUNT(*) AS op_count,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start
FROM cdc_events
GROUP BY
  table_name,
  operation_type,
  TUMBLE(event_time, INTERVAL '1' MINUTE)

Compare each window’s count against the baseline for that table and operation type. A z-score greater than 3 on the per-minute delete count for your users table is worth investigating immediately.
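One way to sketch that comparison is per-key Welford state, so each (table, operation) pair learns its own baseline from the stream of per-minute counts (all names illustrative):

```python
import math
from collections import defaultdict

class KeyedCountBaseline:
    """Per-(table, operation) baselines for windowed operation counts.
    `check` scores the latest window's count against that key's history,
    then folds it into the baseline."""

    def __init__(self, min_windows=30):
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])  # count, mean, m2
        self.min_windows = min_windows

    def check(self, table, op, window_count):
        """Returns the z-score for this window, or None while warming up."""
        n, mean, m2 = self.stats[(table, op)]
        z = None
        if n >= self.min_windows:
            stddev = math.sqrt(m2 / n)
            if stddev > 0:
                z = (window_count - mean) / stddev
        # Welford update on the baseline for this key.
        n += 1
        delta = window_count - mean
        mean += delta / n
        m2 += delta * (window_count - mean)
        self.stats[(table, op)] = [n, mean, m2]
        return z
```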

Building the Alerting Layer

Detection without alerting is just logging. Once your Flink job identifies an anomaly, you need to route it to the right people with enough context to act.

Alert Enrichment

A raw anomaly event (“sensor_42 temperature z-score = 4.7”) is not actionable by itself. Before sending the alert, enrich it with context:

  • The current value and the baseline (mean and standard deviation)
  • How long the anomaly has persisted (is this a single spike or sustained?)
  • Related metrics (did anything else change at the same time?)
  • A link to the relevant dashboard or query

You can do this enrichment in Flink by joining the anomaly stream with reference data (sensor metadata, service ownership, runbook URLs) stored in a lookup table or broadcast state.

Deduplication and Suppression

If a metric stays anomalous for 30 minutes, you do not want 30 alerts. Use Flink’s KeyedProcessFunction with a timer to suppress duplicate alerts. When the first anomaly fires, start a suppression timer (say, 15 minutes). During suppression, count additional anomalous events but do not alert. When the timer fires, send a summary: “sensor_42 has been anomalous for 15 minutes with 47 anomalous readings.”
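The suppression logic itself is a small per-key state machine. This sketch passes timestamps in explicitly and flushes the summary on the first anomaly after expiry; in Flink the state and the expiry would live in a KeyedProcessFunction with a registered timer (names illustrative):

```python
class AlertSuppressor:
    """Per-key alert suppression: the first anomaly fires immediately
    and opens a suppression window; further anomalies during the window
    are only counted; the first anomaly after the window expires
    flushes a summary."""

    def __init__(self, suppress_seconds=900):
        self.suppress_seconds = suppress_seconds
        self.state = {}  # key -> (window_start_ts, suppressed_count)

    def on_anomaly(self, key, ts):
        """Returns an alert string to emit, or None while suppressed."""
        if key not in self.state:
            self.state[key] = (ts, 0)
            return f"ALERT {key}: anomaly detected"
        start, count = self.state[key]
        if ts - start < self.suppress_seconds:
            self.state[key] = (start, count + 1)  # count it, stay quiet
            return None
        # Window expired: clear state and summarize what was suppressed.
        del self.state[key]
        return (f"SUMMARY {key}: anomalous for {ts - start:.0f}s "
                f"with {count + 1} anomalous readings")
```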

Routing

Sink the enriched, deduplicated alerts to the appropriate channel:

  • PagerDuty or Opsgenie for on-call alerting
  • Slack or Teams for team-level visibility
  • A data store (ClickHouse, Elasticsearch, or a data warehouse) for historical analysis

Flink’s sink ecosystem supports all of these. For platforms like Streamkap that already connect your databases to downstream destinations, you can write anomaly events to a Kafka topic and let the existing pipeline deliver them wherever they need to go.

Tuning for Production

Anomaly detection systems in production face two recurring problems: too many false positives and missed true positives. Here are practical tuning strategies.

Start conservative. Begin with a z-score threshold of 4 or 5, not 3. You will miss some real anomalies, but you will also avoid alert fatigue. Lower the threshold gradually as you build confidence in the baseline.

Use minimum sample sizes. Do not compute z-scores until your window has at least 30 data points. Below that, the standard deviation estimate is unreliable and a single outlier can dominate the baseline.

Separate weekday and weekend baselines. Many business metrics have weekly seasonality. A z-score computed against the Monday baseline will fire false positives every Saturday morning. Key your baselines by day-of-week (or hour-of-day for metrics with diurnal patterns).
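Implementing that keying is simple: extend the baseline key with a day-of-week bucket. A sketch assuming epoch-second timestamps and UTC bucketing:

```python
from datetime import datetime, timezone

def weekday_key(metric, epoch_seconds):
    """Baseline key of (metric, day-of-week), Monday=0, so Saturday
    traffic is scored against Saturday history rather than Monday's."""
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).weekday()
    return (metric, day)
```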

Monitor the monitor. Track your anomaly detection system’s false positive rate and detection latency. If 90% of alerts are false positives, your threshold or baseline window needs adjustment. If detection latency exceeds your SLA, you may need to reduce window sizes or switch to EWMA-based approaches.

Version your baselines. When you change detection parameters (thresholds, window sizes, features), treat it like a code deployment. Log the change, run the new configuration in shadow mode alongside the old one, and compare results before switching over.

Putting It All Together

A production anomaly detection pipeline in Flink typically looks like this:

  1. Ingest events from Kafka, CDC streams (via Streamkap), or other sources into Flink.
  2. Compute baselines using sliding windows and Welford’s algorithm, keyed by entity and metric.
  3. Score each incoming event against its baseline using z-scores, IQR, or an ML model.
  4. Enrich anomaly events with context from reference data.
  5. Deduplicate alerts using stateful suppression timers.
  6. Route to PagerDuty, Slack, a data warehouse, or all three.

Each step is a Flink operator. The pipeline runs continuously, maintains its state through checkpoints, and recovers automatically from failures. You deploy it once and it watches your data around the clock.

The barrier to entry is lower than you might expect. A z-score detector over sliding windows, implemented in Flink SQL, can be up and running in an afternoon. That alone will catch the obvious anomalies. From there, you can iterate: add more metrics, experiment with ML models, tune your alerting, and build the feedback loops that make the system smarter over time.