
Engineering

February 25, 2026

11 min read

IoT Sensor Data Processing with Apache Flink

Learn how to process IoT sensor data with Apache Flink. Covers high-throughput ingestion, out-of-order event handling, downsampling, threshold alerting, and edge vs cloud processing patterns.

TL;DR:

• IoT sensor streams produce massive event volumes with unpredictable arrival order, making Flink's watermark-based event-time processing a natural fit.
• Downsampling raw sensor readings into time-bucketed aggregates reduces storage costs by 90% or more without losing actionable signal.
• Flink SQL window functions and pattern matching (MATCH_RECOGNIZE) enable threshold-based alerting directly in the stream processing layer.
• Streamkap's managed Flink and Kafka infrastructure handles the operational burden of running high-throughput IoT pipelines at scale.

A single industrial sensor reporting once per second generates 86,400 readings per day. A factory floor with 10,000 sensors produces 864 million data points daily. A fleet of connected vehicles, a network of environmental monitors, or a smart building system can easily push that number into the billions. The question is not whether you can collect all that data. The question is whether you can make sense of it fast enough to act on it.

Batch processing IoT data means you discover that a motor overheated after the damage is done. You find out about a pressure anomaly in yesterday’s report, hours after the pipeline shut down. For IoT, the value of data decays fast. A temperature spike matters right now. It matters much less tomorrow morning.

Apache Flink is built for exactly this kind of workload: high-volume, continuous data streams where event ordering is unpredictable and processing latency must stay low. In this guide, we will walk through how to build an IoT sensor processing pipeline with Flink, covering ingestion, out-of-order handling, downsampling, alerting, and the trade-offs between edge and cloud processing.

The IoT Data Challenge

IoT data has characteristics that make it different from typical application event streams:

Volume. Sensor networks produce orders of magnitude more events than web applications. A medium-sized manufacturing plant can easily generate millions of events per minute.

Disorder. Sensors often operate on constrained networks. A device might buffer readings during a connectivity outage and flush them all at once. Satellite-connected sensors can have latencies measured in minutes. Events frequently arrive out of order.

Variety. A single deployment might include temperature sensors, pressure gauges, accelerometers, flow meters, and GPS trackers, each with different schemas, sampling rates, and units of measurement.

Criticality. Unlike a delayed ad impression or a late product recommendation, a missed sensor alert can result in equipment damage, environmental incidents, or safety hazards. The processing pipeline needs to be reliable.

Flink addresses all of these with its event-time processing model, scalable parallel execution, and exactly-once state guarantees.

Ingestion Architecture

The first step is getting sensor data into a system where Flink can process it. The standard pattern uses MQTT or HTTP at the edge to collect readings from devices, a gateway that translates and forwards events to Kafka, and Flink consuming from Kafka topics.

Here is the Flink SQL table definition for a sensor event stream:

CREATE TABLE sensor_readings (
  device_id      STRING,
  sensor_type    STRING,     -- 'temperature', 'pressure', 'vibration', 'humidity'
  reading_value  DOUBLE,
  unit           STRING,     -- 'celsius', 'psi', 'mm_s', 'percent'
  device_time    TIMESTAMP(3),
  ingestion_time TIMESTAMP(3) METADATA FROM 'timestamp',
  WATERMARK FOR device_time AS device_time - INTERVAL '30' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'sensor-readings',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json',
  'scan.startup.mode' = 'latest-offset'
);

Two things to note here. First, we use device_time (the timestamp from the sensor itself) as the event-time column, not the Kafka ingestion timestamp. This is important because sensor devices may buffer readings, and we want Flink to process them according to when the measurement was actually taken. Second, the watermark allows 30 seconds of out-of-orderness, which is a reasonable default for sensors on cellular or satellite connections.
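To see what that 30-second bound does in practice, here is a minimal Python sketch of bounded out-of-orderness watermarking. The exact off-by-one details of Flink's implementation are simplified, and the `process` helper is purely illustrative:

```python
# Illustrative sketch of bounded out-of-orderness watermarking (not Flink code).
DELAY = 30  # seconds of allowed out-of-orderness, as in the WATERMARK clause

def process(event_times, delay=DELAY):
    """Classify each event as on-time or late under a trailing watermark."""
    max_event_time = float("-inf")
    results = []
    for t in event_times:
        watermark = max_event_time - delay  # watermark trails the max seen
        results.append((t, "late" if t <= watermark else "on-time"))
        max_event_time = max(max_event_time, t)
    return results

# A buffered burst arrives out of order: 140 is within 30s of the max
# event time seen so far (160), so it is still on time; 125 is not.
print(process([100, 160, 140, 125]))
```

The key intuition: the watermark never waits on wall-clock time, it trails the highest event timestamp observed, so a device flushing a buffer of old readings can push some of them past the cutoff.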

You also want a device registry table so you can enrich sensor readings with metadata like location, device model, and alert thresholds. Streamkap can stream this from your operational database:

CREATE TABLE device_registry (
  device_id       STRING,
  device_name     STRING,
  location        STRING,
  facility        STRING,
  device_model    STRING,
  install_date    DATE,
  alert_threshold DOUBLE,
  PRIMARY KEY (device_id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'device-registry-cdc',
  'properties.bootstrap.servers' = 'kafka:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);

Handling Out-of-Order Events

Out-of-order data is the norm in IoT, not the exception. A sensor on a remote oil rig might lose connectivity for 10 minutes and then transmit a burst of buffered readings. A fleet vehicle driving through a tunnel will queue up GPS points and flush them when it emerges.

Flink’s watermark strategy determines how long the system waits for late data before finalizing computations. With the 30-second bound we defined above, the watermark trails the highest event time seen so far by 30 seconds, so a window fires only once Flink has observed events timestamped 30 seconds past the window boundary, giving stragglers time to arrive.

But what about events that arrive after the watermark has already advanced past them? By default, Flink drops these. For IoT workloads, that is often unacceptable. You have two options:

Allowed lateness. In the DataStream API, you can configure an allowed lateness on windows, which keeps the window state around for additional time and re-fires the window computation when late events arrive.

Side outputs for late data. You can route late events to a separate Kafka topic or storage system for later reconciliation:

// DataStream API (Java) sketch; SensorReading, the aggregate, and the sink are placeholders.
final OutputTag<SensorReading> lateTag = new OutputTag<SensorReading>("late-readings") {};
SingleOutputStreamOperator<MinuteAggregate> main = readings
    .keyBy(r -> r.deviceId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .sideOutputLateData(lateTag)
    .aggregate(new MinuteAggregateFunction());
main.getSideOutput(lateTag).sinkTo(lateSink);  // writes to "late-sensor-readings"
For most IoT applications, a watermark delay of 30 seconds to 2 minutes combined with a side output for anything later covers the vast majority of scenarios. The late data side output can feed a batch reconciliation job that corrects aggregates once per hour.
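As a sketch of what that reconciliation job might do, the following Python folds late readings back into minute-bucketed aggregates. The aggregate layout and the `reconcile` helper are assumptions for illustration, not Streamkap or Flink APIs:

```python
from collections import defaultdict

def reconcile(aggregates, late_readings):
    """Fold late readings back into one-minute aggregates.

    aggregates:    {(device_id, minute_start): {"sum": float, "count": int}}
    late_readings: iterable of (device_id, event_time_seconds, value)
    """
    fixed = defaultdict(lambda: {"sum": 0.0, "count": 0},
                        {k: dict(v) for k, v in aggregates.items()})
    for device_id, ts, value in late_readings:
        bucket = (device_id, ts // 60 * 60)  # truncate to the minute boundary
        fixed[bucket]["sum"] += value
        fixed[bucket]["count"] += 1
    # Recompute the corrected average for every bucket.
    return {k: {**v, "avg": v["sum"] / v["count"]} for k, v in fixed.items()}

aggs = {("dev-1", 60): {"sum": 40.0, "count": 2}}
print(reconcile(aggs, [("dev-1", 75, 23.0)]))  # avg corrected to 21.0
```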

Downsampling Raw Readings

Storing every raw sensor reading at full resolution is expensive and often unnecessary. A temperature sensor reporting every second produces 31.5 million readings per year. If you are trending that temperature over time, one-minute averages give you the same visual trend at 1/60th of the storage cost.

Flink tumbling windows are the tool for this:

CREATE VIEW sensor_one_minute_agg AS
SELECT
  device_id,
  sensor_type,
  TUMBLE_START(device_time, INTERVAL '1' MINUTE)   AS window_start,
  -- TUMBLE_ROWTIME preserves the event-time attribute; TUMBLE_START alone
  -- yields a plain timestamp that downstream windows cannot build on.
  TUMBLE_ROWTIME(device_time, INTERVAL '1' MINUTE) AS window_rowtime,
  AVG(reading_value)    AS avg_value,
  MIN(reading_value)    AS min_value,
  MAX(reading_value)    AS max_value,
  STDDEV(reading_value) AS stddev_value,
  COUNT(*)              AS reading_count
FROM sensor_readings
GROUP BY
  device_id,
  sensor_type,
  TUMBLE(device_time, INTERVAL '1' MINUTE);

This produces one row per device per sensor type per minute, containing the average, minimum, maximum, standard deviation, and count of raw readings. You get trend visibility, anomaly detection inputs, and data quality metrics (the count tells you if a sensor is dropping readings) in a single pass. The window_rowtime column carries the event-time attribute forward so the next downsampling tier can window on it.

For long-term storage, you can add a second level of downsampling:

CREATE VIEW sensor_one_hour_agg AS
SELECT
  device_id,
  sensor_type,
  TUMBLE_START(window_rowtime, INTERVAL '1' HOUR) AS hour_start,
  -- AVG of minute averages weights each minute equally; for an exact hourly
  -- mean use SUM(avg_value * reading_count) / SUM(reading_count).
  AVG(avg_value)     AS avg_value,
  MIN(min_value)     AS min_value,
  MAX(max_value)     AS max_value,
  SUM(reading_count) AS total_readings
FROM sensor_one_minute_agg
GROUP BY
  device_id,
  sensor_type,
  TUMBLE(window_rowtime, INTERVAL '1' HOUR);

This cascaded downsampling gives you three tiers of granularity: raw data retained for a few days for debugging, one-minute aggregates retained for weeks for operational dashboards, and one-hour aggregates retained indefinitely for long-term trend analysis.
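The one-minute aggregation can also be mirrored in plain Python to see exactly what each output row contains. This is an illustrative re-implementation for testing intuition, not Flink code:

```python
import math
from collections import defaultdict

def downsample_minute(readings):
    """One-minute aggregates per (device_id, sensor_type), mirroring the SQL view."""
    buckets = defaultdict(list)
    for device_id, sensor_type, ts, value in readings:
        buckets[(device_id, sensor_type, ts // 60 * 60)].append(value)
    out = {}
    for key, vals in buckets.items():
        n = len(vals)
        mean = sum(vals) / n
        # Sample standard deviation, matching SQL STDDEV (undefined for n = 1).
        stddev = math.sqrt(sum((v - mean) ** 2 for v in vals) / (n - 1)) if n > 1 else None
        out[key] = {"avg": mean, "min": min(vals), "max": max(vals),
                    "stddev": stddev, "count": n}
    return out

rows = [("dev-1", "temperature", 0, 20.0), ("dev-1", "temperature", 30, 22.0)]
print(downsample_minute(rows))
```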

Threshold-Based Alerting

The most immediate value of processing IoT data in real time is alerting. When a temperature exceeds a safe range, when vibration indicates bearing wear, or when pressure drops below normal, you need to know now.

Simple threshold alerts are straightforward in Flink SQL. Join the sensor readings against the device registry to get per-device thresholds, then filter:

CREATE VIEW threshold_alerts AS
SELECT
  s.device_id,
  d.device_name,
  d.location,
  d.facility,
  s.sensor_type,
  s.reading_value,
  d.alert_threshold,
  s.device_time
FROM sensor_readings s
JOIN device_registry d
  ON s.device_id = d.device_id
WHERE s.reading_value > d.alert_threshold;

This emits an alert event every time a single reading exceeds the threshold. In practice, you probably do not want an alert for every single spike. Sensors are noisy, and a momentary reading above threshold could be normal measurement variance.

A more practical approach alerts only when the average over a short window exceeds the threshold:

CREATE VIEW sustained_threshold_alerts AS
SELECT
  a.device_id,
  d.device_name,
  d.location,
  a.sensor_type,
  a.avg_reading,
  d.alert_threshold,
  a.window_start
FROM (
  -- Window first: a regular join materializes the event-time attribute,
  -- so TUMBLE must run before joining the registry.
  SELECT
    device_id,
    sensor_type,
    AVG(reading_value) AS avg_reading,
    TUMBLE_START(device_time, INTERVAL '5' MINUTE) AS window_start
  FROM sensor_readings
  GROUP BY
    device_id,
    sensor_type,
    TUMBLE(device_time, INTERVAL '5' MINUTE)
) a
JOIN device_registry d
  ON a.device_id = d.device_id
WHERE a.avg_reading > d.alert_threshold;

This fires only when the five-minute average exceeds the threshold, filtering out momentary noise while still catching sustained anomalies.
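The difference between the two strategies is easy to demonstrate with a toy window. The `spike_alerts` and `sustained_alert` helpers below are illustrative, not part of any Flink API:

```python
def spike_alerts(values, threshold):
    """Per-reading alerting: fires on every momentary excursion."""
    return [v for v in values if v > threshold]

def sustained_alert(values, threshold):
    """Window-average alerting: fires only if the mean exceeds the threshold."""
    return sum(values) / len(values) > threshold

window = [70, 71, 96, 70, 72]  # five-minute window with one noisy spike
print(spike_alerts(window, 90))     # [96]  -> one spurious alert
print(sustained_alert(window, 90))  # False -> the spike is absorbed by the mean
```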

Pattern-Based Anomaly Detection

Some anomalies are not about absolute values but about patterns. A motor whose vibration increases steadily over 30 minutes might be heading toward failure, even if no single reading exceeds the threshold. Flink’s MATCH_RECOGNIZE clause lets you define event patterns:

SELECT *
FROM sensor_readings
MATCH_RECOGNIZE (
  PARTITION BY device_id
  ORDER BY device_time
  MEASURES
    B.reading_value        AS start_value,
    LAST(A.reading_value)  AS end_value,
    B.device_time          AS start_time,
    LAST(A.device_time)    AS end_time,
    COUNT(A.*) + 1         AS reading_count
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  -- B anchors the run: PREV() is NULL on the first row of a match, so the
  -- rising condition alone can never start one. B is undefined and matches
  -- any row; four or more rising rows follow for a run of five or more.
  PATTERN (B A{4,})
  DEFINE
    A AS A.reading_value > PREV(A.reading_value) * 1.02
) AS matched;

This pattern detects five or more consecutive readings where each is at least 2% higher than the previous one. It captures a steadily increasing trend that might indicate equipment degradation. The PARTITION BY device_id ensures the pattern matching runs independently for each device.
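The same rising-run logic can be expressed in a few lines of Python, which is a handy way to sanity-check the pattern's parameters before deploying the MATCH_RECOGNIZE job. `rising_runs` is a hypothetical helper, not a Flink API:

```python
def rising_runs(values, min_len=5, ratio=1.02):
    """Find runs of at least min_len consecutive readings where each is at
    least `ratio` times the previous one (mirrors the pattern above)."""
    runs, start = [], 0
    for i in range(1, len(values) + 1):
        # A run breaks at end of input or when the increase stalls.
        if i == len(values) or values[i] <= values[i - 1] * ratio:
            if i - start >= min_len:
                runs.append((start, i - 1))  # inclusive index range of the run
            start = i  # restart after the run, like AFTER MATCH SKIP PAST LAST ROW
    return runs

vibration = [1.00, 1.03, 1.07, 1.11, 1.15, 1.20, 1.00]
print(rising_runs(vibration))  # [(0, 5)]: a six-reading rising run
```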

Edge vs Cloud Processing

Not all IoT processing belongs in the cloud. Network bandwidth, latency requirements, and data sovereignty constraints often push some processing closer to the data source.

Edge processing handles time-sensitive alerts and local data reduction. A Flink job running on an edge gateway can filter noise, downsample readings, and fire immediate alerts without a round trip to the cloud. This is where sub-second latency matters most.

Cloud processing handles cross-device correlation, historical analysis, and model training. Patterns that span multiple devices or require access to centralized reference data are better suited to a cloud-based Flink cluster.

A common architecture uses both:

  1. Edge Flink jobs perform per-device filtering, downsampling, and simple threshold alerting.
  2. Downsampled data streams to cloud Kafka topics.
  3. Cloud Flink jobs perform cross-device correlation, trend analysis, and pattern detection.
  4. Results write to time-series databases, data warehouses, and alerting systems.

Streamkap’s managed Kafka and Flink services cover the cloud tier of this architecture. Your edge processing handles the initial data reduction, and Streamkap’s infrastructure handles the heavy lifting of cross-device analytics, long-term aggregation, and integration with your downstream systems.

Writing Results to Time-Series Stores

IoT aggregates naturally fit time-series databases. Here is a sink definition that writes downsampled data to a JDBC-compatible store:

CREATE TABLE sensor_aggregates_sink (
  device_id    STRING,
  sensor_type  STRING,
  window_start TIMESTAMP(3),
  avg_value    DOUBLE,
  min_value    DOUBLE,
  max_value    DOUBLE,
  stddev_value DOUBLE,
  reading_count BIGINT,
  PRIMARY KEY (device_id, sensor_type, window_start) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:postgresql://timeseries-db:5432/iot_data',
  'table-name' = 'sensor_aggregates',
  'driver' = 'org.postgresql.Driver'
);

INSERT INTO sensor_aggregates_sink
SELECT * FROM sensor_one_minute_agg;

For alerting, you would write to a Kafka topic that feeds your notification system:

CREATE TABLE alert_events_sink (
  device_id    STRING,
  device_name  STRING,
  location     STRING,
  sensor_type  STRING,
  avg_reading  DOUBLE,
  threshold    DOUBLE,
  window_start TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'sensor-alerts',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json'
);

INSERT INTO alert_events_sink
SELECT * FROM sustained_threshold_alerts;

Monitoring the Pipeline Itself

An IoT processing pipeline that goes down silently is a liability. You should track:

Event lag. The difference between the current wall-clock time and the event time of the most recently processed event. If lag grows, your pipeline is falling behind the data rate.

Checkpoint duration. Long checkpoint times indicate that state has grown too large or that the state backend is under pressure. For IoT workloads with millions of device keys, checkpoint performance is a leading indicator of trouble.

Device heartbeat. Track the last-seen timestamp per device. If a device stops reporting, you want to know whether the device is down or whether the pipeline dropped its events.

Watermark progress. If watermarks stall, it usually means one partition has stopped producing events. A single stalled partition can hold back the entire pipeline.
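A minimal sketch of tracking event lag and device heartbeats, assuming a hypothetical `PipelineMonitor` helper fed from your consumer:

```python
class PipelineMonitor:
    """Track event lag and per-device heartbeats (hypothetical helper)."""

    def __init__(self, heartbeat_timeout=300.0):
        self.heartbeat_timeout = heartbeat_timeout  # seconds
        self.last_seen = {}          # device_id -> latest event time seen
        self.max_event_time = 0.0

    def record(self, device_id, event_time):
        self.last_seen[device_id] = max(self.last_seen.get(device_id, 0.0), event_time)
        self.max_event_time = max(self.max_event_time, event_time)

    def event_lag(self, now):
        """Wall-clock time minus the newest processed event time."""
        return now - self.max_event_time

    def silent_devices(self, now):
        """Devices that have not reported within the heartbeat timeout."""
        return [d for d, t in self.last_seen.items()
                if now - t > self.heartbeat_timeout]

m = PipelineMonitor()
m.record("dev-1", 900.0)    # last heard from 400s before "now"
m.record("dev-2", 1290.0)
print(m.event_lag(1300.0))       # 10.0
print(m.silent_devices(1300.0))  # ['dev-1']
```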

From Raw Signals to Operational Intelligence

Processing IoT sensor data with Flink transforms a firehose of raw readings into actionable intelligence. Downsampling cuts storage costs. Threshold alerting catches problems in seconds instead of hours. Pattern detection identifies degradation trends before they become failures.

The operational complexity of running this at scale is real: Kafka clusters that ingest millions of events per minute, and Flink clusters that maintain terabytes of state. This is where a managed platform pays for itself. Streamkap handles the infrastructure so your team stays focused on the sensor processing logic and the domain-specific alerting rules that deliver value to your operations.

Start with a single sensor type. Build the ingestion, downsampling, and alerting pipeline. Measure the time between a real-world event and your team being notified. Then expand to more sensors, more facilities, and more sophisticated pattern detection. The architecture scales horizontally, and with managed infrastructure, so does your team’s ability to operate it.