Real-Time Feature Computation: From Raw Events to ML-Ready Features
How to compute machine learning features in real time using stream processing. Covers feature types, windowed aggregations, feature stores, and the training-serving skew problem.
Most ML models in production don’t fail because of bad algorithms. They fail because the features they depend on are stale, inconsistent, or computed differently between training and serving. If your fraud detection model was trained on “transactions in the last hour” calculated by a nightly SQL job, but your serving pipeline computes the same feature from a streaming window, you have a problem. The model sees data it was never trained on, and its predictions degrade silently.
Real-time feature computation solves this by building features directly from live event streams. Instead of waiting for batch jobs to materialize features into a warehouse, you compute them continuously as events arrive. This article walks through how to do that: the types of features you’ll encounter, how to compute them with windowed aggregations in Flink, how to avoid training-serving skew, and where feature stores fit into the picture.
The Three Classes of ML Features
Not all features need to be computed in real time. Understanding the spectrum helps you decide where to invest engineering effort.
Static features change rarely or never. A user’s country of registration, their account type, or a product’s category - these live in a database and can be looked up directly. They’re the simplest to manage. You join them in at serving time from a key-value store or a read replica.
Batch features are computed periodically from historical data. Think “total lifetime spend,” “average order value over the last 90 days,” or “number of support tickets filed this quarter.” These are typically produced by scheduled SQL queries or Spark jobs that run hourly or daily. They’re accurate as of the last batch run, but they’re always at least somewhat stale.
Real-time features reflect what’s happening right now. “Number of transactions in the last 5 minutes,” “average cart value in the current session,” “ratio of failed to successful API calls in the last 10 minutes.” These features are computed from live event streams and are the hardest to get right. They’re also the most valuable for models that need to react to recent behavior - fraud detection, dynamic pricing, personalized recommendations, and anomaly detection all depend heavily on them.
The practical reality is that most production models use a mix of all three. A fraud model might combine static features (account age, country) with batch features (historical spending patterns) and real-time features (recent transaction velocity). The challenge is computing and serving all of them consistently.
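To make the mixing concrete, here is a minimal sketch of assembling one serving-time feature vector from the three classes. The store lookups are stubbed with plain dicts and all names are illustrative; in production these would be a key-value store, a warehouse-backed cache, and a streaming sink respectively.

```python
# Stub stores for the three feature classes (illustrative data).
STATIC = {"user_42": {"country": "DE", "account_type": "premium"}}
BATCH = {"user_42": {"avg_order_value_90d": 57.2}}
REALTIME = {"user_42": {"txn_count_5m": 3}}

def feature_vector(user_id: str) -> dict:
    """Merge static, batch, and real-time features for one entity."""
    vec = {}
    for source in (STATIC, BATCH, REALTIME):
        vec.update(source.get(user_id, {}))
    return vec

print(feature_vector("user_42"))
# {'country': 'DE', 'account_type': 'premium', 'avg_order_value_90d': 57.2, 'txn_count_5m': 3}
```

The point of the merge order is that each class has its own freshness and failure mode; keeping them in separate stores lets each be updated on its own cadence.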
Computing Features with Windowed Aggregations
Real-time features almost always involve some form of aggregation over a time window. “Count of events in the last N minutes” is the canonical example, but you’ll also see averages, sums, percentiles, distinct counts, and ratios.
Flink is a natural fit for this work because it was built around the concept of windowed, stateful computation over event streams. Let’s look at the three window types you’ll use most.
Tumbling Windows
A tumbling window divides time into fixed, non-overlapping intervals. A 1-hour tumbling window starting at midnight produces buckets for 00:00-01:00, 01:00-02:00, and so on. Each event falls into exactly one bucket.
This is the simplest window type and works well for features like “transactions per hour” or “page views per day.” The tradeoff is that the feature value is only updated at the window boundary. A transaction at 12:01 and one at 12:59 land in the same bucket, but the feature value won’t be emitted until 1:00.
In Flink SQL, this looks like:
```sql
SELECT
  user_id,
  COUNT(*) AS txn_count,
  SUM(amount) AS txn_total,
  TUMBLE_END(event_time, INTERVAL '1' HOUR) AS window_end
FROM transactions
GROUP BY
  user_id,
  TUMBLE(event_time, INTERVAL '1' HOUR)
```
Sliding Windows
A sliding window has a fixed size but advances by a smaller step. A 1-hour window that slides every 5 minutes means you’re always computing “events in the last 60 minutes,” updated every 5 minutes. Events can belong to multiple overlapping windows.
Sliding windows are more useful for real-time features because they provide smoother, more frequently updated values. The cost is higher computational overhead - each event contributes to multiple windows.
```sql
SELECT
  user_id,
  COUNT(*) AS txn_count_1h,
  AVG(amount) AS avg_amount_1h,
  HOP_END(event_time, INTERVAL '5' MINUTE, INTERVAL '1' HOUR) AS window_end
FROM transactions
GROUP BY
  user_id,
  HOP(event_time, INTERVAL '5' MINUTE, INTERVAL '1' HOUR)
```
Session Windows
Session windows group events by activity gaps. If a user generates events within 30 minutes of each other, they’re in the same session. Once 30 minutes pass with no events, the session closes. This is useful for features like “number of actions in the current session” or “session duration.”
Session windows are more complex to implement because the window boundaries depend on the data itself, not a fixed clock. Flink handles this natively, but you need to think carefully about your gap duration and what happens when sessions are very long or very short.
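The gap-based grouping can be sketched in a few lines. This is a simplified single-key version of what Flink does with per-key state and window merging; it assumes timestamps are epoch seconds and ignores out-of-order arrival, which a real pipeline must handle via watermarks.

```python
GAP_SECONDS = 30 * 60  # session gap; tune per product

def sessionize(timestamps):
    """Group epoch-second timestamps into sessions: a gap of more
    than GAP_SECONDS of inactivity closes the current session."""
    sessions, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > GAP_SECONDS:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

print(sessionize([0, 600, 3000]))  # [[0, 600], [3000]]
```

From each session you can then derive features such as event count (`len(session)`) or duration (`session[-1] - session[0]`).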
Event Time vs. Processing Time
A common mistake is computing features based on processing time (when the event reaches your system) rather than event time (when the event actually occurred). If your pipeline has any latency or reprocesses data, processing-time features will be wrong.
Flink’s event-time processing with watermarks handles this correctly. Watermarks track how far along in event time the stream has progressed, allowing Flink to know when a window is complete even if some events arrived late. This is important for feature consistency - if you reprocess a day’s worth of data, you should get the same feature values you would have gotten in real time.
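The core watermark mechanic is small enough to sketch: the watermark trails the maximum event time seen by a fixed out-of-orderness bound, and a window may be finalized once the watermark passes its end. This mirrors the spirit of Flink's bounded-out-of-orderness strategy; the bound here is illustrative.

```python
MAX_OUT_OF_ORDER = 10  # seconds of allowed lateness (illustrative)

class WatermarkTracker:
    """Tracks event-time progress for a stream of (possibly out-of-order)
    events and reports when a window can safely be considered complete."""

    def __init__(self):
        self.max_event_time = float("-inf")

    def observe(self, event_time: float) -> float:
        self.max_event_time = max(self.max_event_time, event_time)
        return self.watermark()

    def watermark(self) -> float:
        # Watermark = latest event time seen, minus the allowed lateness.
        return self.max_event_time - MAX_OUT_OF_ORDER

    def window_complete(self, window_end: float) -> bool:
        # Safe to emit: no event older than window_end is still expected.
        return self.watermark() >= window_end
```

Note that an out-of-order event (95 arriving after 100) does not move the watermark backward, which is exactly what makes replayed and live runs produce the same window results.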
The Training-Serving Skew Problem
Training-serving skew is the single biggest risk in real-time ML systems. It happens when features are computed differently during training than during serving. The symptoms are subtle: your model has excellent offline metrics but performs poorly in production. Here’s why it happens and how to fix it.
How Skew Creeps In
During training, features are typically materialized by batch jobs. A data scientist writes a SQL query against a warehouse to compute “number of transactions in the last hour for each user at each point in time.” The query uses precise timestamp arithmetic and produces clean, consistent results.
During serving, that same feature is computed by a streaming pipeline processing live events. The streaming computation uses different windowing semantics, handles null values differently, or has subtle time-boundary differences. Maybe the batch query uses inclusive boundaries while the streaming window uses exclusive boundaries. Maybe the batch query fills missing values with zero while the streaming pipeline emits no value at all.
The result: the model receives feature distributions at serving time that don’t match what it saw during training. It’s not a bug in the traditional sense - both pipelines are “correct” - but the inconsistency degrades predictions.
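The boundary-semantics case is easy to demonstrate. The same "events in the last hour as of time t" feature, computed once with an inclusive lower bound and once with an exclusive one, disagrees for any event sitting exactly on the boundary; both results are defensible, and that is precisely the problem.

```python
def count_inclusive(events, t, window=3600):
    # Batch-style: t - window <= e <= t
    return sum(1 for e in events if t - window <= e <= t)

def count_exclusive(events, t, window=3600):
    # Streaming-style: t - window < e <= t
    return sum(1 for e in events if t - window < e <= t)

events = [0, 1800, 3600]  # one event exactly one hour before t
t = 3600
print(count_inclusive(events, t))  # 3
print(count_exclusive(events, t))  # 2
```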
Solutions to Training-Serving Skew
Use the same computation for both. The gold standard is to compute features exactly once, using the same code path, and log them at serving time for later use in training. This is sometimes called “log and wait” - you deploy your streaming feature pipeline, log every feature vector it produces, and train your next model version on those logged features. This eliminates skew entirely because training data is literally the output of the serving pipeline.
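The logging half of "log and wait" can be as simple as appending one JSON record per prediction. The schema and function names below are hypothetical; the essential property is that the logged `features` field is byte-for-byte what the model received.

```python
import io
import json
import time

def serve_and_log(entity_id, features, model_score, log_file):
    """Log the exact feature vector used for a prediction, so the
    logged rows can later become skew-free training data."""
    record = {
        "entity_id": entity_id,
        "served_at": time.time(),
        "features": features,  # exactly what the model saw
        "score": model_score,
    }
    log_file.write(json.dumps(record) + "\n")
    return record

# Demo with an in-memory buffer standing in for a log sink.
buf = io.StringIO()
serve_and_log("user_42", {"txn_count_5m": 3}, 0.87, buf)
logged = json.loads(buf.getvalue())
```

The "wait" half is organizational rather than technical: the next model version can only train once enough logged vectors have accumulated.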
Replay your streaming pipeline over historical data. If you have access to historical event streams (stored in Kafka, S3, or an event store), you can run your Flink job over that historical data to produce training features. Because you’re using the same Flink job with event-time semantics, the features will match what the pipeline would produce in real time. Streamkap’s CDC pipelines can feed historical and live data through the same Flink topology, making this replay pattern straightforward.
Shared feature definitions. If you can’t use the same code, at least use the same definitions. A feature registry that specifies the exact computation logic - window size, aggregation function, null handling, time boundary semantics - can serve as a contract between batch and streaming pipelines. This doesn’t eliminate skew, but it makes discrepancies easier to find and fix.
Feature Stores: Bridging Batch and Real-Time
A feature store is the infrastructure that ties all of this together. At its simplest, it’s a dual-layer storage system: an offline store (typically a data warehouse or object storage) for training and a low-latency online store (typically Redis, DynamoDB, or a similar key-value store) for serving.
What a Feature Store Does
Consistent feature access. Both training pipelines and serving endpoints read features through the same API. The feature store handles the routing: training requests pull from the offline store, serving requests pull from the online store.
Feature sharing and reuse. If your fraud team and your recommendations team both need “user transaction count in the last hour,” they can register it once in the feature store and both consume it. Without a feature store, you end up with multiple teams independently computing the same feature, often with slight differences.
Point-in-time correctness. For training, you need to know what the feature values were at the time of the training example, not what they are now. A good feature store supports point-in-time joins, ensuring you don’t accidentally train on future data (a form of data leakage).
Feature monitoring. Feature stores can track distributions over time and alert when feature values drift. If the mean of “transactions per hour” suddenly doubles, you want to know before your model starts making bad predictions.
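The point-in-time lookup mentioned above reduces to "latest value with a timestamp at or before the training example's timestamp." A minimal sketch over a sorted per-entity history:

```python
from bisect import bisect_right

def point_in_time_value(history, as_of):
    """Return the latest feature value with timestamp <= as_of.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None if no value existed yet: using a later value would
    leak future information into training.
    """
    timestamps = [ts for ts, _ in history]
    i = bisect_right(timestamps, as_of)
    return history[i - 1][1] if i else None

history = [(10, 1), (20, 3), (30, 7)]
print(point_in_time_value(history, 25))  # 3 (not the later value 7)
print(point_in_time_value(history, 5))   # None (feature did not exist yet)
```

Production feature stores implement this as a point-in-time join across many entities and features at once, but the leakage-avoidance rule is the same.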
Do You Actually Need One?
Not always. If you have a single model with a handful of features, writing your streaming computation results directly to Redis and reading from Redis at serving time is perfectly fine. A feature store adds value when:
- Multiple models share overlapping features
- Training-serving skew is a recurring problem
- You need point-in-time correctness for training data
- Feature governance and discovery matter to your organization
Tools like Feast, Tecton, and Feathr each take a different approach. Feast is open source and gives you full control. Tecton is managed and handles the streaming computation for you. The right choice depends on your team’s engineering capacity and your willingness to operate infrastructure.
Practical Example: Fraud Detection Features
Let’s make this concrete with a fraud detection system. The model needs to decide in under 100 milliseconds whether a card transaction is fraudulent.
Static features (looked up at serving time):
- Account age in days
- Card type (credit, debit, prepaid)
- Country of card issuance
Batch features (updated hourly or daily):
- Average transaction amount over the last 30 days
- Number of distinct merchants in the last 90 days
- Historical chargeback rate
Real-time features (computed from the live stream):
- Transaction count in the last 10 minutes
- Sum of transaction amounts in the last hour
- Number of distinct countries transacted in over the last 30 minutes
- Time since last transaction (in seconds)
- Ratio of declined to approved transactions in the last hour
The real-time features are where Flink shines. You set up a keyed stream partitioned by card ID, apply sliding windows for the count and sum features, and maintain per-key state for “time since last transaction.” The results are written to a low-latency store (Redis, for instance) keyed by card ID. When a new transaction arrives, the serving layer reads the current feature values from Redis, combines them with static and batch features, and passes the full vector to the model.
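Stripped of Flink, the per-card state logic looks like the sketch below: a deque of recent timestamps per card gives the windowed count, and a single scalar per card gives "time since last transaction." This is an illustrative in-memory version; Flink keeps the equivalent information in keyed state, with windows handling eviction and checkpoints handling durability.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 600  # 10-minute window

class CardFeatureState:
    """Per-card real-time features from a stream of transaction events."""

    def __init__(self):
        self.events = defaultdict(deque)  # card_id -> recent timestamps
        self.last_seen = {}               # card_id -> last event timestamp

    def update(self, card_id, ts):
        since_last = (
            ts - self.last_seen[card_id] if card_id in self.last_seen else None
        )
        self.last_seen[card_id] = ts
        q = self.events[card_id]
        q.append(ts)
        # Evict timestamps that have fallen out of the window.
        while q and q[0] <= ts - WINDOW_SECONDS:
            q.popleft()
        return {"txn_count_10m": len(q), "secs_since_last_txn": since_last}
```

Calling `update` on each arriving transaction returns the fresh feature values that would then be written to Redis under the card's key.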
CDC as a Feature Source
Change Data Capture deserves special attention as a feature source. Many of the events you want to compute features from don’t originate in Kafka - they’re rows being inserted, updated, or deleted in operational databases. Order placements, inventory changes, customer profile updates, payment status transitions - these are all database changes that carry signal for ML models.
CDC captures these changes as a stream of events that can be processed exactly like any other event stream. A new row in the orders table becomes an event. An update to payment_status becomes an event. You can feed these events into Flink, compute windowed aggregations, and produce features just as you would with native Kafka events.
Streamkap streams CDC events from databases like PostgreSQL, MySQL, and MongoDB into Kafka with sub-second latency. From there, the events flow into Flink for feature computation. This means your feature pipeline can react to database changes as fast as they happen, without polling queries or scheduled extracts.
This pattern is especially useful for features that combine operational and behavioral data. For example, a recommendation model might need “number of items added to cart in the last 5 minutes” (from clickstream events) combined with “current inventory level” (from a CDC stream on the inventory table). Both can be processed in the same Flink topology.
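A sketch of the CDC-to-feature mapping: the envelope fields (`op`, `after`) follow Debezium's change-event format, which most CDC tooling emits or can emit; the inventory feature itself is illustrative.

```python
def cdc_to_feature(change_event):
    """Map an insert/update on the inventory table to an online feature.

    Debezium op codes: "c" = create, "u" = update, "d" = delete,
    "r" = snapshot read. Deletes are skipped in this sketch.
    """
    if change_event["op"] not in ("c", "u"):
        return None
    row = change_event["after"]  # the row state after the change
    return {
        "entity_id": f"item_{row['item_id']}",
        "feature": "current_inventory_level",
        "value": row["quantity"],
    }

event = {"op": "u", "after": {"item_id": 7, "quantity": 42}}
print(cdc_to_feature(event))
```

In the combined pipeline, a map like this runs alongside the clickstream aggregations in the same Flink topology, so both feature families share one set of event-time semantics.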
Writing Features to Serving Stores
The last mile of the pipeline is writing computed features to a store where your model can read them at serving time. The requirements are straightforward: low read latency (under 10 milliseconds), high availability, and the ability to handle the write throughput of your feature pipeline.
Redis is the most common choice for online feature serving. It offers sub-millisecond reads, supports TTLs for automatic expiration of stale features, and handles high write throughput well. The downside is cost at scale - keeping large feature sets in memory gets expensive.
DynamoDB or Cassandra work when you need durability and can tolerate slightly higher read latency (single-digit milliseconds). They’re also more cost-effective for large feature sets.
Hybrid approaches use Redis for the hottest features (those read on every request) and a cheaper store for less frequently accessed features. Many feature stores implement this pattern internally.
Your Flink job writes to the serving store using a sink connector. For Redis, this means issuing SET commands (or HSET for structured features) keyed by the entity ID (user ID, card ID, session ID). TTLs should match your feature’s window size - there’s no point keeping a “last 10 minutes” feature in the store for an hour.
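The sink logic itself is small. The sketch below writes each feature set as a Redis hash keyed by entity and expires it in step with the window size; `client` is anything exposing Redis-like `hset`/`expire` methods (redis-py's `Redis` fits), and the key format is an assumption, not a standard.

```python
WINDOW_SECONDS = 600  # match the feature's window size

def write_features(client, entity_id, features):
    """Write one entity's features as a hash and set a matching TTL,
    so a 'last 10 minutes' feature expires once it can't be fresh."""
    key = f"features:card:{entity_id}"
    client.hset(key, mapping={k: str(v) for k, v in features.items()})
    client.expire(key, WINDOW_SECONDS)
    return key
```

At serving time the read side is a single `HGETALL` on the same key, which keeps the lookup comfortably under the latency budget.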
Putting It All Together
The full pipeline looks like this: events flow from source systems - through CDC for database changes, through Kafka for native event streams - into Flink, where windowed aggregations and stateful computations produce feature values. Those values are written to a serving store (Redis, DynamoDB, or a feature store’s online layer). At prediction time, the model serving layer reads the latest features, combines them with static and batch features, and runs inference.
The hard part isn’t any single component. It’s making sure the feature computation is consistent between training and serving, that late-arriving events are handled correctly, and that feature freshness meets your model’s latency requirements. Get those right, and you have a feature pipeline that keeps your models accurate on data that’s seconds old rather than hours or days old.