The Streaming Feature Store Pattern: Real-Time ML Features from CDC
Learn how streaming feature stores eliminate training-serving skew by computing ML features from CDC events in real time instead of batch jobs.
A feature store sits between your raw data and your ML models. Its job: pre-compute features (derived values like “customer lifetime value” or “average order frequency in last 30 days”) and serve them fast enough for real-time inference.
Most feature stores run on batch. A nightly job reads from the warehouse, computes features, and writes them to a serving layer. This works until it doesn’t — and it stops working the moment your models need features that reflect what’s happening now, not what happened 12 hours ago.
Streaming feature stores replace those batch jobs with continuous computation on CDC event streams. Every database change triggers a feature recomputation, and the serving layer updates within seconds.
What a Feature Store Actually Does
Strip away the marketing and a feature store has three responsibilities:
- Compute — Transform raw data into features (aggregations, ratios, lookups, time-windowed calculations)
- Store — Keep feature values in a format optimized for the access pattern (online serving vs offline training)
- Serve — Return feature values at inference time with low latency and at training time with point-in-time correctness
```
Raw Data → Feature Computation → Feature Store → Model Serving
                                       │
                                       └──→ Model Training
```
The computation step is where batch vs streaming matters most.
The Batch Feature Store Problem
Here’s a typical batch feature pipeline:
1. Warehouse ETL runs at 2:00 AM
2. Feature computation job starts at 3:00 AM
3. Features written to online store by 4:00 AM
4. Model uses features that are 4-26 hours stale
This creates two problems:
Problem 1: Staleness
A fraud detection model using “number of transactions in last hour” computed from batch data is always looking at yesterday’s transactions. A customer who made 50 transactions in the last 30 minutes looks identical to one who made 2.
Features that lose value with staleness:
- Transaction velocity (count per time window)
- Account balance
- Inventory levels
- Session activity (pages viewed, actions taken)
- Support ticket status
- Cart contents
Features that tolerate staleness:
- Customer lifetime value (changes slowly)
- Industry classification
- Geographic region
- Historical averages over long windows (90-day purchase frequency)
Problem 2: Training-Serving Skew
Training-serving skew is the silent model killer. It happens when the features your model trains on don’t match the features it sees during inference.
Example: During training, you compute avg_order_value_30d with a precise SQL window function over historical data. During serving, the batch job computes the same metric but uses a different time boundary (start of batch run vs exact 30 days before the event). The values are close but not identical, and the model’s accuracy degrades by 2-5% without any code change.
Streaming feature stores reduce this gap because the same computation logic runs continuously on the same data source. The online value and the offline value come from the same pipeline, just at different points in time.
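To make the skew concrete, here is a minimal sketch (with invented order data) of the same 30-day average computed two ways: with an exact event-time boundary, as a training query would, and with a midnight batch-run boundary. The two conventions disagree whenever an order falls between the boundaries:

```python
from datetime import datetime, timedelta

# Invented order history for one customer: (timestamp, amount)
orders = [
    (datetime(2025, 4, 21, 8, 0), 120.0),   # just outside the exact 30-day window
    (datetime(2025, 5, 1, 12, 0), 80.0),
    (datetime(2025, 5, 18, 9, 30), 100.0),
]

def avg_order_value_30d(orders, window_end):
    """Average order amount in the 30 days ending at window_end."""
    cutoff = window_end - timedelta(days=30)
    in_window = [amt for ts, amt in orders if cutoff < ts <= window_end]
    return sum(in_window) / len(in_window) if in_window else 0.0

event_time = datetime(2025, 5, 21, 10, 15)  # exact boundary, as used in training
batch_run = datetime(2025, 5, 21, 0, 0)     # midnight boundary of the batch job

training_value = avg_order_value_30d(orders, event_time)  # excludes the 4/21 order
serving_value = avg_order_value_30d(orders, batch_run)    # includes it
```

The single order straddling the two cutoffs shifts the feature value, and the model was trained on one convention while serving sees the other.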
Streaming Feature Store Architecture
```
Source Databases
       │
       ↓ (CDC)
Event Stream (Kafka)
       │
       ├──→ Stream Processor ──→ Online Store (Redis/DynamoDB)
       │    (real-time features)         ↓
       │                           Model Serving
       │
       └──→ Event Log ──→ Offline Store (S3/Data Lake)
            (raw events)          ↓
                            Model Training
```
CDC as the Foundation
CDC gives you a reliable, ordered stream of every data change. This is the input to your feature computation. Instead of querying the database on a schedule, you react to changes as they happen.
For a deeper look at connecting CDC to downstream systems, see CDC destination patterns.
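As a reference point, a Debezium-style change event carries the operation type, a capture timestamp, and the row state before and after the change. The exact shape depends on your connector configuration, so treat the fields below as illustrative:

```python
# Illustrative Debezium-style CDC event; the exact field layout depends on
# your connector configuration.
event = {
    "op": "u",                      # c = create, u = update, d = delete
    "ts_ms": 1747821600000,         # when the change was captured
    "source": {"table": "orders"},
    "before": {"id": 42, "customer_id": "cust_123", "amount": 99.0},
    "after":  {"id": 42, "customer_id": "cust_123", "amount": 120.0},
}

def route(event: dict) -> tuple:
    """Pick the table and entity key a change event should be routed by."""
    table = event["source"]["table"]
    row = event["before"] if event["op"] == "d" else event["after"]
    return table, row["customer_id"]
```

Feature pipelines typically key on an entity ID extracted this way, so all events for the same customer land in the same computation state.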
Stream Processing for Feature Computation
Feature computation on a stream looks different from batch SQL. Instead of `SELECT AVG(amount) FROM orders WHERE created_at > NOW() - INTERVAL '30 days'`, you maintain running state:
```python
# Pseudocode: streaming feature computation
class OrderFeatures:
    def __init__(self, customer_id: str):
        self.customer_id = customer_id
        self.order_count_30d = 0
        self.order_total_30d = 0.0
        self.orders_window: list = []  # (timestamp, amount) pairs

    def process_event(self, event: dict):
        now = event["ts_ms"]
        if event["op"] in ("c", "u"):  # create or update
            order = event["after"]
            self.orders_window.append((now, order["amount"]))

        # Expire events older than 30 days
        cutoff = now - (30 * 24 * 60 * 60 * 1000)  # 30 days in ms
        self.orders_window = [
            (ts, amt) for ts, amt in self.orders_window if ts > cutoff
        ]

        # Recompute features
        self.order_count_30d = len(self.orders_window)
        self.order_total_30d = sum(amt for _, amt in self.orders_window)

    @property
    def avg_order_value_30d(self) -> float:
        if self.order_count_30d == 0:
            return 0.0
        return self.order_total_30d / self.order_count_30d

    def to_feature_vector(self) -> dict:
        return {
            "customer_id": self.customer_id,
            "order_count_30d": self.order_count_30d,
            "order_total_30d": self.order_total_30d,
            "avg_order_value_30d": self.avg_order_value_30d,
        }
```
This is simplified — production implementations use state backends that handle checkpointing, exactly-once semantics, and efficient windowed aggregations. Tools like Streaming Agents handle this natively.
Online Store: Optimized for Serving
The online store needs to answer one question fast: “What are the current feature values for entity X?”
```python
import time

# Writing features to Redis
def update_online_store(features: dict, redis_client):
    key = f"features:customer:{features['customer_id']}"
    redis_client.hset(key, mapping={
        "order_count_30d": features["order_count_30d"],
        "order_total_30d": features["order_total_30d"],
        "avg_order_value_30d": features["avg_order_value_30d"],
        "updated_at": int(time.time() * 1000),
    })

# Reading features at inference time
def get_features(customer_id: str, redis_client) -> dict:
    key = f"features:customer:{customer_id}"
    return redis_client.hgetall(key)
```
Redis lookups take 1-5ms. DynamoDB takes 5-15ms. Both are fast enough for real-time inference, which typically needs features in under 50ms.
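One serving detail worth planning for: a brand-new customer has no feature row yet, so reads need defaults. A minimal sketch, with an in-memory stand-in for the Redis client so it runs without a server:

```python
FEATURE_DEFAULTS = {
    "order_count_30d": 0,
    "order_total_30d": 0.0,
    "avg_order_value_30d": 0.0,
}

def get_features_with_defaults(customer_id: str, redis_client) -> dict:
    """Read features for one customer, falling back to defaults on cold start."""
    stored = redis_client.hgetall(f"features:customer:{customer_id}")  # {} if absent
    features = dict(FEATURE_DEFAULTS)
    for name, default in FEATURE_DEFAULTS.items():
        if name in stored:
            features[name] = type(default)(stored[name])  # hash values come back as strings
    return features

class FakeRedis:
    """In-memory stand-in for redis.Redis so the sketch runs anywhere."""
    def __init__(self):
        self._data = {}

    def hset(self, key, mapping):
        self._data.setdefault(key, {}).update(mapping)

    def hgetall(self, key):
        return self._data.get(key, {})

client = FakeRedis()
client.hset("features:customer:cust_123",
            mapping={"order_count_30d": "5", "order_total_30d": "500.0",
                     "avg_order_value_30d": "100.0"})
```

Defaults should be values the model saw during training for new entities, not arbitrary zeros, or the cold-start path itself becomes a source of skew.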
Offline Store: Optimized for Training
The offline store keeps the full history of feature values, timestamped for point-in-time correctness:
| customer_id | feature_name | feature_value | event_time |
|---|---|---|---|
| cust_123 | order_count_30d | 5 | 2025-05-20T10:00:00Z |
| cust_123 | order_count_30d | 6 | 2025-05-20T14:30:00Z |
| cust_123 | order_count_30d | 6 | 2025-05-21T09:15:00Z |
For training, you join features to labels using point-in-time lookups: “what were the feature values for this customer at the moment they placed this order?” This prevents data leakage (using future information to predict past events).
Point-in-Time Correctness
This is where streaming feature stores earn their keep. Point-in-time correctness means: when you train a model, each training example uses only the feature values that were available at the time the event occurred.
Without point-in-time correctness:
```sql
-- WRONG: Uses current feature values for historical training examples
SELECT orders.*, features.*
FROM orders
JOIN customer_features features ON orders.customer_id = features.customer_id
```
With point-in-time correctness:
```sql
-- CORRECT: Uses feature values as they existed at order time
SELECT orders.*, features.*
FROM orders
JOIN customer_features_history features
  ON orders.customer_id = features.customer_id
  AND features.event_time <= orders.created_at
  AND features.event_time = (
    SELECT MAX(event_time) FROM customer_features_history
    WHERE customer_id = orders.customer_id
      AND event_time <= orders.created_at
  )
```
A streaming feature store naturally produces this history because it writes a new feature value on every CDC event. The offline store accumulates the full time series, making point-in-time joins straightforward.
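The point-in-time lookup itself is just "latest value at or before a timestamp." A stdlib sketch over the history rows shown earlier (ISO 8601 strings sort chronologically, so string comparison works here):

```python
from bisect import bisect_right

# Feature history for cust_123, sorted by event_time
history = [
    ("2025-05-20T10:00:00Z", 5),
    ("2025-05-20T14:30:00Z", 6),
    ("2025-05-21T09:15:00Z", 6),
]

def value_as_of(history, ts: str):
    """Latest feature value with event_time <= ts; None if ts predates history."""
    times = [t for t, _ in history]
    i = bisect_right(times, ts)
    return history[i - 1][1] if i > 0 else None
```

At scale the same join is what pandas' `merge_asof` or a feature store's point-in-time API performs across millions of rows.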
Feature Freshness SLAs
Different features need different freshness guarantees:
| Feature | Freshness SLA | Computation |
|---|---|---|
| Account balance | < 1 second | Streaming (direct CDC) |
| Transaction velocity (count/hour) | < 5 seconds | Streaming (windowed aggregation) |
| Cart abandonment risk | < 30 seconds | Streaming (session window) |
| Customer lifetime value | < 1 hour | Micro-batch (15-minute windows) |
| Purchase frequency (90-day) | < 24 hours | Batch is acceptable |
Not every feature needs streaming. The cost of maintaining streaming state for a slowly-changing feature isn’t worth it. Use streaming for features where freshness affects model decisions and batch for everything else.
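SLAs are only useful if something checks them. One option is to compare the `updated_at` timestamp written with each feature row against a per-feature budget at read time; the feature names and budgets below are illustrative:

```python
import time

# Illustrative per-feature freshness budgets, in milliseconds
FRESHNESS_SLA_MS = {
    "account_balance": 1_000,
    "transaction_velocity_1h": 5_000,
    "purchase_frequency_90d": 24 * 60 * 60 * 1_000,
}

def is_fresh(feature_name: str, updated_at_ms: int, now_ms=None) -> bool:
    """True when the stored value is within its freshness budget."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return (now_ms - updated_at_ms) <= FRESHNESS_SLA_MS[feature_name]
```

What to do on a miss is a product decision: fall back to a default, serve the stale value with a logged warning, or reject the inference request.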
Multi-Source Feature Computation
Real features often combine data from multiple tables or systems:
```
orders table (CDC) ───────────┐
                              ├──→ Feature Computation ──→ Online Store
customer_events table (CDC) ──┤
                              │
product_catalog table (CDC) ──┘
```
A “purchase propensity” feature might combine:
- Order history from the `orders` table
- Browse behavior from `customer_events`
- Product category affinities from browsing patterns joined with `product_catalog`
Stream processing handles this with stream-to-stream joins and lookup joins. The computation maintains state for each stream and produces updated features when any input changes.
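As a toy illustration of that pattern, here is per-customer state fed by two streams, where an event on either input triggers a recompute (the propensity formula is invented for the example):

```python
from collections import defaultdict

# Per-customer state merged from two CDC streams; field names are invented.
state = defaultdict(lambda: {"order_total_30d": 0.0, "pages_viewed_1h": 0})

def compute_propensity(customer_id: str) -> float:
    # Toy score: browsing activity weighted by recent spend
    s = state[customer_id]
    return s["pages_viewed_1h"] * (1.0 + s["order_total_30d"] / 100.0)

def on_order_event(customer_id: str, amount: float) -> float:
    state[customer_id]["order_total_30d"] += amount
    return compute_propensity(customer_id)

def on_browse_event(customer_id: str, pages: int) -> float:
    state[customer_id]["pages_viewed_1h"] += pages
    return compute_propensity(customer_id)
```

A real stream processor adds what this sketch omits: partitioned state, window expiry, and checkpointing so the state survives restarts.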
Feature Store Implementation Options
Build vs Buy
| Approach | Pros | Cons |
|---|---|---|
| Feast (open source) | Free, flexible, large community | You manage infrastructure, limited streaming support |
| Tecton (managed) | Strong streaming, point-in-time correct | Cost, vendor lock-in |
| Custom (Redis + stream processor) | Full control, no extra dependencies | You build everything yourself |
| Hopsworks (open source + managed) | Good ML integration | Smaller community |
For teams starting out, a custom approach using Redis for online serving and your data lake for offline storage is often simpler than adopting a full feature store framework. You can add a framework later when feature management (discovery, lineage, access control) becomes a pain point.
Minimal Streaming Feature Store
A minimal implementation needs:
- CDC capture — Stream changes from source databases
- Stream processor — Compute features from events (Streaming Agents, custom consumer)
- Online store — Redis or DynamoDB for serving
- Offline store — S3/data lake with timestamped feature values
- Feature registry — Even a simple YAML file listing feature names, types, and sources
```yaml
# feature_registry.yaml
features:
  - name: order_count_30d
    entity: customer
    type: int
    source: orders
    computation: streaming
    freshness_sla: 5s
  - name: avg_order_value_30d
    entity: customer
    type: float
    source: orders
    computation: streaming
    freshness_sla: 5s
  - name: customer_segment
    entity: customer
    type: string
    source: customer_profiles
    computation: batch
    freshness_sla: 24h
```
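A registry earns its keep when something validates it. A sketch that checks entries (shown here as already-parsed dicts; in practice you would load `feature_registry.yaml` with a YAML parser) before deploying a pipeline:

```python
REQUIRED_FIELDS = ("name", "entity", "type", "source", "computation", "freshness_sla")
VALID_COMPUTATION = {"streaming", "batch"}
VALID_TYPES = {"int", "float", "string"}

def validate_feature(entry: dict) -> list:
    """Return a list of problems with a registry entry; empty means valid."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in entry]
    if entry.get("computation") not in VALID_COMPUTATION:
        errors.append(f"unknown computation: {entry.get('computation')}")
    if entry.get("type") not in VALID_TYPES:
        errors.append(f"unknown type: {entry.get('type')}")
    return errors

entry = {"name": "order_count_30d", "entity": "customer", "type": "int",
         "source": "orders", "computation": "streaming", "freshness_sla": "5s"}
```

Running this in CI keeps the registry honest as teams add features.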
Common Pitfalls
1. Computing Everything in Streaming
Streaming adds complexity. If a feature changes once a day, batch is simpler and cheaper. Reserve streaming for features where staleness directly impacts model performance.
2. Ignoring Feature Monitoring
Features drift. A transaction_count_1h feature that averaged 5 last month now averages 50 because of a promotional event. Without monitoring, your model sees unexpected input distributions and produces unreliable outputs.
Track feature distributions over time and alert on statistical shifts. This connects to broader data observability practices.
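Even a crude statistical check catches drift like the promotion example above. A sketch that measures how far the current window's mean has moved, in baseline standard deviations (the sample values are invented):

```python
import statistics

def drift_zscore(baseline: list, current: list) -> float:
    """How many baseline standard deviations the current mean has shifted."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline) or 1e-9  # guard a zero-variance baseline
    return abs(statistics.mean(current) - mu) / sigma

# Invented samples of transaction_count_1h: quiet month vs promotional spike
baseline = [4, 5, 6, 5, 5, 4, 6]
current = [48, 52, 50, 55, 47]
```

Production systems usually apply distribution-level tests (PSI, Kolmogorov-Smirnov) rather than a mean shift, but the alerting wiring is the same.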
3. Tight Coupling to One Model
Design features to be reusable. order_count_30d is useful for fraud detection, churn prediction, and personalization. If you compute it once in the feature store, all three models benefit.
4. Skipping the Offline Store
Without historical feature values, you can’t retrain models with point-in-time correctness. You’ll train on current feature values applied to historical labels, introducing data leakage that inflates your offline metrics and disappoints in production.
When to Invest in Streaming Features
Start with streaming feature computation when:
- Model accuracy is freshness-sensitive — Your fraud model misses more fraud as feature staleness increases
- Batch job timing is a constraint — The 3 AM feature job occasionally runs long and delays the serving update
- Multiple models share features — Computing once and sharing beats each team running their own batch jobs
- You already have CDC — If you’re streaming database changes for other purposes (data replication, analytics), adding feature computation is incremental
Making the Transition from Batch to Streaming Features
Don’t rip out batch overnight. Run streaming features in shadow mode alongside batch: compute both, serve batch, compare the values. When streaming values match batch values (accounting for freshness differences), switch serving to streaming.
This lets you validate correctness before your models depend on the streaming path. It also gives you a fallback — if the streaming pipeline has issues, switch back to batch while you fix it.
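The comparison step can be as simple as a relative-tolerance check per feature, logged for every shadow read (the 5% tolerance is an arbitrary starting point, chosen to absorb expected freshness differences):

```python
def shadow_compare(batch_value: float, streaming_value: float,
                   rel_tol: float = 0.05) -> bool:
    """True when the streaming value is within rel_tol of the batch value."""
    if batch_value == 0:
        return abs(streaming_value) < rel_tol  # fall back to an absolute check
    return abs(streaming_value - batch_value) / abs(batch_value) <= rel_tol
```

Track the mismatch rate per feature over a week or two; a feature that converges for most entities but not all usually points to a windowing or timezone discrepancy between the two pipelines.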
Ready to compute ML features from streaming data? Streamkap captures CDC events and delivers them to your feature computation layer with sub-second latency, keeping your online feature store current. Start a free trial or learn more about real-time feature computation.