
AI & Agents

May 22, 2025


The Streaming Feature Store Pattern: Real-Time ML Features from CDC

Learn how streaming feature stores eliminate training-serving skew by computing ML features from CDC events in real time instead of batch jobs.

A feature store sits between your raw data and your ML models. Its job: pre-compute features (derived values like “customer lifetime value” or “average order frequency in last 30 days”) and serve them fast enough for real-time inference.

Most feature stores run on batch. A nightly job reads from the warehouse, computes features, and writes them to a serving layer. This works until it doesn’t — and it stops working the moment your models need features that reflect what’s happening now, not what happened 12 hours ago.

Streaming feature stores replace those batch jobs with continuous computation on CDC event streams. Every database change triggers a feature recomputation, and the serving layer updates within seconds.

What a Feature Store Actually Does

Strip away the marketing and a feature store has three responsibilities:

  1. Compute — Transform raw data into features (aggregations, ratios, lookups, time-windowed calculations)
  2. Store — Keep feature values in a format optimized for the access pattern (online serving vs offline training)
  3. Serve — Return feature values at inference time with low latency and at training time with point-in-time correctness

Raw Data → Feature Computation → Feature Store → Model Serving
                                      └──→ Model Training

The computation step is where batch vs streaming matters most.
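These three responsibilities fit in a few lines of code. Here is a toy in-memory sketch (names like `InMemoryFeatureStore` are illustrative, not any framework's API):

```python
# A toy feature store illustrating the three responsibilities.
# All names here are illustrative, not any specific framework's API.

class InMemoryFeatureStore:
    def __init__(self):
        self.online = {}    # entity_id -> latest feature dict
        self.offline = []   # full history: (entity_id, event_time, features)

    # 1. Compute: turn a raw record into named feature values
    def compute(self, order: dict) -> dict:
        return {"last_order_amount": order["amount"]}

    # 2. Store: write to both the online and offline layers
    def write(self, entity_id: str, event_time: int, features: dict):
        self.online[entity_id] = features
        self.offline.append((entity_id, event_time, features))

    # 3. Serve: low-latency read at inference time
    def get_online_features(self, entity_id: str) -> dict:
        return self.online.get(entity_id, {})
```

A real implementation swaps the dict for Redis and the list for a data lake table, but the shape of the interface stays the same.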

The Batch Feature Store Problem

Here’s a typical batch feature pipeline:

1. Warehouse ETL runs at 2:00 AM
2. Feature computation job starts at 3:00 AM
3. Features written to online store by 4:00 AM
4. Model uses features that are 4-26 hours stale

This creates two problems:

Problem 1: Staleness

A fraud detection model using “number of transactions in last hour” computed from batch data is always looking at yesterday’s transactions. A customer who made 50 transactions in the last 30 minutes looks identical to one who made 2.

Features that lose value with staleness:

  • Transaction velocity (count per time window)
  • Account balance
  • Inventory levels
  • Session activity (pages viewed, actions taken)
  • Support ticket status
  • Cart contents

Features that tolerate staleness:

  • Customer lifetime value (changes slowly)
  • Industry classification
  • Geographic region
  • Historical averages over long windows (90-day purchase frequency)

Problem 2: Training-Serving Skew

Training-serving skew is the silent model killer. It happens when the features your model trains on don’t match the features it sees during inference.

Example: During training, you compute avg_order_value_30d with a precise SQL window function over historical data. During serving, the batch job computes the same metric but uses a different time boundary (start of batch run vs exact 30 days before the event). The values are close but not identical, and the model’s accuracy degrades by 2-5% without any code change.

Streaming feature stores reduce this gap because the same computation logic runs continuously on the same data source. The online value and the offline value come from the same pipeline, just at different points in time.
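The boundary mismatch is easy to reproduce. A small sketch with made-up order data, computing the same 30-day average two ways:

```python
# Sketch: how a time-boundary mismatch produces training-serving skew.
# Order data and timestamps (in days) are made up for illustration.
orders = [(1.2, 100.0), (15.0, 200.0), (31.0, 300.0)]  # (time, amount)

def avg_30d(orders, as_of, window_start):
    """Average order value in [window_start, as_of]."""
    amounts = [amt for ts, amt in orders if window_start <= ts <= as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0

as_of = 31.5  # the event time, in days

# Training: exact 30-day window ending at the event
train_value = avg_30d(orders, as_of, as_of - 30.0)

# Serving: batch job anchored the window at the start of day 31
serve_value = avg_30d(orders, as_of, 31.0 - 30.0)

# The order at day 1.2 falls inside one window but not the other,
# so the model trains on one value and serves on another.
```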

Streaming Feature Store Architecture

Source Databases
    ↓  (CDC)
Event Stream (Kafka)
    ├──→ Stream Processor ──→ Online Store (Redis/DynamoDB)
    │    (real-time features)         ↓
    │                           Model Serving
    │
    └──→ Event Log ──→ Offline Store (S3/Data Lake)
         (raw events)         ↓
                        Model Training

CDC as the Foundation

CDC gives you a reliable, ordered stream of every data change. This is the input to your feature computation. Instead of querying the database on a schedule, you react to changes as they happen.

For a deeper look at connecting CDC to downstream systems, see CDC destination patterns.
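The examples below consume events in the Debezium change-event shape (`op`, `before`, `after`, `ts_ms`). A simplified example for an order insert (real connectors wrap this in schema and source metadata):

```python
# A Debezium-style change event for an order insert, simplified for
# illustration; real connectors add schema and source metadata.
order_created = {
    "op": "c",               # c=create, u=update, d=delete
    "ts_ms": 1747900800000,  # when the change was captured
    "before": None,          # row state before the change (None for inserts)
    "after": {               # row state after the change
        "order_id": "ord_789",
        "customer_id": "cust_123",
        "amount": 59.99,
    },
}
```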

Stream Processing for Feature Computation

Feature computation on a stream looks different from batch SQL. Instead of SELECT AVG(amount) FROM orders WHERE created_at > NOW() - INTERVAL '30 days', you maintain running state:

# Pseudocode: streaming feature computation
class OrderFeatures:
    def __init__(self, customer_id: str):
        self.customer_id = customer_id
        self.order_count_30d = 0
        self.order_total_30d = 0.0
        self.orders_window: list = []  # (timestamp, amount) pairs

    def process_event(self, event: dict):
        now = event["ts_ms"]

        if event["op"] == "c":  # create
            order = event["after"]
            self.orders_window.append((now, order["amount"]))
        # Updates and deletes would adjust existing window entries;
        # omitted here to keep the sketch short

        # Expire events older than 30 days
        cutoff = now - (30 * 24 * 60 * 60 * 1000)  # 30 days in ms
        self.orders_window = [
            (ts, amt) for ts, amt in self.orders_window if ts > cutoff
        ]

        # Recompute features
        self.order_count_30d = len(self.orders_window)
        self.order_total_30d = sum(amt for _, amt in self.orders_window)

    @property
    def avg_order_value_30d(self) -> float:
        if self.order_count_30d == 0:
            return 0.0
        return self.order_total_30d / self.order_count_30d

    def to_feature_vector(self) -> dict:
        return {
            "customer_id": self.customer_id,
            "order_count_30d": self.order_count_30d,
            "order_total_30d": self.order_total_30d,
            "avg_order_value_30d": self.avg_order_value_30d,
        }

This is simplified — production implementations use state backends that handle checkpointing, exactly-once semantics, and efficient windowed aggregations. Tools like Streaming Agents handle this natively.
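One of those efficiencies is worth sketching: if events arrive in roughly timestamp order, a deque supports cheap expiry from the left instead of rebuilding the whole window list on every event. A minimal sketch (class and names are illustrative):

```python
from collections import deque

class SlidingWindowSum:
    """Running count/sum over a time window, assuming near-ordered events."""

    def __init__(self, window_ms: int):
        self.window_ms = window_ms
        self.events = deque()  # (ts_ms, amount), oldest on the left
        self.total = 0.0

    def add(self, ts_ms: int, amount: float):
        self.events.append((ts_ms, amount))
        self.total += amount
        # Expire from the left: cost proportional to the number of
        # expired entries, not to the window size
        cutoff = ts_ms - self.window_ms
        while self.events and self.events[0][0] <= cutoff:
            _, old = self.events.popleft()
            self.total -= old

    @property
    def count(self) -> int:
        return len(self.events)
```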

Online Store: Optimized for Serving

The online store needs to answer one question fast: “What are the current feature values for entity X?”

import time

# Writing features to Redis
def update_online_store(features: dict, redis_client):
    key = f"features:customer:{features['customer_id']}"
    redis_client.hset(key, mapping={
        "order_count_30d": features["order_count_30d"],
        "order_total_30d": features["order_total_30d"],
        "avg_order_value_30d": features["avg_order_value_30d"],
        "updated_at": int(time.time() * 1000),
    })

# Reading features at inference time
def get_features(customer_id: str, redis_client) -> dict:
    key = f"features:customer:{customer_id}"
    return redis_client.hgetall(key)

Redis lookups take 1-5ms. DynamoDB takes 5-15ms. Both are fast enough for real-time inference, which typically needs features in under 50ms.
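One pattern worth adding at read time is graceful degradation: serve defaults when features are missing or stale rather than failing the inference call. A sketch over the hash returned by the reader above (the defaults and the 60-second threshold are illustrative):

```python
import time

# Defaults served when features are missing or stale (values illustrative)
DEFAULTS = {"order_count_30d": 0, "order_total_30d": 0.0,
            "avg_order_value_30d": 0.0}
MAX_AGE_MS = 60_000  # tolerate up to 60 seconds of staleness

def get_features_safe(stored: dict, now_ms=None) -> dict:
    """Return stored features, or defaults if missing or stale.

    `stored` is the hash read from the online store ({} if absent).
    """
    now_ms = now_ms or int(time.time() * 1000)
    if not stored:
        return dict(DEFAULTS)
    if now_ms - int(stored.get("updated_at", 0)) > MAX_AGE_MS:
        return dict(DEFAULTS)
    return {k: stored[k] for k in DEFAULTS}
```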

Offline Store: Optimized for Training

The offline store keeps the full history of feature values, timestamped for point-in-time correctness:

customer_id | feature_name    | feature_value | event_time
------------|-----------------|---------------|---------------------
cust_123    | order_count_30d | 5             | 2025-05-20T10:00:00Z
cust_123    | order_count_30d | 6             | 2025-05-20T14:30:00Z
cust_123    | order_count_30d | 6             | 2025-05-21T09:15:00Z

For training, you join features to labels using point-in-time lookups: “what were the feature values for this customer at the moment they placed this order?” This prevents data leakage (using future information to predict past events).

Point-in-Time Correctness

This is where streaming feature stores earn their keep. Point-in-time correctness means: when you train a model, each training example uses only the feature values that were available at the time the event occurred.

Without point-in-time correctness:

-- WRONG: Uses current feature values for historical training examples
SELECT orders.*, features.*
FROM orders
JOIN customer_features features ON orders.customer_id = features.customer_id

With point-in-time correctness:

-- CORRECT: Uses feature values as they existed at order time
SELECT orders.*, features.*
FROM orders
JOIN customer_features_history features
  ON orders.customer_id = features.customer_id
  AND features.event_time <= orders.created_at
  AND features.event_time = (
    SELECT MAX(event_time) FROM customer_features_history
    WHERE customer_id = orders.customer_id
    AND event_time <= orders.created_at
  )

A streaming feature store naturally produces this history because it writes a new feature value on every CDC event. The offline store accumulates the full time series, making point-in-time joins straightforward.
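In application code, that point-in-time lookup is a binary search over one customer's sorted history. A minimal sketch (the data mirrors the table above; ISO-8601 strings in one timezone compare correctly as strings):

```python
from bisect import bisect_right

def point_in_time_value(history, as_of):
    """Latest value at or before `as_of`; None if no value existed yet.

    `history` is a list of (event_time, value) sorted by event_time.
    """
    times = [t for t, _ in history]
    i = bisect_right(times, as_of)
    return history[i - 1][1] if i else None

# Feature history for one customer, as in the table above
history = [
    ("2025-05-20T10:00:00Z", 5),
    ("2025-05-20T14:30:00Z", 6),
    ("2025-05-21T09:15:00Z", 6),
]
```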

Feature Freshness SLAs

Different features need different freshness guarantees:

Feature                           | Freshness SLA | Computation
----------------------------------|---------------|---------------------------------
Account balance                   | < 1 second    | Streaming (direct CDC)
Transaction velocity (count/hour) | < 5 seconds   | Streaming (windowed aggregation)
Cart abandonment risk             | < 30 seconds  | Streaming (session window)
Customer lifetime value           | < 1 hour      | Micro-batch (15-minute windows)
Purchase frequency (90-day)       | < 24 hours    | Batch is acceptable

Not every feature needs streaming. The cost of maintaining streaming state for a slowly-changing feature isn’t worth it. Use streaming for features where freshness affects model decisions and batch for everything else.

Multi-Source Feature Computation

Real features often combine data from multiple tables or systems:

orders table (CDC) ───────────┐
                              ├──→ Feature Computation ──→ Online Store
customer_events table (CDC) ──┤
product_catalog table (CDC) ──┘

A “purchase propensity” feature might combine:

  • Order history from the orders table
  • Browse behavior from customer_events
  • Product category affinities from browsing patterns joined with product_catalog

Stream processing handles this with stream-to-stream joins and lookup joins. The computation maintains state for each stream and produces updated features when any input changes.
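A minimal sketch of that pattern: one state object per customer, updated by whichever stream delivers a change (stream and field names are illustrative; real joins run inside a stateful stream processor):

```python
# Sketch: per-customer state fed by two CDC streams. Field and stream
# names are illustrative, not tied to any specific schema.
class PurchasePropensityState:
    def __init__(self):
        self.order_count = 0
        self.pages_viewed = 0

    def on_order_event(self, event: dict):
        if event["op"] == "c":  # new order
            self.order_count += 1

    def on_customer_event(self, event: dict):
        if event["after"].get("type") == "page_view":
            self.pages_viewed += 1

    def features(self) -> dict:
        # Recomputed whenever *either* stream delivers a change
        return {
            "order_count": self.order_count,
            "pages_viewed": self.pages_viewed,
            "views_per_order": self.pages_viewed / max(self.order_count, 1),
        }
```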

Feature Store Implementation Options

Build vs Buy

Approach                          | Pros                                     | Cons
----------------------------------|------------------------------------------|-----------------------------------------------------
Feast (open source)               | Free, flexible, large community          | You manage infrastructure, limited streaming support
Tecton (managed)                  | Strong streaming, point-in-time correct  | Cost, vendor lock-in
Custom (Redis + stream processor) | Full control, no extra dependencies      | You build everything yourself
Hopsworks (open source + managed) | Good ML integration                      | Smaller community

For teams starting out, a custom approach using Redis for online serving and your data lake for offline storage is often simpler than adopting a full feature store framework. You can add a framework later when feature management (discovery, lineage, access control) becomes a pain point.

Minimal Streaming Feature Store

A minimal implementation needs:

  1. CDC capture — Stream changes from source databases
  2. Stream processor — Compute features from events (Streaming Agents, custom consumer)
  3. Online store — Redis or DynamoDB for serving
  4. Offline store — S3/data lake with timestamped feature values
  5. Feature registry — Even a simple YAML file listing feature names, types, and sources

# feature_registry.yaml
features:
  - name: order_count_30d
    entity: customer
    type: int
    source: orders
    computation: streaming
    freshness_sla: 5s

  - name: avg_order_value_30d
    entity: customer
    type: float
    source: orders
    computation: streaming
    freshness_sla: 5s

  - name: customer_segment
    entity: customer
    type: string
    source: customer_profiles
    computation: batch
    freshness_sla: 24h
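Once the registry is loaded (with PyYAML, for instance), a few lines of validation catch typos before they reach the pipeline. A sketch over already-parsed entries:

```python
# Sketch: validating parsed registry entries. The required keys match
# the feature_registry.yaml example above.
VALID_COMPUTATIONS = {"streaming", "batch"}
REQUIRED_KEYS = {"name", "entity", "type", "source",
                 "computation", "freshness_sla"}

def validate_feature(entry: dict) -> list:
    """Return a list of problems with one registry entry (empty if valid)."""
    problems = []
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if entry.get("computation") not in VALID_COMPUTATIONS:
        problems.append(f"unknown computation: {entry.get('computation')!r}")
    return problems
```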

Common Pitfalls

1. Computing Everything in Streaming

Streaming adds complexity. If a feature changes once a day, batch is simpler and cheaper. Reserve streaming for features where staleness directly impacts model performance.

2. Ignoring Feature Monitoring

Features drift. A transaction_count_1h feature that averaged 5 last month now averages 50 because of a promotional event. Without monitoring, your model sees unexpected input distributions and produces unreliable outputs.

Track feature distributions over time and alert on statistical shifts. This connects to broader data observability practices.
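A deliberately simple sketch of such an alert, comparing the current window's mean against a reference window (production monitoring usually compares full distributions, e.g. population stability index or a KS test):

```python
import statistics

def drift_alert(reference: list, current: list,
                z_threshold: float = 3.0) -> bool:
    """Alert when the current window's mean drifts from the reference.

    A crude heuristic for illustration; the threshold is arbitrary.
    """
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    if ref_std == 0:
        return statistics.mean(current) != ref_mean
    z = abs(statistics.mean(current) - ref_mean) / ref_std
    return z > z_threshold
```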

3. Tight Coupling to One Model

Design features to be reusable. order_count_30d is useful for fraud detection, churn prediction, and personalization. If you compute it once in the feature store, all three models benefit.

4. Skipping the Offline Store

Without historical feature values, you can’t retrain models with point-in-time correctness. You’ll train on current feature values applied to historical labels, introducing data leakage that inflates your offline metrics and disappoints in production.

When to Invest in Streaming Features

Start with streaming feature computation when:

  • Model accuracy is freshness-sensitive — Your fraud model misses more fraud as feature staleness increases
  • Batch job timing is a constraint — The 3 AM feature job occasionally runs long and delays the serving update
  • Multiple models share features — Computing once and sharing beats each team running their own batch jobs
  • You already have CDC — If you’re streaming database changes for other purposes (data replication, analytics), adding feature computation is incremental

Making the Transition from Batch to Streaming Features

Don’t rip out batch overnight. Run streaming features in shadow mode alongside batch: compute both, serve batch, compare the values. When streaming values match batch values (accounting for freshness differences), switch serving to streaming.

This lets you validate correctness before your models depend on the streaming path. It also gives you a fallback — if the streaming pipeline has issues, switch back to batch while you fix it.
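The shadow-mode comparison itself is a small function. A sketch with an illustrative 2% relative tolerance:

```python
# Sketch: shadow-mode comparison of batch vs streaming feature values.
# The tolerance is illustrative; tune it per feature.
def shadow_compare(batch: dict, streaming: dict,
                   rel_tolerance: float = 0.02) -> dict:
    """Return features whose streaming value strays beyond tolerance."""
    mismatches = {}
    for name, batch_value in batch.items():
        stream_value = streaming.get(name)
        if stream_value is None:
            mismatches[name] = "missing in streaming"
            continue
        denom = max(abs(batch_value), 1e-9)  # avoid division by zero
        if abs(stream_value - batch_value) / denom > rel_tolerance:
            mismatches[name] = f"batch={batch_value}, streaming={stream_value}"
    return mismatches
```

Log the mismatches per feature over a few days; a feature that never diverges beyond tolerance is a candidate to switch to the streaming path.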


Ready to compute ML features from streaming data? Streamkap captures CDC events and delivers them to your feature computation layer with sub-second latency, keeping your online feature store current. Start a free trial or learn more about real-time feature computation.