CDC for ML Feature Pipelines: Real-Time Feature Engineering from Database Changes
Learn how Change Data Capture powers real-time ML feature pipelines. Build fresh feature stores, reduce training-serving skew, and improve model performance with streaming data.
Your fraud detection model worked perfectly in testing. It caught 97% of fraudulent transactions during evaluation—a number that made everyone on the data science team feel quietly confident. You deployed to production, celebrated, and waited for the results to roll in.
Three weeks later, the model is missing obvious fraud patterns. Chargebacks are climbing. The model’s precision has dropped below 80%, and nobody changed a single line of code. So what happened?
The data feeding the model went stale. Your batch pipeline updates features every six hours, so the model’s view of the world always lags behind reality. A fraudster racks up twelve transactions in twenty minutes, but the model still thinks the customer’s recent transaction count is zero—because the last batch ran at midnight. This is the feature freshness problem, and it is one of the most common reasons ML models underperform in production. The good news? Change Data Capture (CDC) solves it cleanly, and this guide shows you exactly how.
What Are ML Feature Pipelines?
In machine learning, a feature is an input variable that a model uses to make predictions. Features are the model’s window into reality—processed, structured signals derived from raw data. Here are a few examples:
- “Number of transactions in the last 24 hours” — computed from a `transactions` table
- “Average order value over the past 30 days” — computed from an `orders` table
- “Days since last login” — computed from a `users` table
- “Ratio of failed to successful payments this week” — computed by joining the `payments` and `transactions` tables
None of these features exist directly in a database column. They are all derived—computed from raw data through aggregation, transformation, or combination. The system that performs this computation is the feature pipeline.
A feature pipeline connects raw operational data (in your production databases) to the model that needs to consume it, following three stages:
- Ingestion: Pull raw data from source systems (databases, event streams, APIs)
- Transformation: Compute the actual features—aggregations, joins, encodings, normalization
- Storage and Serving: Write the computed features to a feature store where models can access them during training and inference
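The three stages can be sketched in a few lines of plain Python. This is illustrative only: the rows, the `txn_count_24h` feature name, and the dict standing in for a feature store are all hypothetical, not a real API.

```python
from datetime import datetime, timedelta

# Ingestion stage, stubbed: raw rows as they might arrive from a transactions table.
raw_transactions = [
    {"customer_id": "c1", "amount": 40.0, "ts": datetime(2024, 1, 15, 9, 0)},
    {"customer_id": "c1", "amount": 60.0, "ts": datetime(2024, 1, 15, 21, 30)},
    {"customer_id": "c2", "amount": 15.0, "ts": datetime(2024, 1, 14, 8, 0)},
]

def txn_count_24h(rows, customer_id, now):
    """Transformation stage: derive 'transactions in the last 24 hours'."""
    cutoff = now - timedelta(hours=24)
    return sum(1 for r in rows
               if r["customer_id"] == customer_id and r["ts"] > cutoff)

# Storage/serving stage: a dict standing in for a feature store.
feature_store = {}
now = datetime(2024, 1, 15, 22, 0)
for cid in ("c1", "c2"):
    feature_store[cid] = {"txn_count_24h": txn_count_24h(raw_transactions, cid, now)}

print(feature_store["c1"]["txn_count_24h"])  # 2: both c1 rows fall inside the window
```

The derived value never exists as a database column; it is recomputed from raw rows, which is exactly why keeping it current is the hard part.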
Simple enough in concept. Deceptively hard in practice. And the hardest part? Keeping those features fresh.
The Feature Freshness Problem
Here’s the uncomfortable truth about most ML feature pipelines in production today: they are batch processes. A cron job runs every few hours (or once a day, if we’re being honest), queries the source database, computes the features, and writes the results to a feature store. Between runs, the features are stale.
This might seem like a minor inconvenience, but it creates a cascade of problems that quietly sabotage model performance.
Training-Serving Skew
This is the big one. Training-serving skew occurs when the data a model sees during training differs from the data it sees during inference. The most common cause is temporal skew from stale features.
During training, you compute features from historical data using batch queries with access to the full timeline. Your training features are point-in-time correct. The model learns these patterns. Then you deploy to production, where the features feeding it are hours old—because the batch pipeline hasn’t run yet. The model was trained in one reality and is being served a different one.
Research from Google and other ML-heavy organizations consistently identifies training-serving skew as one of the top causes of model degradation in production. It is the default state of affairs for any team using batch feature pipelines.
Point-in-Time Correctness
A subtler issue is data leakage during training. When you compute features from historical data using batch queries, it is easy to accidentally “leak” future information into training features. If you compute “average order value over 30 days” for a training sample dated January 15, but your query pulls orders through January 31, you have given the model information it would not have had at prediction time.
CDC helps here because every event carries a precise timestamp from the database transaction log. When you replay a CDC stream to rebuild training features, you can reconstruct exactly what was known at any given point in time—no future data leakage, no artificial performance inflation.
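A minimal sketch of that point-in-time discipline, using hypothetical event data: features for a training sample only see events at or before the sample's timestamp.

```python
from datetime import datetime, timedelta

# Hypothetical CDC events, each carrying its transaction-log timestamp.
events = [
    {"order_amount": 100.0, "ts": datetime(2024, 1, 10)},
    {"order_amount": 50.0,  "ts": datetime(2024, 1, 14)},
    {"order_amount": 900.0, "ts": datetime(2024, 1, 28)},  # future w.r.t. the sample
]

def avg_order_value_30d(events, as_of):
    """Only events at or before `as_of` are visible -- no future leakage."""
    window = [e["order_amount"] for e in events
              if as_of - timedelta(days=30) <= e["ts"] <= as_of]
    return sum(window) / len(window) if window else None

# Training sample dated January 15: the Jan 28 order must not contribute.
print(avg_order_value_30d(events, datetime(2024, 1, 15)))  # 75.0
```

Without the `as_of` cutoff, the January 28 order would inflate the feature to a value the model could never have seen at prediction time.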
The Real Cost of Stale Features
Feature staleness isn’t an abstract, academic problem. It translates directly into business outcomes:
- Fraud detection: A fraud model checking features that are six hours old is essentially blind to any fraudulent activity that happened in the last six hours. Fraudsters know this and exploit it.
- Recommendations: A product recommendation model using yesterday’s browsing behavior suggests items the user has already purchased or lost interest in. Conversion rates drop.
- Risk scoring: A credit risk model using features from last night’s batch run cannot account for the five new loan applications the borrower submitted this morning. Exposure increases.
- Personalization: A pricing or content personalization model using stale customer segments delivers irrelevant experiences. Engagement declines.
In every case, the model itself is fine. The features feeding it are the problem. And the solution is to make those features real-time.
Online vs. Offline Feature Stores
Modern ML platforms use a dual-store architecture to serve features. Understanding this architecture is essential to seeing where CDC fits in.
The Offline Store
The offline feature store is used for model training. It stores historical feature values with point-in-time query support—meaning you can ask, “What was this feature’s value at 3:00 PM on January 15?” Offline stores are typically built on columnar systems like Snowflake, BigQuery, Databricks / Delta Lake, or Apache Iceberg on S3. They need to be comprehensive and correct, but they don’t need to be fast—throughput matters more than latency.
The Online Store
The online feature store is used for model inference (serving). When a model scores a prediction in real time, it looks up the latest feature values from the online store in single-digit milliseconds. Online stores are typically Redis, DynamoDB, Cassandra, or feature store platforms like Feast or Tecton. They need to be fast and fresh, but don’t need deep history.
The Consistency Challenge
You have two stores with different characteristics, but both need to reflect the same underlying truth. If the online store says a customer has made 3 transactions today, but the offline store (used for training) would have said 5, you have training-serving skew.
With CDC, you have a single source of truth: the stream of database changes. This stream can be routed to both the online store (for serving) and the offline store (for training). Both stores receive the same data, processed through the same feature computation logic, ensuring consistency with each other and with the production database.
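The fan-out can be sketched as one publish path feeding both stores; the dict and list here are stand-ins for Redis and a warehouse table, and the function name is hypothetical.

```python
# Fan-out sketch: one computed feature update goes to both stores.
online_store = {}    # stands in for Redis: latest value per key
offline_store = []   # stands in for Snowflake/Iceberg: append-only history

def publish(customer_id, feature_name, value, ts):
    online_store[(customer_id, feature_name)] = value             # overwrite latest
    offline_store.append((customer_id, feature_name, value, ts))  # keep full history

publish("c1", "txn_count_1h", 3, "2024-01-15T14:00:00Z")
publish("c1", "txn_count_1h", 4, "2024-01-15T14:05:00Z")

print(online_store[("c1", "txn_count_1h")])  # 4 (latest only, for serving)
print(len(offline_store))                    # 2 (every update, for training)
```

Because both writes come from the same computed value, the two stores cannot drift apart the way independently scheduled batch jobs can.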
How CDC Powers Real-Time Feature Pipelines
Now let’s put it all together. Here is the architecture for a CDC-powered real-time feature pipeline, from source database to model serving.
The End-to-End Flow
Source Database —> CDC (reads transaction log) —> Stream Processing (compute features in-flight) —> Feature Store (online + offline)
Let’s walk through each stage:
1. CDC Captures Every Database Change
Change Data Capture reads the transaction log of your source database—PostgreSQL’s WAL, MySQL’s binlog, MongoDB’s oplog—and emits every INSERT, UPDATE, and DELETE as a structured event with sub-second latency and minimal impact on the source. Each event includes full row data (before and after the change) and a precise timestamp, critical for point-in-time feature computation.
2. Stream Processing Computes Features In-Flight
Raw CDC events are not features. You need a computation layer that transforms changes into derived values. With Streamkap’s managed Apache Flink, you define feature computations on the CDC stream using SQL, Python, or TypeScript. These run continuously, processing each event as it arrives:
- Windowed aggregations: “Count of transactions in the last 1 hour” or “Sum of order amounts in the last 24 hours”
- Joins: Enrich a transaction event with customer profile data from another CDC stream
- Filters: Exclude test accounts, internal transactions, or flagged records
- Derived calculations: Compute ratios, differences, running averages, or flags
Because Flink is a stateful stream processor, it can maintain the running state needed for these computations (like a sliding window count) internally, without querying an external database on every event. The features are computed in real time, event by event, as the CDC stream flows through.
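To make the "stateful" part concrete, here is a toy sliding-window counter in Python; Flink manages equivalent state for you with checkpointing and watermarks, but the core idea is the same. The class and its data are illustrative, not a Flink API.

```python
from collections import deque

class SlidingCount:
    """Keeps timestamps of events in the last `window_s` seconds, per key."""
    def __init__(self, window_s):
        self.window_s = window_s
        self.state = {}  # key -> deque of event timestamps (the "running state")

    def update(self, key, ts):
        q = self.state.setdefault(key, deque())
        q.append(ts)
        while q and q[0] <= ts - self.window_s:  # expire events outside the window
            q.popleft()
        return len(q)  # the feature value as of this event

counter = SlidingCount(window_s=3600)
print(counter.update("c1", 1000))  # 1
print(counter.update("c1", 2000))  # 2
print(counter.update("c1", 5000))  # 2  (the event at t=1000 has expired)
```

Each incoming event both updates the state and emits a fresh feature value, which is the event-by-event model described above.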
3. Results Flow to Both Online and Offline Stores
The computed features are written simultaneously to your online store (for real-time model serving) and your offline store (for model training). Streamkap supports delivery to all the destinations that ML teams commonly use:
- Online: Redis, DynamoDB, Kafka (for downstream consumers)
- Offline: Snowflake, BigQuery, Databricks, ClickHouse, Apache Iceberg
Both stores receive the same computed features from the same stream, ensuring consistency.
A Concrete Example
Let’s trace a single event through this pipeline:
- A customer makes a purchase. The `transactions` table in PostgreSQL gets a new row.
- CDC captures the INSERT event from PostgreSQL’s WAL within milliseconds. The event includes the transaction amount, customer ID, timestamp, merchant category, and payment method.
- Managed Flink receives the event and computes several features:
  - `txn_count_1h`: number of transactions by this customer in the last hour (windowed count) — now updated to 4
  - `txn_avg_amount_24h`: average transaction amount over 24 hours (windowed average) — now $47.82
  - `txn_distinct_merchants_1h`: number of distinct merchants in the last hour — now 3
  - `txn_amount_vs_avg_ratio`: ratio of this transaction’s amount to the 24-hour average — 2.1 (this transaction was above average)
- These four feature values are written to Redis (keyed by customer ID) for the fraud detection model to look up at inference time.
- The same values are also written to Snowflake for the offline feature store, appended with the event timestamp for point-in-time correctness.
The entire process, from the database INSERT to updated features in both stores, happens in sub-second timeframes. When the fraud model needs to evaluate the next transaction from this customer (maybe 30 seconds later), it sees the updated feature values immediately.
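The feature math in that trace can be reproduced in a few lines. The history rows below are invented for illustration, so the resulting numbers are close to, not identical to, the figures in the trace.

```python
from datetime import datetime, timedelta

# Hypothetical recent CDC events for one customer.
history = [
    {"amount": 20.0, "merchant": "m1", "ts": datetime(2024, 1, 15, 13, 10)},
    {"amount": 35.0, "merchant": "m2", "ts": datetime(2024, 1, 15, 13, 40)},
    {"amount": 30.0, "merchant": "m1", "ts": datetime(2024, 1, 15, 13, 50)},
]
new_event = {"amount": 100.0, "merchant": "m3", "ts": datetime(2024, 1, 15, 14, 0)}

rows = history + [new_event]
now = new_event["ts"]
last_hour = [r for r in rows if r["ts"] > now - timedelta(hours=1)]
last_day = [r for r in rows if r["ts"] > now - timedelta(hours=24)]

features = {
    "txn_count_1h": len(last_hour),
    "txn_avg_amount_24h": sum(r["amount"] for r in last_day) / len(last_day),
    "txn_distinct_merchants_1h": len({r["merchant"] for r in last_hour}),
}
features["txn_amount_vs_avg_ratio"] = new_event["amount"] / features["txn_avg_amount_24h"]
print(features)  # count 4, avg 46.25, distinct merchants 3, ratio ~2.16
```

In the streaming pipeline these values are maintained incrementally rather than recomputed from scratch, but the definitions are the same.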
Feature Engineering with Stream Processing
The feature computation layer is where the real magic happens in a CDC-powered feature pipeline. Let’s dig into the most common real-time feature engineering patterns and how they work with streaming data.
Windowed Aggregations
Windowed aggregations are the bread and butter of real-time feature engineering. They compute summary statistics over time windows:
- Tumbling windows: Fixed, non-overlapping intervals. “Total transaction amount per 1-hour block.”
- Sliding windows: Overlapping intervals that update continuously. “Count of events in the last 60 minutes, re-evaluated every minute.”
- Session windows: Dynamic intervals defined by activity gaps. “Total actions per user session, where a session ends after 30 minutes of inactivity.”
| Feature Type | Example | Window |
|---|---|---|
| Count | Number of login attempts | Last 15 minutes |
| Sum | Total purchase amount | Last 24 hours |
| Average | Mean transaction value | Last 7 days |
| Min/Max | Highest single transaction | Last 1 hour |
| Distinct count | Unique IP addresses used | Last 30 minutes |
With Streamkap’s managed Flink, you define windowed aggregations using standard SQL. The platform handles state management, watermarking, and late event handling.
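As a sketch of how a tumbling window differs from a sliding one: every event is assigned to exactly one fixed bucket. The event tuples below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (timestamp_seconds, amount) events.
events = [(30, 10.0), (1800, 20.0), (3700, 5.0), (7300, 40.0)]

window_sums = defaultdict(float)
for ts, amount in events:
    window_start = (ts // 3600) * 3600  # tumbling: fixed, non-overlapping 1h blocks
    window_sums[window_start] += amount

print(dict(window_sums))  # {0: 30.0, 3600: 5.0, 7200: 40.0}
```

A sliding window would instead re-evaluate an overlapping range on every tick; a session window would close a bucket only after a gap of inactivity.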
Stream Joins and Enrichment
Some of the most powerful features come from combining data across multiple CDC streams:
- Transaction + Customer Profile: Join transactions with customer attributes (account age, loyalty tier, region) to enrich each event.
- Order + Product Catalog: Join order events with product data to compute category-level features.
- Activity + Subscription: Join user activity with subscription data to compute engagement features relative to plan tier.
Flink maintains the latest state of each stream, so when a new event arrives, it can instantly look up related data without hitting any external database.
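A latest-state join can be sketched as two event handlers sharing a keyed map: profile events upsert the state, transaction events read it. The handler names and fields are hypothetical.

```python
# Latest-state join sketch: enrich transactions with the newest customer profile.
profiles = {}  # customer_id -> latest profile seen on the customers CDC stream

def on_profile_event(event):
    profiles[event["customer_id"]] = event  # upsert the latest state

def on_transaction_event(event):
    profile = profiles.get(event["customer_id"], {})
    return {**event, "loyalty_tier": profile.get("loyalty_tier", "unknown")}

on_profile_event({"customer_id": "c1", "loyalty_tier": "gold"})
enriched = on_transaction_event({"customer_id": "c1", "amount": 42.0})
print(enriched["loyalty_tier"])  # gold
```

Because the profile state lives in the processor, the enrichment is a local lookup rather than a round trip to the source database.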
Derived and Composite Features
Stream processing also enables complex derived features that combine multiple signals:
- Velocity features: “Rate of transaction frequency increase compared to the 7-day average.”
- Ratio features: “Current session page views divided by the customer’s historical average.”
- Boolean flags: “Has this customer transacted in more than 3 countries in the last 24 hours?”
- Recency features: “Seconds since this customer’s last transaction.”
These composite features often carry the most predictive signal because they capture changes in behavior, not just static snapshots. By definition, you can’t compute “rate of change over the last 30 minutes” from a batch pipeline that runs every 6 hours.
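A velocity-style composite feature is just a ratio of a short window against a long baseline; this sketch (with an invented function name) shows the shape of the calculation.

```python
# Composite feature sketch: current 30-minute activity vs. 7-day baseline.
def velocity_ratio(count_30m, count_7d):
    """How many times faster than usual is this customer transacting?"""
    baseline_per_30m = count_7d / (7 * 24 * 2)  # 7 days = 336 half-hour slots
    if baseline_per_30m == 0:
        return None  # no history: leave the feature undefined
    return count_30m / baseline_per_30m

# 336 transactions over 7 days = 1 per half-hour on average;
# 5 in the last 30 minutes is a 5x spike.
print(velocity_ratio(5, 336))  # 5.0
```

Both inputs are themselves windowed counts, which is why this kind of feature only falls out of a pipeline that maintains multiple windows continuously.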
How Streamkap Handles Feature Computation
Streamkap’s managed Apache Flink gives ML teams three ways to define feature computations:
- SQL: Write standard SQL queries with window functions and joins. Ideal for analysts and data scientists who think in SQL.
- Python: Use Python UDFs for complex feature logic, ML-specific transformations, or integration with libraries like NumPy or Pandas.
- TypeScript: Use TypeScript for teams that prefer type-safe, programmatic transformations.
All three run on fully managed Flink infrastructure at $250/vCPU/month, with no clusters to provision, no state backends to configure, and no checkpointing to manage. You write the feature logic; Streamkap runs it.
Solving Training-Serving Skew with CDC
Training-serving skew is the single biggest reason ML teams should care about CDC for feature pipelines. It’s fundamentally a consistency problem: during training, features were computed from data at one point in time; during serving, they are computed from a different (usually staler) point in time.
In a batch pipeline, this gap is structural. The offline store gets fresh features when the training batch runs. The online store gets fresh features only when the serving batch runs. Between batches, features are stale and predictions suffer.
How CDC Closes the Gap
With CDC, both the training path and the serving path are fed from the same continuous stream of database changes. The feature computation logic is identical. The only difference is the destination: online store for serving, offline store for training.
This means:
- Serving features are always fresh. The online store is updated continuously, event by event, with sub-second latency. There is no batch window to wait for.
- Training features are point-in-time correct. The offline store receives every feature update with an accurate timestamp from the database transaction log. When you build a training dataset, you can reconstruct exactly what the model would have seen at any historical moment.
- Both stores are consistent. Because they are fed from the same stream and the same computation logic, the feature values in the online and offline stores agree with each other. What the model sees in training is what it sees in production.
Backfilling and Retraining
One of the underappreciated benefits of a CDC-based feature pipeline is the ability to replay the CDC stream to rebuild features from scratch. This is enormously valuable for retraining.
When you need to retrain a model with a new feature definition—say you want to change the window from 1 hour to 30 minutes—you can replay the stored CDC events through your updated Flink job and backfill the offline store with the new feature values. This is significantly faster and more accurate than re-running batch queries against the source database, which may have changed since the events originally occurred.
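Conceptually, a backfill is just the new feature definition applied to the stored events in timestamp order. This toy replay (hypothetical events and function) recomputes a windowed count under the new 30-minute definition.

```python
from datetime import datetime, timedelta

# Stored CDC events, replayable in timestamp order (hypothetical data).
events = [
    {"customer_id": "c1", "ts": datetime(2024, 1, 15, 13, 10)},
    {"customer_id": "c1", "ts": datetime(2024, 1, 15, 13, 45)},
    {"customer_id": "c1", "ts": datetime(2024, 1, 15, 13, 55)},
]

def backfill_counts(events, window):
    """Recompute the windowed-count feature as of each historical event."""
    out = []
    for i, e in enumerate(events):
        cutoff = e["ts"] - window
        count = sum(1 for prev in events[: i + 1]
                    if prev["customer_id"] == e["customer_id"]
                    and prev["ts"] > cutoff)
        out.append((e["ts"], count))
    return out

# Same stored events, new definition: 30-minute window instead of 1 hour.
print(backfill_counts(events, timedelta(minutes=30)))
```

Because each recomputed value only looks at events up to its own timestamp, the backfilled history stays point-in-time correct.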
Streamkap’s native support for Apache Iceberg makes this particularly powerful. Iceberg’s time travel capability lets you query the offline feature store as it existed at any point in the past, while ACID transactions ensure that backfill operations don’t corrupt the existing data. You can run a backfill alongside live feature computation without any conflict.
Measuring the Impact
Teams that migrate from batch to CDC-based feature pipelines commonly report:
- Feature freshness improvement: From hours to sub-second
- Model accuracy recovery: Models return to their evaluation-time performance because the training-serving gap is eliminated
- Reduced monitoring overhead: Fewer alerts and less manual investigation because feature distributions remain stable between training and serving
- Faster iteration: New features can be tested in production within minutes, not days
The improvement is not incremental—it’s structural. You’re not just making the batch pipeline faster; you’re replacing a fundamentally flawed architecture with one that is correct by design.
Real-World ML Feature Pipeline Patterns
Let’s get concrete. Here are four production patterns where CDC-powered feature pipelines deliver measurable impact.
Pattern 1: Real-Time Fraud Detection
Source: CDC from a transactions database (PostgreSQL or MySQL)
Features computed via Flink:
- Transaction velocity (count per 1-minute, 15-minute, and 1-hour windows)
- Transaction amount deviation from 7-day average
- Distinct merchant count in last 30 minutes
- Geographic spread (number of distinct countries in last 24 hours)
- Card-not-present ratio in last 1 hour
Online store: Redis (sub-5ms lookups for real-time scoring)
Offline store: Snowflake (historical features for model retraining)
Why CDC matters: Fraud patterns evolve in minutes, not hours. A batch pipeline that updates every 6 hours gives fraudsters a 6-hour window to exploit stale models. CDC shrinks that window to sub-second.
Pattern 2: Recommendation Engine
Source: CDC from a user_activity database and product_catalog database (MongoDB)
Features computed via Flink:
- Items viewed in last session (session window aggregation)
- Category affinity scores (weighted interactions per category over 30 days)
- Purchase frequency by category (7-day tumbling window)
- Time since last interaction per category
Online store: DynamoDB (fast lookups for real-time recommendations)
Offline store: Databricks (training data for the recommendation model)
Why CDC matters: User preferences shift rapidly during sales, seasonal changes, or viral trends. A model using yesterday’s engagement features suggests irrelevant items. CDC keeps features current so recommendations reflect what users are interested in right now.
Pattern 3: Credit Risk Scoring
Source: CDC from a loans database and a payments database (SQL Server)
Features computed via Flink:
- Total outstanding exposure across all active loans
- Payment delinquency count in last 90 days
- Recent application velocity (loan applications in last 7 days)
- Debt-to-income ratio (joining loan data with income records)
- Payment pattern regularity (standard deviation of days between payments)
Online store: Redis (real-time risk scoring at loan origination)
Offline store: BigQuery (regulatory reporting and model retraining)
Why CDC matters: A borrower’s risk profile can change dramatically within a single day. Batch-computed risk features miss intraday changes, leading to increased exposure. CDC ensures real-time evaluation based on current financial state.
Pattern 4: Customer Personalization
Source: CDC from customers, orders, and support_tickets databases (PostgreSQL)
Features computed via Flink:
- Customer lifetime value (running sum of order amounts)
- Purchase recency and frequency (RFM metrics computed continuously)
- Churn risk signals (declining order frequency, increasing support contacts)
- Engagement score (composite of login frequency, feature usage, order activity)
Online store: Redis (powering personalized UI, pricing, and messaging)
Offline store: Apache Iceberg on S3 (training data for personalization models)
Why CDC matters: A customer who just filed a support complaint should not receive a cheerful upsell email 10 minutes later because the batch pipeline hasn’t propagated their sentiment score. CDC keeps personalization signals current.
Feature Pipeline Architecture: Batch vs. Streaming vs. Hybrid
Not every feature pipeline needs to be real-time. Here’s an honest comparison of three approaches.
| Approach | Feature Freshness | Complexity | Best For |
|---|---|---|---|
| Batch-only | Hours to days | Low — familiar tools (Airflow, dbt) | Slow-changing features, offline-only models |
| Streaming-only | Sub-second | Higher — requires stream processing expertise | Features where staleness has measurable cost |
| Hybrid (recommended) | Mixed — seconds for real-time, hours for batch | Medium — two systems, each doing what it does best | Most production ML systems |
Batch pipelines are the right choice when features genuinely don’t need to be fresh. If you predict quarterly churn risk, daily updates are adequate. Don’t over-engineer.
Streaming-only pipelines make sense when feature freshness directly impacts business outcomes—fraud detection, real-time pricing, personalization. The trade-off is operational complexity, which is why managed platforms matter.
The hybrid approach is what most mature ML teams converge on. You don’t need sub-second freshness for “customer’s home state” or “account creation date.” But “transaction count in the last hour” absolutely needs real-time updates. CDC powers the features that need freshness, batch handles the rest, and the feature store unifies both—presenting a single interface to the model regardless of how each feature was computed.
Why Managed CDC for Feature Pipelines
Building a real-time feature pipeline from scratch means setting up and managing Debezium for CDC capture, Apache Kafka for the event stream, Apache Flink for stream processing, state backends for Flink’s stateful computations, Schema Registry for schema changes, monitoring for all of the above, and on-call rotations for when things break at 3 AM.
That’s months of engineering work before you’ve computed a single feature. ML teams should spend their time on feature engineering and model development, not distributed systems infrastructure. That’s the core argument for managed CDC.
What Streamkap Handles for You
Streamkap collapses the entire infrastructure stack into a managed platform:
- CDC capture from PostgreSQL, MySQL, MongoDB, DynamoDB, SQL Server, Oracle, and 50+ other sources — with sub-250ms end-to-end latency
- Managed Apache Flink for feature computation — write SQL, Python, or TypeScript; Streamkap runs the infrastructure
- Delivery to any feature store — Snowflake, BigQuery, Databricks, ClickHouse, Apache Iceberg, Kafka, and more
- Automatic schema evolution — when your source tables change, the pipeline adapts without breaking
- Self-healing pipelines — automatic recovery from transient failures without manual intervention
- API and Terraform provider — automate pipeline creation and management through CI/CD, treating your feature pipelines as code
The setup time goes from months (DIY) to minutes (Streamkap). And you get sub-second latency out of the box, which is the entire point of building real-time feature pipelines in the first place.
The Cost Comparison
A self-managed CDC + Flink stack requires at least 2-3 dedicated infrastructure engineers—$400,000-$600,000+ per year in salary alone. Streamkap’s pricing starts at $600/month for Starter and $1,800/month for Scale, with managed Flink at $250/vCPU/month. Spend your engineering budget on feature engineering and model improvement, not on keeping Kafka and Flink clusters healthy.
Getting Started: Your First CDC-Powered Feature Pipeline
Ready to build a real-time feature pipeline? Here’s a practical, step-by-step path to get from zero to production.
Step 1: Identify Your High-Value Feature Sources
Identify which database tables drive your most important features. Focus on tables where data changes frequently and freshness matters: transaction tables (fraud), user activity tables (recommendations), order tables (demand forecasting), and account tables (risk scoring). Pick 1-2 high-impact tables and prove the value first.
Step 2: Connect CDC from Those Sources
With Streamkap, add your source database (PostgreSQL, MySQL, MongoDB, etc.) as a connector, select the tables to capture, and CDC begins streaming changes immediately—no database modifications required for most sources.
Step 3: Define Feature Computations
Write feature transformation logic using Streamkap’s managed Flink. Start with basic windowed aggregations, then add joins and derived features as you identify what improves model performance. Use SQL for straightforward transformations, Python for complex logic, or TypeScript for type-safe pipelines.
Step 4: Route Features to Your Stores
Configure delivery to both stores: route computed features to Redis or DynamoDB (online, for serving) and to Snowflake, BigQuery, Databricks, or Iceberg (offline, for training). Both destinations receive the same features, ensuring consistency from day one.
Step 5: Point Your Models at Fresh Features
Update your model serving infrastructure to read from the online feature store. Monitor accuracy metrics, track feature freshness, and measure training-serving skew. You should see an immediate improvement—not because the model changed, but because it’s finally seeing the reality it was trained to understand.
The Bottom Line
ML model performance is only as good as the features feeding the model. When features are stale, models degrade—not because the model is wrong, but because its view of the world is outdated. CDC solves the freshness problem at the source, enabling feature pipelines that are always current.
Combined with managed stream processing, you can compute features in real time and deliver them to both online and offline feature stores simultaneously—keeping training and serving consistent. Managed CDC platforms like Streamkap mean you can have production-ready, real-time feature pipelines running in minutes, with sub-second latency, automatic schema evolution, and self-healing reliability.
If your models are underperforming in production, the culprit might not be the model—it might be the data. Start with CDC, and give your models the fresh features they were trained to expect.
Start your free trial and build your first real-time feature pipeline today. Or explore how Streamkap powers AI/ML pipelines with purpose-built CDC infrastructure.