From Data Freshness to Context Freshness: The Next Wave of Real-Time Infrastructure
Wave 1 was fresh data (CDC to warehouses). Wave 2 is fresh context (streaming semantic layers for agents). Here's why context freshness is the next frontier and what it means for your data stack.
In 2015, if you asked a data engineer about real-time data, they would describe the dream of replacing nightly batch ETL jobs with something faster. Dashboards that updated every hour instead of every morning. Reports that reflected yesterday’s data instead of last week’s. The gap between “something happens in production” and “an analyst can see it” was measured in hours or days.
By 2023, that dream was largely realized. CDC platforms like Debezium, Streamkap, and managed offerings from the major cloud providers made it possible to stream database changes to warehouses in seconds. Kafka became the central nervous system of data infrastructure. Real-time dashboards went from aspirational to expected.
This was Wave 1 of real-time infrastructure. It was about moving data faster. And it worked. But it turns out that fresh data alone doesn’t solve the problems that matter most in 2026.
Wave 1: Fresh Data (2015 to 2023)
The first wave was defined by a simple metric: data latency. How long does it take for a change in a production database to be queryable in an analytical system?
The starting point: Nightly batch ETL. A cron job runs at 2 AM, extracts everything from the production database, transforms it, and loads it into the warehouse. Data latency: 12 to 24 hours.
The middle ground: Micro-batch ETL. Tools like Fivetran and Airbyte poll source databases every 5 to 15 minutes. Data latency: 15 minutes to 1 hour.
The destination: Streaming CDC. Debezium, Streamkap, and similar tools read the database’s transaction log and stream changes as they happen. Data latency: 1 to 5 seconds.
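At the streaming-CDC end of that progression, the unit of work is a single change event read from the database's transaction log. A minimal sketch of what one might look like, loosely modeled on the Debezium envelope (field names and values here are illustrative, not a spec):

```python
import json
import time

# Illustrative change event, loosely modeled on the Debezium envelope.
# Field names and values are assumptions for illustration.
change_event = {
    "source": {"db": "shop", "table": "orders", "lsn": 914021},
    "op": "u",  # c = insert, u = update, d = delete
    "before": {"id": 42, "status": "pending"},
    "after": {"id": 42, "status": "shipped"},
    "ts_ms": int(time.time() * 1000),  # commit timestamp
}

def data_latency_ms(event, queryable_at_ms):
    """Wave 1's metric: time from commit to queryable."""
    return queryable_at_ms - event["ts_ms"]

print(json.dumps(change_event["after"]))
```

The `ts_ms` field is what makes the Wave 1 success metric measurable at all: data latency is simply the gap between the commit timestamp and the moment the row is queryable downstream.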
The primary consumer in Wave 1 was a human analyst looking at a dashboard. And for that use case, the progression from daily to hourly to sub-minute freshness was genuinely valuable. Operational dashboards became possible. Finance teams could close the books faster. Support teams could see recent customer activity without waiting for the next batch load.
The companies that won Wave 1 did so by making data movement reliable, affordable, and easy to set up. The technology challenge was significant: reading transaction logs from production databases without impacting performance, handling schema evolution, ensuring exactly-once delivery, and managing backfills is non-trivial engineering. But the problem was well-defined: move bytes from A to B, fast.
Success metric for Wave 1: Data latency in seconds. How fast does a committed transaction become queryable?
The Limits of Fresh Data
By the early 2020s, the leading organizations had achieved sub-second data latency. Their warehouses reflected production reality within seconds. And yet, the analysts and decision-makers consuming that data were still struggling.
Why? Because fresh data alone doesn’t tell you what the data means.
A warehouse table updated in real time is still just rows and columns. If you don’t know that the amount column in the orders table represents gross revenue before tax and returns, you’ll calculate the wrong total. If you don’t know that the customers table includes test accounts, your customer count will be wrong. If you don’t know that the fiscal year starts in February, not January, your quarterly reports will slice the data incorrectly.
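To make those failure modes concrete, here is a hedged sketch (tables, column names, and business rules are invented for illustration) showing how the same fresh rows yield different answers depending on whether the business context is applied:

```python
# Hypothetical rows from a real-time-replicated warehouse.
orders = [
    {"customer_id": 1, "amount": 100.0, "tax": 8.0, "returned": False},
    {"customer_id": 2, "amount": 50.0,  "tax": 4.0, "returned": True},
]
customers = [
    {"id": 1, "email": "alice@example.com"},
    {"id": 2, "email": "qa-test@internal.example"},  # test account
]

# Context-free reading: sum the amount column, count every row.
naive_revenue = sum(o["amount"] for o in orders)
naive_customers = len(customers)

# Context-aware reading: amount is gross before tax and returns,
# and test accounts must be excluded from customer counts.
net_revenue = sum(o["amount"] - o["tax"] for o in orders if not o["returned"])
real_customers = sum(1 for c in customers if "test" not in c["email"])

print(naive_revenue, net_revenue, naive_customers, real_customers)
```

The rows are identical and perfectly fresh in both readings; only the second one is correct, and everything that makes it correct lives outside the table.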
Humans compensated for this. Experienced analysts carried business context in their heads. They knew which tables to trust, which columns meant what, and which calculations had tricky edge cases. The data was fresh, and the human added the context.
Then AI agents showed up, and the human-as-context-layer pattern broke completely.
Wave 2: Fresh Context (2024 to 2026)
AI agents don’t have institutional knowledge. They don’t know your business conventions. They can’t call a colleague to ask which table is the source of truth. Every query is their first query, and they need everything explained explicitly.
This created a new requirement: not just fresh data, but fresh understanding of what that data means. The metric that matters shifted from data latency to context latency.
Context latency is the time between when something changes about your business (a new product launches, a metric definition evolves, a schema migrates, a business rule updates) and when the systems consuming your data understand that change.
In most organizations today, context latency is measured in days or weeks:
- A new product launches on Monday. The semantic layer is updated on Thursday when the data team gets to the ticket.
- A schema migration renames columns on Wednesday. The agent’s metadata is corrected the following Monday.
- The finance team changes how they calculate net revenue. The documentation is updated next quarter.
This was tolerable when humans were the consumer, because humans are adaptive. They hear about the new product launch in a Slack channel. They notice the schema change when their query breaks and fix it. They get the memo about the new revenue calculation.
Agents don’t hear about things in Slack. They don’t notice anomalies in the same way. They work with whatever metadata and definitions they’ve been given, and if those definitions are stale, the answers are wrong.
Wave 2 is about making context as fresh as data. When a new product launches and new tables appear, the agent’s understanding of revenue should update within seconds, not days. When a schema migrates, the agent’s metadata should reflect the new column names immediately. When a business rule changes, the agent should apply the new rule from that moment forward.
The technology building blocks for Wave 2 include:
- CDC with schema evolution tracking. CDC platforms already capture schema changes alongside data changes. The missing piece is routing those schema events to semantic layers and agent metadata stores.
- Streaming semantic layers. Metric definitions that update in real time as the underlying data and schemas evolve. This is where traditional semantic layers (dbt metrics, Cube) need to go.
- Context delivery protocols. Standards like MCP (Model Context Protocol) that give agents structured access to both data and the context about that data. Not just “here’s the query result” but “here’s the result, here’s what the metric means, here’s how current the data is, and here’s what changed since the last query.”
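The building blocks above can be sketched end to end in a few lines. This is a minimal illustration, not a real API: the event shape, store layout, and response fields are all assumptions. It shows a CDC schema-evolution event being routed into an agent-facing metadata store, and a context-enriched response that reports how current that context is:

```python
import time

# Hypothetical in-memory metadata store, keyed by table name.
metadata = {
    "orders": {"columns": ["id", "amount", "created_at"], "updated_at_ms": 0},
}

def on_schema_change(event):
    """Route a CDC schema-evolution event into the agent metadata store."""
    metadata[event["table"]] = {
        "columns": event["columns"],
        "updated_at_ms": event["ts_ms"],
    }

def context_for(table, now_ms):
    """MCP-style reply: not just columns, but how current the context is."""
    entry = metadata[table]
    return {
        "columns": entry["columns"],
        "context_latency_ms": now_ms - entry["updated_at_ms"],
    }

# A migration renames amount -> gross_amount; the agent's view follows
# in one step instead of waiting for a human to update a ticket.
on_schema_change({
    "table": "orders",
    "columns": ["id", "gross_amount", "created_at"],
    "ts_ms": int(time.time() * 1000),
})
```

The point of the sketch is the wiring, not the store: once schema events already flowing through CDC are routed to wherever agents read their metadata, context latency collapses from a ticket queue to an event hop.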
Success metric for Wave 2: Context latency in seconds. How fast does a new business concept, changed definition, or evolved schema become available to agents?
The Compounding Effect
Fresh data with stale context produces a specific class of errors: technically current numbers that are wrong because they’re calculated or interpreted incorrectly.
Stale data with fresh context produces a different class: correctly calculated numbers that don’t reflect the current state of the business.
Only the combination of fresh data and fresh context produces answers that are both current and correct.
This is why Wave 2 isn’t a replacement for Wave 1. It’s additive. You need CDC for data freshness AND context streaming for context freshness. The infrastructure stack grows:
| Metric | Wave 1 only | Wave 2 only | Both |
|---|---|---|---|
| Data latency | Seconds | Hours to days | Seconds |
| Context latency | Days to weeks | Seconds | Seconds |
| Agent accuracy | Low (fresh but misunderstood data) | Low (correct interpretation of stale data) | High |
Companies that invested in Wave 1 infrastructure have a head start on Wave 2 because they already have the streaming foundation. CDC pipelines that capture schema evolution events are the raw material for context streaming. The incremental step is routing those events to semantic layers and agent metadata stores.
Companies that skipped Wave 1 and are still running batch ETL face a compounding deficit: they need to solve both data freshness and context freshness simultaneously, against competitors who are already streaming both.
Wave 3: Fresh Decisions (Emerging)
There’s a third wave forming that goes beyond data and context to the decisions themselves.
In Waves 1 and 2, agents query data and context, then make decisions based on what they find. The decision-making logic lives in the agent’s code, its prompts, its tool definitions. That logic is essentially static between deployments. If you want an agent to change how it makes decisions, you update its code or prompts and redeploy.
Wave 3 asks: what if the decision logic itself was streaming? What if agents could evolve their decision patterns in real time based on outcomes?
Consider an inventory restocking agent. In Waves 1 and 2, this agent queries fresh inventory data, consults a context layer to understand reorder points and supplier lead times, and triggers a restock order when inventory drops below a threshold. The threshold is configured statically.
In Wave 3, the threshold adjusts continuously based on real-time signals. A sudden spike in demand (detected via streaming CDC from the orders database) triggers an immediate threshold increase. A supplier delay (detected via an event from the procurement system) shifts orders to alternative suppliers. Seasonal patterns, detected through streaming aggregation of historical data, pre-adjust inventory levels before demand materializes.
This is where Apache Flink enters the picture. Flink provides the streaming compute layer for decision logic that operates on continuous event streams. A Flink job can:
- Aggregate signals from multiple real-time sources
- Apply windowed calculations (trending demand over the last 4 hours)
- Pattern-match across event sequences (this combination of events has preceded stockouts 80% of the time)
- Trigger actions based on complex, evolving conditions
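Flink itself is JVM-based, but the windowed logic such a job would run can be sketched in plain Python. This is a toy sliding-window aggregate, with the window size taken from the example above and the threshold rule invented for illustration:

```python
from collections import deque

WINDOW_SECONDS = 4 * 3600  # trending demand over the last 4 hours

class DemandWindow:
    """Minimal sliding-window aggregate over order events."""

    def __init__(self):
        self.events = deque()  # (timestamp_s, quantity)
        self.total = 0

    def add(self, ts, qty):
        self.events.append((ts, qty))
        self.total += qty
        # Evict events that have fallen out of the 4-hour window.
        while self.events and self.events[0][0] <= ts - WINDOW_SECONDS:
            _, old_qty = self.events.popleft()
            self.total -= old_qty

    def reorder_threshold(self, base=100):
        # Hypothetical rule: double the restock threshold on a demand spike.
        return base * 2 if self.total > 500 else base

w = DemandWindow()
for ts, qty in [(0, 200), (600, 200), (1200, 150)]:
    w.add(ts, qty)
```

In a real deployment this state lives inside a Flink operator with checkpointing and event-time semantics; the sketch only shows the shape of the computation, namely a threshold that is a continuous function of the stream rather than a static config value.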
The agent isn’t just querying fresh data with fresh context. It’s running continuous computation over event streams and making decisions that evolve as conditions change. This is qualitatively different from a request-response agent that queries a warehouse.
Success metric for Wave 3: Decision latency. How fast does a change in conditions translate to a changed decision or action?
The Industry Map
These three waves map onto distinct layers of the data infrastructure stack, and different companies are positioned at different layers:
Wave 1 players (data freshness): Fivetran, Airbyte, Debezium, database-native CDC, cloud-native streaming services. This layer is maturing and beginning to commoditize. Competition is increasingly on price, reliability, and breadth of connectors.
Wave 2 players (context freshness): dbt (metrics layer), Cube, AtScale, plus the emerging category of agent-focused metadata services. This layer is nascent, and nobody has fully captured it yet. The winner will be whoever integrates streaming data change events into semantic definitions most effectively.
Wave 3 players (decision freshness): Confluent (ksqlDB), Apache Flink ecosystem, and the emerging “agent compute” category. This layer is early and mostly occupied by infrastructure-heavy platforms that require significant expertise to operate.
The strategic question for data platform companies is: which waves do you serve?
Companies that only serve Wave 1 will face margin pressure as data movement commoditizes. The differentiation is no longer “can you stream CDC?” but “what can you do with the stream?”
Companies that serve Waves 1 and 2 can charge for value (agent accuracy) rather than volume (bytes moved). The context layer is where business-specific knowledge lives, and it’s harder to commoditize because every company’s context is different.
Companies that serve all three waves own the full pipeline from raw database changes to autonomous decisions. This is the most valuable position and the hardest to build.
Where Streamkap Fits
Streamkap’s trajectory maps directly onto these three waves:
Wave 1 (today): Production-grade CDC from PostgreSQL, MySQL, MongoDB, and other databases to warehouses, lakes, and operational stores. This is the foundation. Data latency measured in seconds. Schema evolution handled automatically. Exactly-once delivery guarantees.
Wave 2 (building): Context streaming that routes schema changes, metadata updates, and structural events to semantic layers and agent metadata stores. As agents become a primary consumer of data infrastructure, keeping their context current becomes as important as keeping their data current.
Wave 3 (roadmap): Managed Flink for agent compute. Streaming decision logic that operates on continuous event streams, applies evolving rules, and triggers actions autonomously. This is where agents go from “query and respond” to “continuously process and decide.”
This isn’t just a product roadmap. It’s a thesis about how the data infrastructure industry evolves. The center of gravity is shifting up the stack, from raw data movement to contextualized data to autonomous decisions. Companies that anticipate this shift and build accordingly will define the next era of data infrastructure.
What This Means for You
If you’re running batch ETL today, you’re behind on Wave 1. Start here. Set up CDC from your production databases. Streamkap can have a pipeline running in minutes. The immediate payoff is fresher data and a foundation for everything that follows.
If you have streaming data but your agents are inaccurate, you’re in the Wave 1 to Wave 2 gap. Your data is fresh but your context is stale. Start building a context layer, even a simple one. Document your key metrics and make those definitions available to agents. Then connect CDC schema evolution events to keep those definitions current.
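A "simple context layer" can start as little more than versioned metric definitions that agents can read alongside query results. The metric name, definition text, and caveats below are placeholders, not a recommended schema:

```python
from datetime import datetime, timezone

# A minimal context layer: metric definitions an agent reads with its data.
METRICS = {
    "net_revenue": {
        "definition": "SUM(amount - tax) over non-returned orders",
        "caveats": [
            "amount is gross, before tax and returns",
            "fiscal year starts in February",
        ],
        "updated_at": datetime(2026, 1, 5, tzinfo=timezone.utc),
    },
}

def describe_metric(name):
    """What an agent should receive alongside every query result."""
    m = METRICS[name]
    caveats = "; ".join(m["caveats"])
    return (f"{name}: {m['definition']} "
            f"(caveats: {caveats}; last reviewed {m['updated_at'].date()})")
```

Even a dictionary like this beats nothing: it gives agents explicit definitions instead of guesses, and once it exists, wiring CDC schema events to update it is the incremental step rather than a ground-up build.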
If you’re thinking about agents that operate autonomously (restocking inventory, adjusting pricing, routing support tickets), you’re looking at Wave 3. This requires streaming compute (Flink) on top of fresh data and fresh context. It’s the most complex layer, but it’s also where the highest-value use cases live.
The progression is sequential. You can’t effectively do Wave 2 without Wave 1 (fresh context on stale data is still wrong). You can’t do Wave 3 without Waves 1 and 2 (autonomous decisions need both fresh data and fresh context).
Start where you are. Stream everything you can. Define what matters. The infrastructure you build for today’s use cases becomes the foundation for what’s next.
Ready to build the foundation for all three waves? Streamkap delivers production-grade CDC with sub-second latency, schema evolution tracking, and a roadmap toward context streaming and managed Flink. Start a free trial or learn more about agents.