
AI & Agents

March 10, 2026

10 min read

Streaming Semantic Layers: Why Batch Definitions Break AI Agents

Semantic layers were designed for BI tools and human analysts. When AI agents become the consumer, batch-updated definitions create a new class of failures. Here's why streaming semantic layers are the next evolution.

TL;DR:

  • Traditional semantic layers (dbt metrics, Cube, AtScale, Looker) define metrics once so every dashboard uses the same calculation. They work well for BI because humans query occasionally, definitions change slowly, and hourly freshness is acceptable.
  • AI agents break all three assumptions. They query thousands of times per hour, the data they need context for changes constantly, and they require sub-second freshness for real-time decisions. Batch-updated semantic layers create a new class of failures when agents are the consumer.
  • Streaming semantic layers update definitions in real time as underlying data evolves. CDC detects schema changes and metadata updates and pushes them to the semantic layer automatically. This is new territory, but the direction is clear.

The semantic layer is one of the genuinely useful concepts to emerge from the modern data stack era. The idea is simple and correct: define your business metrics and dimensions in one place, and make every consumer, whether it’s a dashboard, a report, or an ad-hoc query, use those same definitions. No more “my revenue number doesn’t match your revenue number” conversations.

dbt metrics, Cube, AtScale, Looker’s modeling layer, and similar tools have brought this concept to production. Data teams define what “revenue” means, how “churn” is calculated, which table is the source of truth for “customers,” and these definitions become the shared foundation for every downstream consumer.

This works remarkably well for BI. It was designed for BI. And that’s precisely the problem.

The Three Assumptions

Traditional semantic layers were built on three assumptions that were perfectly reasonable when the primary consumer was a human analyst using a BI tool:

Assumption 1: Consumers query occasionally. A BI dashboard refreshes every few minutes. An analyst runs a query a few times a day. The semantic layer needs to handle hundreds or maybe thousands of queries per hour across the entire organization.

Assumption 2: Definitions change slowly. Your company’s definition of “revenue” doesn’t change every day. New products launch quarterly. Fiscal year boundaries shift annually. Schema changes happen during planned migrations. The semantic layer can be updated on a schedule, with a human reviewing changes before they go live.

Assumption 3: Freshness is measured in hours. Nobody complains that their Monday morning dashboard reflects data as of 6 AM. Hourly refreshes are considered “near real-time” by BI standards. Most organizations are happy with daily refreshes for anything other than operational dashboards.

These assumptions held for years. Then AI agents entered the picture.

How Agents Break Every Assumption

AI agents are a fundamentally different kind of consumer from human analysts, and they violate all three assumptions simultaneously.

Agents query constantly. A single agent handling customer support might issue hundreds of queries per hour. A fleet of agents across sales, support, finance, and operations could generate tens of thousands of queries per hour. And these aren’t scheduled dashboard refreshes; they’re ad-hoc queries driven by real-time user interactions. The query volume alone can stress semantic layer implementations that were designed for BI workloads.

The underlying data changes constantly. Not the definitions themselves (though those change too), but the data that the definitions describe. New products launch. New customers sign up. New data sources get connected. New columns appear in tables as the engineering team ships features. The business is a moving target, and the semantic layer’s description of it needs to keep up.

Agents need sub-second freshness. When a customer asks an AI support agent about an order they placed two minutes ago, the agent needs to see that order. When an inventory agent is deciding whether to trigger a restock alert, it needs current stock levels. Hourly freshness isn’t just suboptimal for agents; it’s a source of actively wrong decisions.

Specific Failure Modes

Let’s trace through what happens when batch-updated semantic layers meet real-time agent consumers.

New Product Launch

Your company launches a new product at noon on Tuesday. The engineering team has been working on it for months. New tables exist in the database: product_skus, pricing_tiers_v2, and new_product_orders. The application is live, customers are buying.

At 2 PM, someone asks the sales agent: “How is the new product performing?” The agent consults the semantic layer. The semantic layer has no definition for the new product’s revenue. Its data source mappings don’t include the new tables. The metric for “total company revenue” doesn’t account for the new product’s orders.

The agent can’t answer the question. Or worse, it returns total company revenue that doesn’t include any new product sales, because the semantic layer’s definition of revenue only sums from the existing orders table. It gives a correct-looking number that’s misleading.

The semantic layer won’t be updated until the data team runs their weekly definitions refresh, or until someone manually adds the new product. In the meantime, every agent query about revenue is wrong.
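The gap is easy to see in miniature. The sketch below simulates the scenario above with a hypothetical metric definition and an in-memory "warehouse"; the table names and definition structure are illustrative, not a real semantic layer API.

```python
# Batch semantic layer: "revenue" was defined before the launch and
# only knows about the legacy orders table.
revenue_definition = {
    "metric": "total_revenue",
    "source_tables": ["orders"],  # new_product_orders is missing
}

# Simulated warehouse state two hours after launch.
warehouse = {
    "orders": [120.0, 80.0, 45.0],         # legacy product sales
    "new_product_orders": [199.0, 199.0],  # invisible to the metric
}

def compute_metric(definition, tables):
    """Sum revenue only from the tables the definition knows about."""
    return sum(
        amount
        for table in definition["source_tables"]
        for amount in tables.get(table, [])
    )

reported = compute_metric(revenue_definition, warehouse)
actual = sum(sum(rows) for rows in warehouse.values())
print(reported)  # 245.0 -- correct-looking, but missing the new product
print(actual)    # 643.0
```

The reported number is internally consistent and passes every sanity check the agent can run, which is exactly what makes it dangerous.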

Schema Migration

The engineering team renames user_id to account_id across several tables as part of a normalization effort. They deploy on Wednesday evening. The semantic layer’s metric definitions still reference user_id. Every query that relies on joining through user_id breaks.

In a BI context, this would surface as a broken dashboard, and a human would notice and fix it. But agents don’t see “broken.” They see a query that fails and either return an error or try a different approach, possibly joining on the wrong column. The failure mode is unpredictable.

With a batch semantic layer, the definitions won’t be updated until someone notices the breakage, diagnoses the cause, and manually updates the metric definitions. That could take hours or days.

Acquisition Integration

Your company acquires a competitor. On day one, the combined customer base exists in two separate systems with two separate customers tables. The semantic layer’s definition of “total customers” counts only your original table.

For weeks or months while the systems are being integrated, agents answering questions about customer counts are wrong by the entire acquired customer base. The semantic layer says the company has 10,000 customers when it actually has 23,000. Reports generated by agents during this period are useless.

Business Rule Changes

The finance team decides that free-tier users should now be counted as “customers” for reporting purposes (they previously weren’t). They update their internal documentation and notify the BI team. The BI team updates the semantic layer definition during their next sprint.

For the two weeks between the policy change and the semantic layer update, agents are using the old definition. Any agent-generated report about customer counts contradicts what the finance team considers correct. Both the agent and the finance team think they’re right. This kind of ambiguity is exactly what semantic layers were supposed to prevent.

The Streaming Semantic Layer Concept

The solution is a semantic layer that updates in real time as the underlying data evolves. Instead of rebuilding definitions on a batch schedule, a streaming semantic layer receives change events and adjusts continuously.

Here’s what that looks like for each failure mode:

New tables detected automatically. When CDC captures a schema change that creates new_product_orders, the streaming semantic layer receives a notification. It can flag the new table for review, suggest adding it to the revenue calculation, or automatically incorporate it if auto-mapping rules are configured.

Schema changes propagated instantly. When user_id becomes account_id, the CDC stream includes a schema evolution event. The streaming semantic layer updates its column references automatically. Queries that previously joined on user_id now join on account_id. No manual intervention required.
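A minimal sketch of what that propagation could look like, assuming a simple in-memory definition store; the definition format and function are hypothetical, meant only to show that a rename is a mechanical rewrite once the event arrives.

```python
# Illustrative definition store for one metric.
definitions = {
    "active_users": {
        "source_table": "sessions",
        "join_key": "user_id",
        "filters": ["user_id IS NOT NULL"],
    }
}

def apply_rename(defs, table, old_col, new_col):
    """Rewrite every reference to a renamed column, in place."""
    for metric in defs.values():
        if metric["source_table"] != table:
            continue
        if metric["join_key"] == old_col:
            metric["join_key"] = new_col
        metric["filters"] = [
            f.replace(old_col, new_col) for f in metric["filters"]
        ]

# CDC delivers the DDL event: user_id -> account_id on sessions.
apply_rename(definitions, table="sessions",
             old_col="user_id", new_col="account_id")
print(definitions["active_users"]["join_key"])  # account_id
```

A production version would need to parse SQL expressions rather than do string replacement, but the shape of the operation is the same: event in, references rewritten, no human in the loop.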

Data source mappings that expand dynamically. When a new data source comes online (acquisition integration, new third-party connector, additional microservice), the streaming semantic layer detects the new tables and makes them available for metric definitions. It doesn’t require a human to manually register every new table.

Business rule propagation. This is the hardest part, because business rules are inherently human decisions. A streaming semantic layer can’t automatically know that free-tier users should now count as customers. But it can detect that the underlying data has changed (free-tier users started appearing in customer-count queries) and flag the discrepancy for review. It surfaces inconsistencies faster than a batch process would.

The Technical Architecture

A streaming semantic layer sits at the intersection of three systems:

CDC pipelines capture both data changes and schema changes from source databases. Streamkap’s CDC from PostgreSQL, MySQL, and MongoDB captures not just row-level INSERT/UPDATE/DELETE events but also DDL changes: new columns, dropped columns, renamed tables, type changes. These schema evolution events are first-class events in the streaming pipeline.
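To make "schema evolution events are first-class" concrete, here is a simplified event loosely modeled on the messages Debezium-style CDC pipelines emit for DDL changes, plus a routing function a streaming semantic layer might apply. The field names and routing rules are assumptions for illustration; real payloads vary by connector and platform.

```python
# A simplified schema-evolution event (illustrative shape).
schema_change_event = {
    "source": {"db": "app", "table": "users"},
    "ddl": "ALTER TABLE users RENAME COLUMN user_id TO account_id",
    "change": {
        "type": "rename_column",
        "old_name": "user_id",
        "new_name": "account_id",
    },
    "ts_ms": 1741600000000,
}

def route(event):
    """Decide what the semantic layer should do with a DDL event."""
    kind = event["change"]["type"]
    if kind in {"rename_column", "drop_column", "type_change"}:
        return "update_definitions"   # mechanical, safe to automate
    if kind == "create_table":
        return "flag_for_review"      # needs a human or mapping rules
    return "ignore"

print(route(schema_change_event))  # update_definitions
```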

A definition store holds the current state of all metric definitions, data source mappings, business rules, and semantic descriptions. This is similar to what dbt metrics or Cube provide today, but with an additional layer that processes incoming change events and updates definitions dynamically.

An agent-facing API (MCP server, tool definitions, REST/GraphQL endpoint) lets agents query both the definitions and the underlying data. When an agent asks about revenue, it gets back: the definition, the calculation logic, the source tables, the current freshness of each table, and optionally the computed result.
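What might such a response look like? A hypothetical payload is sketched below; every field name is an assumption, chosen to illustrate the idea of bundling definition, lineage, and freshness in one call.

```python
from datetime import datetime, timezone

def describe_metric(name):
    """Hypothetical agent-facing response for a metric query."""
    return {
        "metric": name,
        "definition": "Sum of order totals, excluding refunds",
        "calculation": "SUM(orders.total) WHERE status != 'refunded'",
        "source_tables": {
            "orders": {
                "last_synced": datetime(2026, 3, 10, 14, 2,
                                        tzinfo=timezone.utc),
                "lag_seconds": 1.4,  # CDC end-to-end latency
            }
        },
        "result": 1_284_903.17,  # optionally, the computed value
    }

response = describe_metric("total_revenue")
# Because freshness is in the payload, the agent can reason about
# staleness before trusting the number.
fresh = all(t["lag_seconds"] < 5
            for t in response["source_tables"].values())
print(fresh)  # True
```

The freshness metadata is the part batch semantic layers lack entirely: an agent cannot hedge its answer ("as of 90 seconds ago...") if it has no idea how stale its sources are.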

The data flow is:

  1. Source database changes (data or schema)
  2. CDC captures the change event
  3. Data changes flow to the warehouse/analytical store
  4. Schema changes flow to the semantic layer’s definition store
  5. The definition store updates mappings, flags new entities, adjusts column references
  6. Agents query the semantic layer and get current definitions plus current data

What Exists Today vs. What’s Coming

Let’s be honest about the state of the art: as of early 2026, nobody has shipped a complete streaming semantic layer. The concept is emerging from the convergence of two mature trends, but the specific integration is new.

What exists today:

  • Semantic layers (dbt metrics, Cube, AtScale) provide batch-updated definitions that work well for BI
  • CDC platforms (Streamkap, Debezium) provide real-time data changes and schema evolution events
  • Agent frameworks (LangChain, CrewAI, custom MCP servers) provide the agent-to-data interface

What’s missing:

  • Automated routing of schema evolution events from CDC to semantic layer definitions
  • Dynamic metric definition updates triggered by data changes
  • Agent-optimized query interfaces that combine definitions and data in a single call
  • Freshness metadata attached to every semantic layer response

The pieces exist. The integration doesn’t, not as a product you can buy today. But the direction is obvious, and companies building agent-powered applications are assembling these pieces manually.

Why This Matters for Your Agent Strategy

If you’re building AI agents that interact with company data, the semantic layer question will find you whether you plan for it or not. The first time an agent gives a wrong revenue number because it didn’t know your company’s definition, or the first time it answers a question about a product that doesn’t exist in its metadata, you’ll be forced to address it.

The proactive approach is to start with two foundations:

  1. Real-time data infrastructure. Get CDC running from your production databases to your analytical stores. This gives you fresh data and, critically, schema evolution events that will feed a streaming semantic layer when you build one. Streamkap provides this out of the box.

  2. Structured metric definitions. Even if you start with a simple YAML file listing your top 20 metrics and their definitions, you’re ahead of most companies. Make this file queryable by agents (serve it via an MCP server or include it in the agent’s tool definitions).
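As a starting point for the second foundation, here is a sketch of the simple definitions file described above, expressed as a Python dict with a lookup function an agent tool handler could call. The metric names and fields are illustrative; the same content could live in the YAML file and be loaded at startup.

```python
# Canonical metric definitions an agent can query.
METRICS = {
    "revenue": {
        "definition": "Sum of completed order totals, net of refunds",
        "source": "analytics.orders",
        "owner": "finance",
    },
    "customers": {
        "definition": "Distinct accounts with at least one completed order",
        "source": "analytics.accounts",
        "owner": "finance",
    },
}

def lookup_metric(name):
    """Tool handler: return the canonical definition or a clear miss."""
    metric = METRICS.get(name.lower().strip())
    if metric is None:
        return {"error": f"No canonical definition for '{name}'"}
    return {"metric": name, **metric}

print(lookup_metric("Revenue")["source"])  # analytics.orders
print(lookup_metric("churn"))  # explicit miss, not a hallucinated answer
```

Note the explicit miss: an agent that receives "no canonical definition" can say so, instead of inventing a calculation on the fly.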

These two foundations, fresh data and structured definitions, give you 80% of the value. The full streaming semantic layer, with automatic schema propagation and dynamic definition updates, is the remaining 20% that will develop as the tooling matures.

The worst approach is to wait. Every day agents operate without a semantic layer, they accumulate incorrect answers, erode user trust, and make the eventual fix harder because now you’re fighting against established skepticism.

Start with what you have. Stream everything. Define what matters. The streaming semantic layer will follow.


Ready to build the real-time data foundation for your agents? Streamkap provides CDC from PostgreSQL, MySQL, and MongoDB with automatic schema evolution, giving you the fresh data and metadata events that streaming semantic layers require. Start a free trial or explore Streamkap’s AI and agent capabilities.