Why AI Agents Can't Use Batch Data
Batch ETL was built for humans looking at dashboards. AI agents that make thousands of decisions per hour need something fundamentally different. Here's why batch breaks down for agentic workloads.
There is a specific reason AI agents fail in production, and it is not the model, the prompt, or the framework. It is the data.
Most teams building agents connect them to the same data infrastructure they already have: batch ETL pipelines that load warehouses on a schedule. This works for dashboards. It does not work for agents. The failure is not gradual. It is immediate, measurable, and it gets worse with every decision the agent makes.
The Core Problem: Agents Decide at Machine Speed on Human-Speed Data
A batch ETL pipeline was designed for a specific consumer: a human analyst who opens a dashboard, looks at some charts, thinks about what they mean, and eventually makes a decision. That workflow tolerates stale data because the human is the slowest part of the loop. Whether the dashboard refreshes every hour or every six hours, the analyst probably checks it once or twice a day.
An AI agent is not a human analyst. It does not check a dashboard once a day. It makes decisions continuously, autonomously, at the speed of an API call. A fraud detection agent might evaluate 500 transactions per hour. A pricing agent might adjust prices on 10,000 SKUs per hour. A customer support agent might handle 200 tickets per hour.
Every one of those decisions uses data. And if that data came from a batch load that ran six hours ago, every one of those decisions is potentially wrong.
Specific Failure Modes
Let’s stop talking in abstractions and walk through what actually goes wrong.
The Fraud Agent with Stale Balances
Your fraud detection agent evaluates transactions against account balances, recent transaction history, and risk scores. The batch ETL job runs every six hours. At 2pm, the agent is working with balance data from the 8am load.
Between 8am and 2pm, a customer deposited $5,000 and then initiated a $4,800 wire transfer. The agent’s data shows the pre-deposit balance. It flags the wire transfer as suspicious because it appears to exceed the available balance. The customer’s legitimate transaction gets blocked. They call support. Your team investigates. The data was stale.
Now flip it. A bad actor drains an account at 9am. The agent does not see the zero balance until the 2pm load. For five hours, it approves transactions against a balance that no longer exists.
Both scenarios are caused by the same thing: the agent decided on data that did not reflect reality.
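The first scenario can be sketched in a few lines. This is a toy illustration, not a real fraud rule: a hypothetical check_transfer function that flags any transfer exceeding the balance the agent can see, applied to a snapshot balance vs the live one.

```python
# Hypothetical rule: flag a transfer that exceeds the visible balance.
def check_transfer(amount: float, visible_balance: float) -> str:
    return "flag" if amount > visible_balance else "approve"

# 8am batch snapshot: pre-deposit balance.
snapshot_balance = 1_000.00

# Reality at 2pm: the customer deposited $5,000, then wires $4,800.
live_balance = snapshot_balance + 5_000.00

# Deciding on the snapshot flags a legitimate wire...
assert check_transfer(4_800.00, snapshot_balance) == "flag"
# ...while the same rule on live data approves it.
assert check_transfer(4_800.00, live_balance) == "approve"
```

The rule is identical in both cases; only the freshness of the input changes the outcome.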
The Inventory Agent with a Morning Snapshot
Your inventory management agent reorders stock when levels drop below threshold. It works from an inventory snapshot loaded at 6am.
By 10am, a flash sale has depleted three product lines. The agent does not know. It will not know until the next batch load at noon or 6pm, depending on your schedule. For hours, the agent either fails to reorder (causing stockouts) or continues to accept orders for products that are gone.
Meanwhile, a return shipment arrived at 8am and restored stock for another product line. The agent still sees the pre-return levels and places an unnecessary reorder, tying up capital in excess inventory.
The Support Agent with Yesterday’s Orders
Your customer support agent handles tickets by looking up order history, shipping status, and account details. The data loads nightly.
A customer contacts support at 3pm about an order they placed at 9am. The agent has no record of the order. It does not exist in the agent’s data. The agent either tells the customer it cannot find their order (bad experience) or escalates to a human (defeating the purpose of the agent).
This is not an edge case. It is the default experience for any customer who contacts support about something that happened today.
The Math of Stale Decisions
The cost of stale data scales linearly with decision volume. This is the part that makes batch data particularly dangerous for agents.
A human analyst making 10 decisions a day on six-hour-old data might get one or two wrong. The damage is contained because the volume is low.
An agent making 1,000 decisions per hour on six-hour-old data has a fundamentally different risk profile. Let’s be conservative and assume that only 2% of decisions are materially affected by data staleness. That is 20 wrong decisions per hour, 480 per day, 3,360 per week.
Now consider that most organizations are not deploying one agent. They are deploying dozens. Each one is a multiplier on this error rate.
The formula is simple:
Stale decisions per day = (decisions per hour) x (% of decisions affected by staleness) x (hours of operation)
The staleness window does not appear as its own factor; it drives the impact rate. The longer the data lags, the larger the share of decisions it corrupts.
For a modest deployment:
- 500 decisions/hour
- 6 hours average staleness
- 3% staleness impact rate
- 16 operating hours/day
That is 500 x 0.03 x 16 = 240 incorrect decisions per day, per agent. With 10 agents, that is 2,400 daily decisions made on wrong data.
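The arithmetic above is easy to check in a few lines (the parameter names are just labels for the numbers in the example):

```python
def stale_decisions_per_day(decisions_per_hour: int,
                            impact_rate: float,
                            operating_hours: int) -> float:
    """Decisions made on materially stale data, per agent, per day."""
    return decisions_per_hour * impact_rate * operating_hours

per_agent = stale_decisions_per_day(500, 0.03, 16)
assert per_agent == 240          # per agent, per day
assert per_agent * 10 == 2_400   # a fleet of 10 agents
```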
These are not theoretical numbers. They are what happens when you connect agents to batch infrastructure.
Why Batch Specifically Breaks
The specific mechanisms that make batch data unsuitable for agents go well beyond “it’s just slow.”
Staleness Is Variable and Unpredictable
Batch staleness is not constant. Right after a load completes, data is reasonably fresh. Five minutes before the next load, it is maximally stale. The agent has no way to know where it is in this cycle. It treats all data with equal confidence, whether it is 5 minutes old or 5 hours old.
Streaming CDC delivers every change with a timestamp. The agent (or the infrastructure serving the agent) can know exactly how old each piece of data is and make decisions accordingly.
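With per-event timestamps, a serving layer can gate decisions on freshness. A minimal sketch, with an illustrative 30-second budget and a made-up decide_with_freshness helper (not from any particular platform):

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(seconds=30)  # illustrative freshness budget for this decision

def decide_with_freshness(value, event_time: datetime, now: datetime):
    """Use the value only if it is within the freshness budget; otherwise defer."""
    age = now - event_time
    if age > MAX_AGE:
        return ("defer", age)  # escalate, refetch, or fall back to a safe default
    return ("use", value)

now = datetime(2025, 1, 1, 14, 0, tzinfo=timezone.utc)
fresh = now - timedelta(seconds=5)
stale = now - timedelta(hours=5)

assert decide_with_freshness(1200.0, fresh, now)[0] == "use"
assert decide_with_freshness(1200.0, stale, now)[0] == "defer"
```

Batch data cannot support this pattern because a snapshot carries no per-record timestamp; every row looks equally trustworthy.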
Deletes Disappear
This is an underappreciated problem. When a row is deleted from a source database, batch ETL often misses it entirely. The standard incremental extraction queries for rows with updated_at > last_run_time. A deleted row has no updated_at because it no longer exists.
For agents, missing deletes is dangerous. A cancelled order still appears active. A deactivated account still shows as valid. A removed product still appears in inventory.
CDC captures deletes as explicit events because the database transaction log records them. The agent sees the cancellation, the deactivation, the removal.
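The mechanics are easy to demonstrate with toy in-memory tables. The Debezium-style "op": "d" code is real; everything else here is an illustrative simplification:

```python
# Toy source table and the agent's replica, keyed by order id.
source = {"o1": {"status": "active", "updated_at": 100},
          "o2": {"status": "active", "updated_at": 100}}
replica = dict(source)

# Order o2 is deleted at t=150. Incremental batch extraction at t=200
# asks for rows with updated_at > last_run; the deleted row never matches.
del source["o2"]
last_run = 100
batch_changes = {k: v for k, v in source.items() if v["updated_at"] > last_run}
assert batch_changes == {}  # the delete is invisible to batch extraction
assert "o2" in replica      # replica still shows a cancelled order as active

# CDC emits the delete as an explicit event read from the transaction log.
cdc_event = {"op": "d", "key": "o2"}
if cdc_event["op"] == "d":
    replica.pop(cdc_event["key"], None)
assert "o2" not in replica
```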
Schema Changes Break the Pipeline
When a column is added, renamed, or retyped in the source database, batch ETL jobs break, because they are typically written against a fixed schema. The breakage surfaces only at the next run, when the job either crashes or (worse) silently drops the new column.
For an agent, a missing column might mean a missing feature in its decision logic. If the source added a risk_score column that the agent needs, batch ETL will not propagate it until someone manually updates the ETL job.
CDC-based streaming platforms handle schema evolution automatically, detecting new columns, type changes, and renames from the transaction log and propagating them downstream.
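A simplified sketch of what "propagating a new column" means downstream. The envelopes below are Debezium-shaped but heavily abbreviated; real platforms derive schema changes from transaction log metadata rather than by diffing payloads:

```python
# Simplified Debezium-style change events (illustrative field names).
events = [
    {"op": "u", "after": {"id": 1, "balance": 950.0}},
    # The source later adds a risk_score column; new events simply carry it.
    {"op": "u", "after": {"id": 1, "balance": 950.0, "risk_score": 0.82}},
]

replica_schema: set = set()
row: dict = {}
for e in events:
    new_cols = set(e["after"]) - replica_schema
    replica_schema |= new_cols  # evolve the downstream schema in place
    row.update(e["after"])

assert "risk_score" in replica_schema
assert row["risk_score"] == 0.82
```

A fixed-schema batch job running the same two loads would drop risk_score until someone edited the job by hand.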
Polling Creates Load
Some teams try to make batch data “fresher” by running jobs more frequently. Hourly instead of daily. Every 15 minutes instead of hourly. Every 5 minutes.
This creates a direct problem: each batch run queries the source database. More frequent runs mean more query load on production systems. At a 5-minute interval, the ETL system is querying the production database 288 times per day per table. For a database with 50 tables, that is 14,400 queries per day just for data extraction.
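The polling arithmetic is worth making explicit:

```python
MINUTES_PER_DAY = 24 * 60
interval_minutes = 5
tables = 50

queries_per_table_per_day = MINUTES_PER_DAY // interval_minutes
assert queries_per_table_per_day == 288
assert queries_per_table_per_day * tables == 14_400
```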
CDC reads the transaction log, which is an append-only file the database is already writing. There is no additional query load. The frequency of changes is irrelevant because each change is captured individually as it happens.
What the Agent Architecture Should Look Like
The alternative to batch ETL for agent workloads is straightforward:
Source databases write to their transaction logs as usual. No application changes required.
CDC connectors (for example, Debezium running on a managed platform) read the transaction log and emit change events for every insert, update, and delete. Latency from database commit to event emission is typically under one second.
A streaming backbone (Kafka) provides durable, ordered storage for change events. Events are immutable and replayable if a downstream system needs to rebuild state.
Stream processing (Flink) transforms raw change events into the format agents need: enriching with reference data, filtering irrelevant changes, aggregating related events, computing derived values.
Agent data stores receive the processed events in real time. This might be Redis for low-latency lookups, Elasticsearch for search, a vector database for semantic retrieval, or Snowflake/BigQuery for complex analytical queries. The destination matches the agent’s access pattern.
The result: when a row changes in the source database, the agent’s data store reflects that change within seconds. Not hours. Not minutes. Seconds.
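The whole path can be sketched in memory. This is a shape illustration only: a plain dict stands in for Redis, a function call stands in for the Kafka-to-Flink hop, and the event envelope is a simplified Debezium-style payload.

```python
# Reference data used by the enrichment ("Flink") step.
reference = {"sku-42": {"category": "electronics"}}

# Stands in for Redis: the low-latency store the agent queries.
agent_store: dict = {}

def process(event: dict) -> None:
    """Enrich a change event and upsert it into the agent's store."""
    after = event["after"]
    enriched = {**after, **reference.get(after["sku"], {})}
    agent_store[after["sku"]] = enriched  # keyed for fast point lookups

# A CDC update arrives: on-hand stock for sku-42 dropped to 3.
process({"op": "u", "after": {"sku": "sku-42", "on_hand": 3}})

assert agent_store["sku-42"] == {"sku": "sku-42", "on_hand": 3,
                                 "category": "electronics"}
```

In production the same three steps run continuously, so the store the agent reads is never more than seconds behind the source database.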
When Batch Is Still Fine
Batch ETL is not going away, and agents do not universally require streaming data. Be honest about when batch works:
Historical analysis agents. An agent that analyzes quarterly sales trends does not need sub-second data. Yesterday’s batch load is perfectly adequate.
Report generation agents. Agents that produce weekly summaries or monthly reports work fine with batch data because their output is inherently periodic.
Training and fine-tuning pipelines. Model training workflows consume large historical datasets. Batch is the natural fit.
The dividing line is simple: if the agent makes decisions that depend on the current state of a system, it needs streaming data. If it analyzes historical patterns, batch is fine.
Most production agent deployments have both kinds of workloads. The mistake is using batch for everything because it is what you already have.
The Transition Is Not All-or-Nothing
You do not need to rip out your batch ETL infrastructure to start streaming data to agents. The practical approach:
- Identify the highest-impact agent. Which agent makes the most decisions on the most time-sensitive data?
- Set up CDC on its source databases. A managed platform like Streamkap can have CDC running in hours, not weeks.
- Stream to the agent’s data store. Write the CDC output to whatever store the agent queries.
- Measure the difference. Compare agent accuracy with streaming data vs batch data. The improvement is usually dramatic and obvious.
- Expand to other agents. Once the pattern is proven, apply it to additional agent workloads.
Your batch ETL continues running for everything else. The two systems coexist. Over time, as more workloads move to agents, more data flows through streaming. But you do not have to make that transition all at once.
The Bottom Line
Batch ETL was built for a world where the consumer of data was a person who checks a dashboard a few times a day. AI agents are not that consumer. They decide continuously, autonomously, and at a speed that makes stale data actively dangerous.
The infrastructure gap between what agents need (seconds-old data) and what batch provides (hours-old data) is not a minor inconvenience. It is a fundamental mismatch that produces compounding errors at scale.
If you are building agents that make real-time decisions, streaming CDC is not a nice-to-have optimization. It is a prerequisite for the agents to work correctly.
Ready to give your agents real-time data? Streamkap streams database changes to your agent infrastructure within seconds, replacing stale batch loads with continuous CDC. Start a free trial or learn how Streamkap powers AI agents.