Apache Flink Agents: Event-Driven AI Agents with Streaming Guarantees
Flink Agents bring exactly-once consistency to AI agent orchestration. Learn how event-driven streaming agents differ from batch-oriented frameworks like LangChain and CrewAI.
Why AI Agents Need Streaming
Most AI agent frameworks today are built around a simple pattern: receive a request, run some logic, call an LLM, return a response. This works well for chatbots and one-off tasks. But it falls apart when agents need to react to events in real time, maintain state across millions of concurrent sessions, or guarantee that no event is processed twice.
Consider a fraud detection agent that monitors payment transactions. In a batch-oriented framework, you would poll a database, pull recent transactions, run them through an LLM with tool access, and write results back. This introduces latency, creates gaps between polling intervals, and offers no built-in protection against duplicate processing if the agent crashes mid-batch.
The problem is not the LLM or the tools. The problem is the execution model. Batch-oriented agent frameworks treat data as something to fetch. Streaming frameworks treat data as something that arrives continuously. For agents that need to act on events as they happen, the streaming model is fundamentally better suited.
This is the gap that Apache Flink Agents fills.
What Are Flink Agents?
Flink Agents is an open-source framework for building AI agents that run as Apache Flink streaming jobs. Each agent is a Flink application that consumes events from streaming sources (Kafka topics, CDC streams, HTTP endpoints), processes them through LLM-powered reasoning, calls external tools, and produces output events or side effects.
The framework supports both Python and Java, making it accessible to data engineers and ML engineers alike. Because agents run as Flink jobs, they inherit all of Flink’s operational guarantees: exactly-once processing, automatic checkpointing, state management, and fault recovery.
At its core, a Flink Agent is a stateful stream processor whose processing logic includes LLM calls and tool invocations. This is a significant departure from traditional agent frameworks, where agents typically run as stateless functions invoked on demand.
Architecture Overview
A Flink Agent consists of several components working together:
- Event source - A Flink source connector that reads from Kafka, a CDC stream, an HTTP endpoint, or any supported input. Events trigger agent execution.
- Agent runtime - The core processing logic that receives events, maintains conversation history, calls LLMs, and executes tools. This runs inside Flink’s task managers.
- State backend - Flink’s state management layer (typically RocksDB) stores agent state: conversation history, intermediate results, tool call records, and any domain-specific data the agent needs to remember.
- Tool registry - A set of callable tools the agent can invoke. These can be REST APIs, database queries, other Flink jobs, or custom functions.
- Output sink - Where the agent writes its results. This could be a Kafka topic, a database, an API call, or a downstream Flink pipeline.
The key insight is that the entire agent lifecycle, from event ingestion through LLM reasoning to tool execution and output, happens within Flink’s processing framework. There is no external orchestrator. Flink itself is the orchestrator.
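The lifecycle above can be sketched in plain Python. This is an illustrative mock, not the Flink Agents API: the names (ToolRegistry, AgentRuntime, call_llm) are hypothetical stand-ins for the components just described, and the in-memory list plays the role of the output sink.

```python
# Illustrative mock of the agent lifecycle: source -> runtime -> tools -> sink.
# All names here (ToolRegistry, AgentRuntime, call_llm) are hypothetical
# stand-ins for the components described above, not the Flink Agents API.

def call_llm(prompt: str) -> str:
    """Stub LLM: classify any event mentioning 'refund' as a refund request."""
    return "refund_request" if "refund" in prompt else "general"

class ToolRegistry:
    def __init__(self):
        self._tools = {}
    def register(self, name, fn):
        self._tools[name] = fn
    def call(self, name, *args):
        return self._tools[name](*args)

class AgentRuntime:
    """Receives events, reasons with the LLM, invokes tools, emits output."""
    def __init__(self, tools: ToolRegistry, sink: list):
        self.tools = tools
        self.sink = sink          # output sink (here: an in-memory list)
        self.state = {}           # per-key state; Flink would manage this

    def on_event(self, key: str, event: str):
        history = self.state.setdefault(key, [])
        history.append(event)                      # conversation history
        label = call_llm(event)                    # LLM reasoning step
        result = self.tools.call("lookup", key)    # tool invocation
        self.sink.append((key, label, result))     # emit downstream

tools = ToolRegistry()
tools.register("lookup", lambda key: {"account": key, "status": "active"})
sink = []
runtime = AgentRuntime(tools, sink)

# Event source (here: a static list; in Flink, a Kafka or CDC stream).
for key, event in [("u1", "please refund my order"), ("u2", "hello")]:
    runtime.on_event(key, event)

print(sink)
```

In a real deployment, Flink's source connectors replace the static list, and the per-key state dictionary becomes Flink-managed keyed state that is checkpointed automatically.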
Workflow Agents vs ReAct Agents
Flink Agents supports two distinct agent types, each suited to different problem shapes.
Workflow Agents
Workflow Agents follow a predefined sequence of steps. You define the order of operations at design time: read event, extract fields, call LLM for classification, invoke tool A, then tool B, write result. The agent executes these steps deterministically for every event.
This pattern works well for structured processes where the logic is known in advance. Examples include document processing pipelines, data enrichment workflows, and compliance checks. The deterministic nature of Workflow Agents makes them easier to test, debug, and audit.
Event -> Extract -> Classify (LLM) -> Enrich (Tool) -> Validate (Tool) -> Output
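The pipeline above can be sketched as a fixed chain of functions. The step functions and the stubbed LLM classifier below are illustrative, not the Flink Agents API; the point is that every event takes the same deterministic path.

```python
# Sketch of a Workflow Agent: a fixed, deterministic step sequence per event.
# The step functions and the stubbed classify_llm are illustrative only.

def extract(event: dict) -> dict:
    return {"id": event["id"], "text": event["text"].strip().lower()}

def classify_llm(fields: dict) -> str:
    # Stub for the LLM classification call.
    return "invoice" if "invoice" in fields["text"] else "other"

def enrich(fields: dict, label: str) -> dict:
    # Stub for tool A, e.g. a metadata lookup service.
    return {**fields, "label": label, "source": "enrichment-service"}

def validate(record: dict) -> dict:
    # Stub for tool B: a simple schema check.
    assert {"id", "text", "label"} <= record.keys()
    return record

def workflow_agent(event: dict) -> dict:
    """Runs the same steps, in the same order, for every event."""
    fields = extract(event)
    label = classify_llm(fields)
    record = enrich(fields, label)
    return validate(record)

print(workflow_agent({"id": 1, "text": "  Invoice #42 attached "}))
```

Because the sequence is fixed at design time, each step can be unit-tested in isolation, which is exactly the testability advantage noted above.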
ReAct Agents
ReAct Agents (Reasoning + Acting) are autonomous. Given an event and a set of available tools, the agent reasons about what to do next, executes a tool, observes the result, and decides the next step. The LLM drives the decision-making loop, and the agent continues until it reaches a conclusion or hits a configured step limit.
This pattern suits open-ended tasks where the right sequence of actions depends on intermediate results. A customer service agent that needs to look up order history, check inventory, and decide whether to issue a refund based on what it finds is a natural fit for the ReAct pattern.
Event -> Reason (LLM) -> Act (Tool) -> Observe -> Reason -> Act -> ... -> Output
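The loop above can be sketched with a scripted stand-in for the LLM. The tool names and the decision script below are illustrative, not the Flink Agents API; they show the shape of the loop: reason, act, observe, repeat, bounded by a step limit.

```python
# Sketch of the ReAct loop: the (stubbed) LLM picks the next action, the agent
# executes the tool, feeds the observation back, and repeats until the LLM
# finishes or a configured step limit is hit. Tool names are illustrative.

TOOLS = {
    "order_history": lambda ctx: ["order-17: delivered late"],
    "check_policy": lambda ctx: "late delivery -> refund allowed",
}

def llm_decide(observations):
    """Scripted stand-in for the LLM's reasoning step."""
    if not observations:
        return ("act", "order_history")
    if len(observations) == 1:
        return ("act", "check_policy")
    return ("finish", "issue refund")

def react_agent(event: str, max_steps: int = 5) -> str:
    observations = []
    for _ in range(max_steps):                  # configured step limit
        decision, arg = llm_decide(observations)
        if decision == "finish":
            return arg                          # conclusion reached
        observations.append(TOOLS[arg](event))  # act, then observe
    return "escalate: step limit reached"

print(react_agent("customer complains about late order"))  # prints "issue refund"
```

The step limit matters in production: without it, a looping LLM could invoke tools indefinitely on a single event.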
Both agent types run as Flink jobs and benefit from the same streaming guarantees. The difference is in how processing logic is structured, not in how events are handled.
Exactly-Once Guarantees for Agents
This is where Flink Agents diverge most sharply from other agent frameworks. When a Flink Agent processes an event, the entire operation, including state updates, tool call records, and output messages, is covered by Flink’s exactly-once semantics.
What this means in practice:
- No duplicate processing - If a Flink Agent crashes and restarts, it recovers from its last checkpoint. Events that were already processed and checkpointed are not reprocessed, and recorded tool calls are not replayed. As with any Flink job, side effects issued between checkpoints should still be idempotent or transactional.
- Consistent state - The agent’s conversation history, intermediate results, and domain state are always consistent. There is no window where state can diverge from what was actually processed.
- Atomic output - When using Flink’s two-phase commit sinks (Kafka, for example), output messages are written exactly once, even across failures.
For AI agents that make real-world decisions, such as approving transactions, triggering alerts, or modifying customer records, these guarantees are not optional. A fraud detection agent that processes the same transaction twice could block a legitimate card. A customer service agent that sends duplicate refunds creates financial exposure. Exactly-once processing eliminates these failure modes.
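Turning these guarantees on is a matter of Flink configuration rather than application code. A minimal PyFlink sketch (the interval and timeout values are placeholder choices, not recommendations):

```python
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()

# Snapshot all agent state every 60 seconds with exactly-once semantics.
env.enable_checkpointing(60_000, CheckpointingMode.EXACTLY_ONCE)

# Placeholder tuning: leave room between checkpoints and bound their duration.
env.get_checkpoint_config().set_min_pause_between_checkpoints(30_000)
env.get_checkpoint_config().set_checkpoint_timeout(120_000)
```

For atomic output as described above, the job's sink must also participate, for example a Kafka sink configured with an exactly-once delivery guarantee and a transactional ID prefix.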
State Management and Checkpointing
Flink Agents store their state in Flink’s managed state backends. This includes:
- Conversation history - The sequence of messages between the agent, the LLM, and tools. This allows agents to maintain context across multiple events from the same entity (user, device, account).
- Tool call records - Which tools were called, with what parameters, and what they returned. Used for auditing and for recovery after failures.
- Domain state - Any application-specific data the agent accumulates. A monitoring agent might track the last 100 readings from a sensor. A customer service agent might store open ticket status.
Flink periodically snapshots this state to durable storage (S3, HDFS, GCS) through its checkpointing mechanism. If the agent fails, it restarts from the most recent checkpoint with all state intact. The agent does not need custom recovery logic. Flink handles it automatically.
This is a major advantage over agent frameworks that store state in external databases or in-memory caches. With Flink, state is co-located with processing, which eliminates network round-trips for state access and guarantees consistency between state and processing progress.
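The per-key state pattern can be illustrated in plain Python. This is a simulation for clarity only: in an actual Flink job, the dictionary-of-deques below would be Flink-managed keyed state inside a keyed operator, checkpointed to durable storage automatically.

```python
# Plain-Python analogue of Flink keyed state, for illustration only.
# In a real Flink job, `self.state` would be managed state that Flink
# scopes to the current key and checkpoints to durable storage.
from collections import defaultdict, deque

class MonitoringAgentState:
    """Keeps the last N readings per sensor, as in the monitoring example above."""
    def __init__(self, window: int = 100):
        self.state = defaultdict(lambda: deque(maxlen=window))  # per-key state

    def on_reading(self, sensor_id: str, value: float) -> float:
        readings = self.state[sensor_id]       # scoped to the current key
        readings.append(value)
        return sum(readings) / len(readings)   # rolling mean over <= N readings

agent = MonitoringAgentState(window=3)
for v in [1.0, 2.0, 3.0, 10.0]:
    avg = agent.on_reading("sensor-a", v)
print(round(avg, 2))  # rolling mean of the last 3 readings
```

The deque's bounded size mirrors the "last 100 readings" example: state stays compact per key no matter how long the stream runs.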
Comparison with LangChain, CrewAI, and AutoGen
| Capability | Flink Agents | LangChain | CrewAI | AutoGen |
|---|---|---|---|---|
| Execution model | Continuous streaming | Request-response | Request-response | Request-response |
| Processing guarantees | Exactly-once | None built-in | None built-in | None built-in |
| State management | Built-in (Flink state) | External (Redis, DB) | External | In-memory |
| Fault recovery | Automatic (checkpoints) | Manual | Manual | Manual |
| Scaling | Flink parallelism | Manual horizontal | Manual | Manual |
| Event sources | Kafka, CDC, HTTP, etc. | Custom polling | Custom polling | Custom polling |
| Language support | Python, Java | Python | Python | Python |
| Best for | Streaming workloads | Chatbots, RAG | Multi-agent teams | Conversational agents |
LangChain, CrewAI, and AutoGen are excellent for interactive applications where a user sends a message and waits for a response. They are not designed for workloads where millions of events arrive per second and each one needs to be processed reliably. Flink Agents fills this gap.
Use Cases
Real-Time Fraud Detection Agents
A Flink Agent consumes payment transaction events from Kafka, maintains per-card state (velocity, location history, spending patterns), and uses an LLM to evaluate suspicious patterns that rule-based systems miss. The agent can call external tools to check blocklists, verify merchant details, or trigger card holds. Exactly-once processing ensures every transaction is evaluated precisely once.
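The per-card velocity check can be sketched as follows. The threshold, window, and the stubbed LLM review are illustrative assumptions, and the per-card dictionary stands in for Flink-managed keyed state.

```python
# Sketch of per-card velocity tracking for the fraud use case. The threshold,
# window, and stubbed LLM review are illustrative assumptions only.
from collections import defaultdict

class FraudAgent:
    def __init__(self, velocity_limit: int = 3, window_s: int = 60):
        self.velocity_limit = velocity_limit
        self.window_s = window_s
        self.events = defaultdict(list)  # per-card state; Flink-managed in practice

    def llm_review(self, card, txns):
        # Stub: a real agent would hand the pattern to an LLM with tool access
        # (blocklist checks, merchant verification) before deciding.
        return "hold" if len(txns) > self.velocity_limit else "allow"

    def on_txn(self, card: str, ts: int, amount: float) -> str:
        window = [t for t in self.events[card] if ts - t[0] <= self.window_s]
        window.append((ts, amount))
        self.events[card] = window       # keep only the recent window
        return self.llm_review(card, window)

agent = FraudAgent()
decisions = [agent.on_txn("card-1", ts, 50.0) for ts in [0, 10, 20, 30]]
print(decisions)  # the fourth transaction in the window triggers a hold
```

Exactly-once processing is what makes the velocity count trustworthy: a replayed transaction would otherwise inflate the window and hold a legitimate card.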
Streaming Customer Service Agents
Events from support channels (chat messages, emails, status changes) flow into a Flink Agent that maintains per-ticket conversation history. The agent classifies issues, looks up relevant account information through tool calls, and either resolves the issue automatically or routes it with full context to a human agent. State checkpointing means no conversation context is lost during deployments or failures.
IoT Monitoring and Alerting Agents
Sensor readings from thousands of devices stream into Flink. An agent per device type maintains rolling statistics, detects anomalies using LLM-based reasoning (for patterns too complex for static thresholds), and triggers alerts or automated responses through tool integrations. Flink’s windowing capabilities allow the agent to reason over time-based aggregations natively.
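The escalation pattern, cheap statistics first, LLM reasoning only on deviation, can be sketched with a rolling z-score. The threshold and warm-up length are illustrative assumptions; in Flink, the history would live in windowed keyed state.

```python
# Sketch of the IoT pattern: rolling statistics per device, escalating to
# LLM reasoning only when a reading deviates strongly. The z-score threshold
# and warm-up length are illustrative assumptions.
from collections import deque
from statistics import mean, pstdev

def is_anomaly(history, value, z_threshold=3.0):
    if len(history) < 5:
        return False                     # not enough data to judge yet
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold

history = deque(maxlen=100)              # rolling window of recent readings
alerts = []
for reading in [20.0, 20.5, 19.8, 20.2, 20.1, 20.0, 42.0]:
    if is_anomaly(history, reading):
        alerts.append(reading)           # would trigger LLM reasoning / tools
    history.append(reading)

print(alerts)  # only the out-of-pattern reading escalates
```

Keeping the statistical gate cheap matters at scale: with thousands of devices, only the rare anomalous reading should incur an LLM call.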
Getting Started with Flink Agents
Running Flink Agents in production requires operating Flink infrastructure: deploying clusters, configuring state backends, managing checkpoints, tuning parallelism, and handling upgrades. For teams already running Flink, adding agent workloads is a natural extension. For teams new to Flink, the operational overhead can be significant.
This is where Streamkap comes in. Streamkap’s managed streaming platform already runs Flink for real-time CDC and stream processing workloads. Support for Flink Agents is coming soon, which means teams will be able to deploy event-driven AI agents without provisioning or managing Flink infrastructure.
With Streamkap, you get:
- Managed Flink runtime - No cluster management, no JVM tuning, no checkpoint configuration. Deploy your agent code and Streamkap handles the rest.
- Native CDC integration - Connect agents directly to database change streams from PostgreSQL, MySQL, MongoDB, and DynamoDB. Your agents react to data changes as they happen.
- Built-in monitoring - Track agent performance, LLM call latency, tool execution rates, and processing lag through Streamkap’s observability layer.
- Scaling on demand - Streamkap adjusts Flink parallelism based on event volume. Your agents scale up during traffic spikes and scale down during quiet periods.
The combination of Flink Agents’ processing model and Streamkap’s managed infrastructure makes production-grade streaming AI agents accessible to teams of any size. Instead of spending months building and maintaining Flink clusters, you can focus on the agent logic that drives business value.
To learn more about running Flink workloads on Streamkap, visit our Flink processing documentation or start a free trial.