Flink SQL MATCH_RECOGNIZE: Complex Event Processing with SQL
Learn how to detect complex event patterns in streaming data using Flink SQL's MATCH_RECOGNIZE clause. Build fraud detection, anomaly detection, and sequence matching with SQL.
Streaming data is everywhere - transactions flowing through payment systems, sensor readings from factory floors, clickstreams across web applications. Individually, these events are unremarkable. But when specific sequences of events occur in a particular order, they tell a story: a fraud attempt in progress, a machine about to fail, or a user on the verge of converting.
Complex Event Processing (CEP) is the discipline of detecting these meaningful patterns within continuous event streams. Traditionally, building CEP required custom application code using specialized libraries. Flink SQL’s MATCH_RECOGNIZE clause changes that equation entirely, bringing regex-like pattern matching directly into SQL. If your team already knows SQL, you can now write sophisticated pattern detection logic without a single line of Java or Scala.
MATCH_RECOGNIZE Syntax Breakdown
The MATCH_RECOGNIZE clause is appended to a standard FROM clause and contains several sub-clauses that work together. Here is the full structure:
SELECT *
FROM MyStream
MATCH_RECOGNIZE (
    PARTITION BY user_id
    ORDER BY event_time
    MEASURES
        A.event_time AS start_time,
        LAST(B.event_time) AS end_time,
        A.price AS start_price,
        LAST(B.price) AS end_price
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (A B+ C)
    DEFINE
        A AS A.price > 100,
        B AS B.price > LAST(B.price, 1) OR B.price > A.price,
        C AS C.price < LAST(B.price)
)
Each sub-clause plays a distinct role:
- PARTITION BY divides the stream into independent groups. Pattern matching operates within each partition separately, just like a window function’s PARTITION BY. In the example above, patterns are detected per user.
- ORDER BY establishes the logical ordering of rows within each partition. For event streams, this is almost always an event timestamp. Flink requires this to know which row comes “before” or “after” another.
- MEASURES defines the output columns. You extract values from matched rows using the pattern variable names (A, B, C) and aggregate or navigation functions. This is what you actually see in your query results.
- PATTERN specifies the sequence of row pattern variables using regex-like syntax. (A B+ C) means: one row matching variable A, followed by one or more rows matching variable B, followed by one row matching variable C.
- DEFINE provides the boolean conditions that determine whether a given row qualifies as a particular pattern variable. Any variable not defined in DEFINE matches every row (acting as a wildcard).
This separation of concerns - what to look for (PATTERN), how to identify each step (DEFINE), and what to output (MEASURES) - makes MATCH_RECOGNIZE both powerful and readable.
Pattern Variables and Quantifiers
Pattern variables are the building blocks of your match expression. Each variable name (A, B, C, or any identifier you choose) represents a category of row that you define with a boolean condition.
Quantifiers
Quantifiers control how many consecutive rows a variable can match:
| Quantifier | Meaning | Example |
|---|---|---|
| (none) | Exactly one row | A matches one row |
| * | Zero or more rows | A* matches zero or more consecutive rows |
| + | One or more rows | A+ matches one or more consecutive rows |
| ? | Zero or one row | A? matches at most one row |
| {n} | Exactly n rows | A{3} matches exactly three rows |
| {n,m} | Between n and m rows | A{2,5} matches two to five rows |
| {n,} | At least n rows | A{3,} matches three or more rows |
| {,m} | At most m rows | A{,4} matches up to four rows |
By default, quantifiers are greedy - they consume as many rows as possible while still allowing the overall pattern to match. Appending a ? to any quantifier makes it reluctant (lazy), consuming as few rows as possible. For example, A+? matches the minimum number of rows that still satisfies the pattern.
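As a sketch of the difference, the query below uses a reluctant B+? so that C claims the first qualifying row instead of the last one. The stream and column names here are illustrative, not from a real schema:

```sql
SELECT *
FROM price_stream
MATCH_RECOGNIZE (
    ORDER BY event_time
    MEASURES
        LAST(B.price) AS last_b_price
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    -- Rows above 150 satisfy both B and C. A greedy B+ would absorb
    -- them and leave only the final row for C; the reluctant B+? stops
    -- at the first row where C can match instead.
    PATTERN (A B+? C)
    DEFINE
        A AS A.price <= 100,
        B AS B.price > 100,
        C AS C.price > 150
)
```

For instance, given prices 90, 110, 160, 170, the reluctant pattern maps C to the 160 row, while the greedy version would map C to the 170 row.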
Combining Variables
Variables can be combined in sequences, alternations, and groups:
-- Sequence: A then B then C
PATTERN (A B C)
-- Alternation: A followed by either B or C
PATTERN (A (B | C))
-- Grouped quantifier: A then one or more (B followed by C) pairs
PATTERN (A (B C)+)
This composability lets you express arbitrarily complex event sequences using familiar regex concepts.
Row Pattern Navigation Functions
Inside both the DEFINE and MEASURES clauses, you can reference values from other rows in the current match using navigation functions:
- LAST(variable.column, n) - Returns the value of a column from the nth-to-last row mapped to the variable. LAST(B.price, 0) is the current B row’s price; LAST(B.price, 1) is the previous B row’s price.
- FIRST(variable.column, n) - Returns the value from the nth row mapped to the variable, counting from the first match.
- PREV(column, n) - Returns the value from the row that is n positions before the current row in the partition ordering, regardless of pattern variable assignment.
- NEXT(column, n) - Returns the value from the row n positions after the current row. Note that NEXT is only available in certain contexts and not universally supported.
These functions are essential for writing conditions that compare the current row to previous rows. For instance, detecting a price increase requires comparing the current price to the previous price:
DEFINE
    UP AS UP.price > PREV(UP.price, 1)
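The same navigation functions work in MEASURES to summarize the matched rows. A small fragment (column names illustrative) that reports the net change across every row mapped to B:

```sql
MEASURES
    FIRST(B.price) AS first_b_price,              -- earliest row mapped to B
    LAST(B.price)  AS last_b_price,               -- latest row mapped to B
    LAST(B.price) - FIRST(B.price) AS net_change  -- total move across the B run
```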
Output Modes
The SQL standard defines two output modes for MATCH_RECOGNIZE that control how many rows are emitted for each successful match:
ONE ROW PER MATCH
Produces a single summary row per match. This is the most common mode and works well when you want aggregated results - for example, the start time, end time, and total value of a matched pattern.
MEASURES
    FIRST(A.event_time) AS pattern_start,
    LAST(C.event_time) AS pattern_end,
    COUNT(B.amount) AS num_events
ONE ROW PER MATCH
ALL ROWS PER MATCH
Produces one output row for every input row that participated in the match. This is useful for debugging or when you need to tag each individual event with its role in the matched pattern. Note, however, that Flink’s MATCH_RECOGNIZE implementation currently supports only ONE ROW PER MATCH; ALL ROWS PER MATCH is part of the SQL standard but is listed among Flink’s known limitations, so check your version’s documentation before relying on it.
MEASURES
    CLASSIFIER() AS pattern_variable,
    MATCH_NUMBER() AS match_id
ALL ROWS PER MATCH
The CLASSIFIER() function returns the name of the pattern variable that the current row was mapped to (e.g., ‘A’, ‘B’, or ‘C’), and MATCH_NUMBER() returns a sequential identifier for each match.
After Match Strategy
Once a pattern is matched, Flink needs to know where to resume scanning for the next match. The AFTER MATCH clause controls this behavior:
- SKIP PAST LAST ROW (default) - Resumes scanning from the row after the last row in the completed match. Matched rows are never part of another match.
- SKIP TO NEXT ROW - Resumes scanning from the row immediately after the first row of the completed match. This allows overlapping matches, which is important when patterns can share starting events.
- SKIP TO FIRST variable - Resumes from the first row mapped to the specified variable.
- SKIP TO LAST variable - Resumes from the last row mapped to the specified variable.
Choosing the right strategy depends on your use case. For non-overlapping alerts (like fraud notifications), SKIP PAST LAST ROW prevents duplicate alerts. For continuous monitoring where every potential match matters (like price pattern scanning), SKIP TO NEXT ROW ensures no pattern is missed.
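To make the contrast concrete, here is a hedged fragment (variable and column names are illustrative) for a simple rising-price pattern:

```sql
-- With SKIP PAST LAST ROW, a sustained rise yields a single long match.
-- With SKIP TO NEXT ROW (below), shorter overlapping rises contained
-- inside the same run are also reported, since scanning resumes one
-- row after each match's starting row.
AFTER MATCH SKIP TO NEXT ROW
PATTERN (UP{2,})
DEFINE
    UP AS UP.price > PREV(UP.price, 1)
```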
Practical Examples
Detecting Price V-Shaped Patterns
A classic financial pattern: price drops for several consecutive ticks, then recovers. This V-shape can signal a buying opportunity or a market correction.
SELECT *
FROM stock_ticks
MATCH_RECOGNIZE (
    PARTITION BY symbol
    ORDER BY tick_time
    MEASURES
        START_ROW.tick_time AS start_time,
        LAST(DOWN.tick_time) AS bottom_time,
        LAST(UP.tick_time) AS recovery_time,
        START_ROW.price AS start_price,
        LAST(DOWN.price) AS bottom_price,
        LAST(UP.price) AS recovery_price
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (START_ROW DOWN+ UP+)
    DEFINE
        DOWN AS DOWN.price < PREV(DOWN.price, 1),
        UP AS UP.price > PREV(UP.price, 1)
) AS matched;
This query partitions by stock symbol, finds sequences where prices decline for one or more ticks and then rise for one or more ticks, and outputs the timing and prices at each phase of the V-shape.
Fraud Detection: Rapid Small Transactions Followed by Large Withdrawal
Financial fraud often follows a recognizable pattern: several small “test” transactions in quick succession, followed by a large withdrawal. MATCH_RECOGNIZE makes this straightforward to express:
SELECT *
FROM transactions
MATCH_RECOGNIZE (
    PARTITION BY account_id
    ORDER BY txn_time
    MEASURES
        FIRST(SMALL.txn_time) AS first_small_txn,
        LAST(SMALL.txn_time) AS last_small_txn,
        BIG.txn_time AS large_txn_time,
        BIG.amount AS large_amount,
        COUNT(SMALL.amount) AS num_small_txns
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (SMALL{3,} BIG)
    DEFINE
        SMALL AS SMALL.amount < 5.00
            AND (PREV(SMALL.txn_time, 1) IS NULL
                 OR SMALL.txn_time < PREV(SMALL.txn_time, 1) + INTERVAL '10' MINUTE),
        BIG AS BIG.amount > 500.00
            AND BIG.txn_time < LAST(SMALL.txn_time) + INTERVAL '30' MINUTE
) AS fraud_alert;
The pattern requires at least three small transactions (under $5) occurring within 10 minutes of each other, followed by a large transaction (over $500) within 30 minutes of the last small one. When this pattern matches, the query emits an alert with all the relevant details.
User Funnel Analysis: Signup, Activation, Purchase
Product teams often need to identify users who complete a specific journey through their application. MATCH_RECOGNIZE can detect this funnel in real time:
SELECT *
FROM user_events
MATCH_RECOGNIZE (
    PARTITION BY user_id
    ORDER BY event_time
    MEASURES
        SIGNUP.event_time AS signup_time,
        ACTIVATE.event_time AS activation_time,
        PURCHASE.event_time AS purchase_time,
        PURCHASE.event_time - SIGNUP.event_time AS time_to_convert
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (SIGNUP ANY_EVENT* ACTIVATE ANY_EVENT* PURCHASE)
    DEFINE
        SIGNUP AS SIGNUP.event_type = 'signup',
        ACTIVATE AS ACTIVATE.event_type = 'activation',
        PURCHASE AS PURCHASE.event_type = 'purchase',
        ANY_EVENT AS ANY_EVENT.event_type NOT IN ('signup', 'activation', 'purchase')
) AS funnel;
The ANY_EVENT* variable matches any number of intermediate events between the funnel stages, provided those events are not themselves funnel steps. This captures the full conversion journey regardless of how many other actions the user takes between steps.
IoT Sensor Alert: Temperature Rising Above Threshold
Manufacturing and infrastructure monitoring often requires detecting sustained anomalies - not just a single spike, but a pattern that indicates a real problem:
SELECT *
FROM sensor_readings
MATCH_RECOGNIZE (
    PARTITION BY sensor_id
    ORDER BY reading_time
    MEASURES
        FIRST(HOT.reading_time) AS alert_start,
        LAST(HOT.reading_time) AS alert_end,
        COUNT(HOT.temperature) AS consecutive_readings,
        AVG(HOT.temperature) AS avg_temperature
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (NORMAL HOT{5,})
    DEFINE
        NORMAL AS NORMAL.temperature <= 80.0,
        HOT AS HOT.temperature > 80.0
) AS overheat_alert;
This detects five or more consecutive readings above 80 degrees following a normal reading. The normal reading anchors the start of the pattern, ensuring that the alert fires at the transition from safe to dangerous conditions rather than repeatedly within an ongoing overheat event.
Performance Considerations
MATCH_RECOGNIZE maintains state for every active partial match. In high-throughput streams, this state can grow significantly, so understanding the performance implications is important.
Pattern complexity drives state size. Patterns with unbounded quantifiers (A+, B*) on high-cardinality partitions can accumulate large amounts of state as Flink tracks many open partial matches simultaneously. Where possible, use bounded quantifiers (A{1,10}) to cap the maximum match length.
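As a sketch, the overheating pattern from the previous section could cap an otherwise unbounded run, keeping the longest possible match (and therefore the state held per partial match) finite. The limit of 50 readings here is an arbitrary illustration:

```sql
-- HOT{5,} could track an unbounded run of hot readings;
-- HOT{5,50} bounds any single match at 50 rows.
PATTERN (NORMAL HOT{5,50})
```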
Timeouts prevent unbounded state growth. If a pattern never completes, its partial match state lingers indefinitely. Recent Flink versions support a time constraint on the whole pattern via the WITHIN clause, for example PATTERN (A B+ C) WITHIN INTERVAL '1' HOUR, which discards partial matches that exceed the interval. On versions without WITHIN, encode temporal constraints directly in your DEFINE conditions: a condition like B.event_time < A.event_time + INTERVAL '1' HOUR ensures that a partial match is abandoned once the time window elapses.
Partition cardinality matters. Each distinct partition key maintains its own pattern matching state. Partitioning by a high-cardinality field (like individual user IDs across millions of users) means millions of independent state machines. Consider whether a coarser partitioning scheme works for your use case.
After match strategy affects throughput. SKIP TO NEXT ROW creates more potential matches than SKIP PAST LAST ROW because it evaluates overlapping windows. For alerting use cases where you only need the first match, SKIP PAST LAST ROW is more efficient.
When running MATCH_RECOGNIZE queries at scale, a managed Flink platform like Streamkap handles the underlying state management, checkpointing, and resource allocation automatically, so you can focus on writing correct patterns rather than tuning infrastructure.
MATCH_RECOGNIZE vs Window Functions
Both MATCH_RECOGNIZE and window functions operate on ordered sequences of rows, but they solve fundamentally different problems.
Window functions compute aggregates over a fixed frame of rows. They answer questions like “What is the moving average of the last 10 prices?” or “What is the maximum temperature in the last hour?” The frame is defined by size or time, and every row in the window contributes to the result.
MATCH_RECOGNIZE detects variable-length sequences that satisfy structural constraints. It answers questions like “Did the price drop three times in a row and then spike?” The number of rows in a match is not predetermined - it depends on the data. The pattern may span 4 rows or 40 rows depending on how the events unfold.
Here is a practical comparison:
| Criterion | Window Functions | MATCH_RECOGNIZE |
|---|---|---|
| Frame definition | Fixed size or time interval | Variable, determined by pattern |
| Row selection | All rows in frame | Only rows matching DEFINE conditions |
| Output | One result per input row | One result per match (or per matched row) |
| Ordering dependency | Aggregates within frame | Detects sequences across rows |
| Use case | Running averages, rankings, lag/lead | Fraud patterns, session sequences, trend detection |
If your question is about aggregating within a known window, use window functions. If your question is about detecting a specific sequence of conditions whose length is unknown in advance, MATCH_RECOGNIZE is the right tool.
In many real-world pipelines, you will use both. A window function might compute a rolling average, and MATCH_RECOGNIZE might detect when that rolling average follows a specific trajectory - combining the strengths of both approaches within Streamkap’s managed Flink environment.
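A sketch of that combination is shown below. The table and column names are hypothetical, and note that in Flink the MATCH_RECOGNIZE ORDER BY column must remain a time attribute, which requires some care when it flows through an inner query:

```sql
-- Stage 1: a rolling 10-row average per symbol via an OVER window.
-- Stage 2: MATCH_RECOGNIZE detects three consecutive drops in that average.
SELECT *
FROM (
    SELECT
        symbol,
        tick_time,
        AVG(price) OVER (
            PARTITION BY symbol
            ORDER BY tick_time
            ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
        ) AS avg_price
    FROM stock_ticks
)
MATCH_RECOGNIZE (
    PARTITION BY symbol
    ORDER BY tick_time
    MEASURES
        FIRST(DOWN.tick_time) AS drop_start,
        LAST(DOWN.avg_price) AS lowest_avg
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (DOWN{3})
    DEFINE
        DOWN AS DOWN.avg_price < PREV(DOWN.avg_price, 1)
) AS avg_drop;
```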
Wrapping Up
MATCH_RECOGNIZE brings the full power of Complex Event Processing to teams that think in SQL. Instead of writing imperative pattern detection code in Java or Scala, you declare the pattern you are looking for, define what each step means, and let Flink’s runtime handle the matching, state management, and output. The result is pattern detection logic that is readable, maintainable, and testable - three qualities that matter when that logic is standing between your business and fraud, equipment failure, or missed revenue.