Streaming Materialized Views: Always-Fresh Query Results
How to build materialized views that update in real time using CDC and stream processing. Eliminate stale data without periodic batch refreshes.
Every data team eventually hits the same wall: queries that are too slow to run on demand, so you pre-compute the results into a materialized view. The view is fast to read, but it goes stale the moment the underlying data changes. You set up a cron job to refresh it every hour, or every fifteen minutes if you are feeling aggressive. And then someone asks why the dashboard does not match the numbers they just saw in the source system.
This is the staleness problem, and it is built into how traditional materialized views work. Streaming materialized views solve it by flipping the refresh model from pull to push. Instead of periodically re-running a query against the full dataset, you process each change as it happens and incrementally update the pre-computed result. The view is always current, and there is no refresh job to manage.
The Problem with Periodic Refreshes
A standard materialized view in PostgreSQL or any relational database is a snapshot. You define a query, run REFRESH MATERIALIZED VIEW, and the database stores the result set. Reads against the view are fast because the data is already computed. But the view is frozen at the moment of the last refresh.
This creates several practical problems:
- Staleness windows. If you refresh every hour, your data can be nearly an hour old. For operational dashboards, this is often unacceptable.
- Refresh cost. Each refresh re-executes the full query. For complex joins across large tables, this can take minutes or longer - consuming significant database resources.
- Concurrency locks. In PostgreSQL, REFRESH MATERIALIZED VIEW takes an exclusive lock by default. REFRESH MATERIALIZED VIEW CONCURRENTLY avoids full locks but requires a unique index and still consumes resources.
- Scheduling complexity. You need a cron job or orchestrator to trigger refreshes. When it fails silently, stale data persists until someone notices.
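For concreteness, the batch pattern being described looks roughly like this in PostgreSQL (the orders schema here is illustrative):

```sql
-- Snapshot view: computed once, frozen until the next refresh.
CREATE MATERIALIZED VIEW revenue_by_category AS
SELECT product_category,
       SUM(amount) AS total_revenue,
       COUNT(*)    AS order_count
FROM orders
WHERE order_status = 'completed'
GROUP BY product_category;

-- CONCURRENTLY requires a unique index on the view.
CREATE UNIQUE INDEX ON revenue_by_category (product_category);

-- A scheduler then re-runs the full query, e.g. every 15 minutes.
REFRESH MATERIALIZED VIEW CONCURRENTLY revenue_by_category;
```

Every invocation of that final statement re-executes the full aggregation, no matter how few rows changed since the last run.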
The pattern works fine when near-real-time freshness is not required. But as more teams build operational dashboards, real-time APIs, and event-driven microservices, the gap between “last refreshed” and “right now” becomes a real problem.
What Is a Streaming Materialized View?
A streaming materialized view is a pre-computed query result that updates incrementally and continuously as the source data changes. Rather than re-running the full query on a timer, a stream processor watches the change stream from the source database and applies each insert, update, or delete to the materialized result.
The core idea: treat the materialized view not as a periodic snapshot, but as the output of a continuous query over a change stream.
This requires three things:
- A change feed from the source. This is where change data capture (CDC) comes in. CDC reads the database’s transaction log and emits a stream of row-level changes.
- A stream processor to run the continuous query. Apache Flink is the most common choice here, particularly Flink SQL, which lets you express the view logic in standard SQL.
- A destination store for the results. The continuously updated view needs to live somewhere that your application can read it - PostgreSQL, Redis, Elasticsearch, or any other store that fits the access pattern.
CDC as the Change Feed
Change data capture is what makes the entire pattern possible. Without a reliable, ordered stream of every change from the source database, there is no way to keep a downstream view incrementally updated.
CDC works by reading the database’s write-ahead log (WAL in PostgreSQL, binlog in MySQL, oplog in MongoDB). Each committed transaction produces a set of change events - one per affected row - containing the operation type (insert, update, delete) and the row data before and after the change.
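Concretely, a simplified Debezium-style change event for an updated row carries the operation type plus before and after images. The values below are illustrative, and real events include additional source metadata:

```json
{
  "op": "u",
  "before": { "order_id": 1001, "order_status": "pending",   "amount": 49.99 },
  "after":  { "order_id": 1001, "order_status": "completed", "amount": 49.99 },
  "source": { "table": "orders" },
  "ts_ms": 1718000000000
}
```

Inserts carry only an after image, deletes only a before image, which is exactly what a downstream processor needs to add or subtract a row's contribution.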
Streamkap handles the CDC pipeline end to end: connecting to your source database, reading the transaction log, and delivering change events to downstream consumers. This removes the need to self-manage Debezium connectors and Kafka infrastructure just to get a change stream flowing.
The change events from CDC form the input to your continuous query. Every time a row changes in the source, a corresponding event flows through the stream processor, which updates the materialized view. The latency from source change to view update is typically measured in seconds.
Implementing with Flink SQL Continuous Queries
Flink SQL is where the streaming materialized view pattern becomes practical. Flink’s continuous query model means that a SQL query submitted to Flink does not run once and return results - it runs forever, processing each new event as it arrives and emitting updated results.
Here is a concrete example. Suppose you have an orders table in PostgreSQL and you want a materialized view that shows revenue per product category, updated in real time.
First, define the source table in Flink SQL, backed by a CDC connector:
CREATE TABLE orders (
order_id BIGINT,
product_category STRING,
amount DECIMAL(10, 2),
order_status STRING,
created_at TIMESTAMP(3)
) WITH (
'connector' = 'kafka',
'topic' = 'cdc.public.orders',
'properties.bootstrap.servers' = 'kafka:9092',
'format' = 'debezium-json'
);
Then, define the continuous aggregation query and write the results to a sink:
CREATE TABLE revenue_by_category (
product_category STRING,
total_revenue DECIMAL(10, 2),
order_count BIGINT,
last_updated TIMESTAMP(3),
PRIMARY KEY (product_category) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
'url' = 'jdbc:postgresql://localhost:5432/analytics',
'table-name' = 'revenue_by_category'
);
INSERT INTO revenue_by_category
SELECT
product_category,
SUM(amount) AS total_revenue,
COUNT(*) AS order_count,
MAX(created_at) AS last_updated
FROM orders
WHERE order_status = 'completed'
GROUP BY product_category;
This query runs continuously. When a new order arrives, Flink updates the running sum and count for that category and emits an updated row. The JDBC sink writes it as an upsert to the destination PostgreSQL table. Your dashboard or API reads from revenue_by_category and always gets the latest numbers - no refresh job needed.
Multi-Table Joins
Streaming materialized views are not limited to single-table aggregations. Flink SQL supports joins across multiple CDC streams. For example, you can join orders with customers and products to produce a denormalized view:
INSERT INTO enriched_orders
SELECT
o.order_id,
c.customer_name,
c.customer_segment,
p.product_name,
p.category,
o.amount,
o.created_at
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id;
Flink maintains the join state internally. When a customer’s name changes, Flink re-emits the joined rows for all of that customer’s orders with the updated name. This is the power of stateful stream processing applied to view maintenance.
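The enriched_orders sink table is assumed to already exist; following the same pattern as the earlier JDBC sink, its definition would look something like this (connection details are placeholders):

```sql
CREATE TABLE enriched_orders (
  order_id BIGINT,
  customer_name STRING,
  customer_segment STRING,
  product_name STRING,
  category STRING,
  amount DECIMAL(10, 2),
  created_at TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:postgresql://localhost:5432/analytics',
  'table-name' = 'enriched_orders'
);
```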
Incremental vs. Full Refresh Semantics
The fundamental difference between traditional and streaming materialized views comes down to how changes propagate.
Full refresh re-executes the entire query against the full dataset. It is simple to reason about - the result is always a complete, correct snapshot - but the cost scales with the size of the data, not the volume of changes.
Incremental maintenance applies only the delta. When a single row changes, the stream processor updates only the affected portion of the result. A new order adds to one category’s sum. A status change might remove a row from a filtered view and add it to another.
Incremental maintenance is dramatically more efficient when the change rate is small relative to the total dataset. If you have 100 million orders and 1,000 new orders per minute, full refresh reprocesses all 100 million rows. Incremental processing handles only the 1,000 changes.
However, incremental maintenance introduces complexity around correctness. The stream processor must handle:
- Retractions. When a row is updated or deleted, its old contribution to the aggregate must be subtracted. Flink handles this with its retraction mechanism - it emits both a retract message (undo the old value) and an accumulate message (apply the new value). For example, when a completed order moves from one category to another, its amount is subtracted from the old category's sum and added to the new one's.
- Late arrivals. Events that arrive out of order must still produce correct results. Flink’s watermark and event-time processing handle this.
- State management. The processor must keep enough state to compute incremental updates. For a GROUP BY aggregation, this means maintaining the running aggregate per key. For joins, it means keeping both sides of the join in memory.
Changelog Output and Sink Strategies
Flink continuous queries produce a changelog stream as output, not just a result set. Each output record is tagged as an insert, update, or delete. How you handle this changelog depends on the destination store.
Upsert to a relational database (PostgreSQL, MySQL). The JDBC sink uses the primary key to perform INSERT ... ON CONFLICT DO UPDATE. Each changelog record becomes an upsert. This is the most common pattern for serving views to APIs and dashboards.
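In PostgreSQL terms, each changelog record translates to a statement along these lines (the values shown are illustrative):

```sql
INSERT INTO revenue_by_category (product_category, total_revenue, order_count, last_updated)
VALUES ('electronics', 12345.67, 89, '2025-01-15 12:00:00')
ON CONFLICT (product_category) DO UPDATE
SET total_revenue = EXCLUDED.total_revenue,
    order_count   = EXCLUDED.order_count,
    last_updated  = EXCLUDED.last_updated;
```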
Write to Redis. Use the primary key as the Redis key and serialize the row as JSON or a hash. Inserts and updates overwrite the key. Deletes remove it. Redis gives you sub-millisecond read latency, which is ideal for API endpoints that need to serve pre-aggregated data at high throughput.
Index in Elasticsearch. Write each changelog record as a document upsert. This is the right choice when the view needs full-text search or complex filtering that a relational store cannot serve efficiently.
Append to a data warehouse (Snowflake, BigQuery, ClickHouse). If you need the streaming view for analytics, you can sink the changelog to a warehouse. Some warehouses support merge operations; others work better with an append-only model where you materialize the latest state at query time.
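With the append-only model, the latest state per key can be recovered at query time with a window function. The changelog table name here is hypothetical:

```sql
-- Keep only the most recent record per category from the appended changelog.
SELECT product_category, total_revenue, order_count, last_updated
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY product_category
           ORDER BY last_updated DESC
         ) AS rn
  FROM revenue_by_category_changelog
) AS t
WHERE rn = 1;
```

Many teams wrap this in a warehouse view so consumers see only the current state.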
Streamkap supports writing CDC data to all of these destinations, and when combined with Flink for in-stream transformations, the full pipeline from source change to updated materialized view runs end to end without custom glue code.
Practical Use Cases
Real-Time Dashboards
The most obvious use case. Instead of a dashboard that shows data from the last hourly refresh, every chart and metric reflects the current state. An operations team monitoring order volume, error rates, or inventory levels sees changes as they happen.
The streaming materialized view pre-computes the aggregations (counts, sums, percentiles by time window) so the dashboard query is a simple SELECT against an already-computed table. There is no heavy aggregation at read time.
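The dashboard's read path then reduces to a trivial query against the pre-computed table, for example:

```sql
SELECT product_category, total_revenue, order_count
FROM revenue_by_category
ORDER BY total_revenue DESC;
```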
API Endpoints Serving Pre-Aggregated Data
Many APIs serve data that is derived from multiple tables - a user profile endpoint that includes order count, last login, subscription status, and account balance. Building this from normalized tables at request time means multiple joins and aggregations per API call.
A streaming materialized view pre-computes the denormalized result. The API reads a single row from the view table. Response times drop from hundreds of milliseconds to single digits, and the database load drops proportionally.
Search Indexes That Stay Current
If your application has a search feature backed by Elasticsearch or OpenSearch, keeping the index up to date is a constant challenge. A streaming materialized view that joins product data with inventory levels, pricing, and reviews - and sinks the result to Elasticsearch - means the search index is always current. No more “product shows as in stock in search but is actually sold out.”
Microservice Read Models (CQRS)
In a CQRS architecture, the read model is fundamentally a materialized view. Streaming materialized views are a natural fit here. The write side commits to the source database. CDC captures the changes. Flink transforms and denormalizes the data. The result lands in a read-optimized store that the query side of the service reads from.
This decouples the read and write models without requiring application-level event publishing. The database’s transaction log is the source of truth, and CDC ensures nothing is missed.
Operational Considerations
Running streaming materialized views in production requires thinking about a few things that do not apply to batch refresh jobs.
Resource consumption is continuous. A batch refresh uses resources in spikes. A streaming view consumes CPU and memory steadily. The total resource usage is usually lower, but it is always-on. Plan your Flink cluster sizing accordingly.
State size grows with cardinality. A GROUP BY on a column with 1,000 distinct values maintains 1,000 state entries. A join between two tables with millions of rows can consume significant memory. Monitor Flink’s state backend metrics and consider RocksDB state backend for large state.
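Switching to the RocksDB backend, which spills state to disk rather than holding it all on the heap, is a configuration change. These are standard Flink configuration keys, shown here as SQL client session options:

```sql
SET 'state.backend' = 'rocksdb';
-- Checkpoint only the state that changed since the last checkpoint.
SET 'state.backend.incremental' = 'true';
```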
Schema evolution needs handling. When the source schema changes - a new column, a renamed field, a type change - the CDC stream and the Flink job must both adapt. Streamkap’s schema change management handles propagation of schema changes through the pipeline, reducing the manual coordination required.
Exactly-once semantics matter. For financial aggregations or any view where double-counting is unacceptable, you need exactly-once processing guarantees. Flink provides this through its checkpointing mechanism combined with transactional sinks. Make sure your sink connector supports two-phase commit if you need strict correctness.
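Checkpointing itself is enabled through configuration; in the Flink SQL client, for example:

```sql
SET 'execution.checkpointing.interval' = '30s';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
```

Shorter intervals reduce the amount of reprocessing after a failure at the cost of more frequent checkpoint overhead.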
From Batch Refresh to Continuous Maintenance
Migrating from traditional materialized views to streaming ones does not have to be all-or-nothing. A practical approach:
- Identify the views with the tightest freshness requirements. Start with the ones where stakeholders already complain about stale data.
- Set up CDC on the source tables. Streamkap can get a CDC pipeline running in minutes without changes to the source application.
- Write the Flink SQL continuous query. In many cases, you can reuse the existing materialized view SQL with minor modifications for Flink’s streaming semantics.
- Sink to the same destination. Write the continuously updated results to the same table the old view occupied. Downstream consumers do not need to change.
- Remove the cron job. Once you have validated that the streaming view produces correct results, turn off the periodic refresh.
The result is a view that is always fresh, costs less to maintain over time, and removes an entire class of “data is stale” support tickets from your backlog. That is a trade worth making for any team that has outgrown batch refresh windows.