Stream Filtering and Routing: Sending the Right Data to the Right Place
Learn how to filter, split, and route streaming data to multiple destinations based on content, type, or business rules. Build efficient multi-destination pipelines.
Most streaming pipelines start simple: get data from point A to point B. But the moment you have more than one destination, or your source produces data you do not actually need, that simple pipeline becomes an expensive firehose. You are paying to transfer, store, and process records that nobody asked for.
Stream filtering and routing solve this by letting you control what data goes where before it ever leaves the pipeline. For teams dealing with data residency requirements, PII regulations, or multi-region architectures, this is not optional - it is a hard requirement.
Stream Filtering Patterns
Filtering removes records from a stream before they reach the destination. There are three distinct patterns, each addressing a different problem.
Row-Level Filtering
Row-level filtering evaluates each record against a condition and drops the ones that do not match. Typical use cases include excluding test data (environment = 'staging'), removing internal employee traffic, and discarding records that fail schema validation. The principle is simple: if a record will never be queried or acted upon at the destination, do not send it there.
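In code, a row-level filter is just a predicate applied to every record. A minimal Python sketch, reusing the environment and is_test_account fields from the examples above (the field names are illustrative, not tied to any particular platform):

```python
def keep_record(record: dict) -> bool:
    """Return True if the record should continue down the pipeline."""
    if record.get("environment") != "production":
        return False  # drop staging/test environments
    if record.get("is_test_account"):
        return False  # drop internal test-account traffic
    return True

records = [
    {"environment": "production", "is_test_account": False, "order_id": 1},
    {"environment": "staging", "is_test_account": False, "order_id": 2},
    {"environment": "production", "is_test_account": True, "order_id": 3},
]
kept = [r for r in records if keep_record(r)]
# Only order_id 1 survives: staging records and test accounts are dropped
```

The same predicate shape maps directly onto a WHERE clause in Flink SQL or a platform-level filter rule.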
Table-Level Filtering
In CDC pipelines, the source database often has dozens or hundreds of tables, but most pipelines only need a fraction of them. Without table-level filtering, your connector captures every write to every table. For a database with 200 tables where you only need 15, that is an enormous amount of wasted work.
Column Projection
Column projection strips unnecessary columns from each record. If your analytics pipeline only needs user_id, event_type, and timestamp from a 40-column table, there is no reason to transfer the other 37. This directly lowers network transfer costs and keeps your destination schema clean.
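As a sketch, projection is a whitelist applied per record. The column names below mirror the example in the paragraph above and are assumptions:

```python
# Whitelist of columns the analytics destination actually needs
ANALYTICS_COLUMNS = {"user_id", "event_type", "timestamp"}

def project(record: dict) -> dict:
    """Keep only whitelisted columns; everything else never leaves the pipeline."""
    return {k: v for k, v in record.items() if k in ANALYTICS_COLUMNS}

wide_record = {
    "user_id": 42,
    "event_type": "purchase",
    "timestamp": "2024-01-01T00:00:00Z",
    "email": "user@example.com",       # dropped before transfer
    "shipping_address": "1 Main St",   # dropped before transfer
}
slim = project(wide_record)
```

A whitelist (keep these columns) is generally safer than a blacklist (drop these columns): when the source schema gains a new column, a whitelist fails closed rather than leaking it downstream.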
Content-Based Routing
While filtering decides whether a record should exist in the stream at all, routing decides which destination receives it. Content-based routing inspects the content of each record and sends it down a specific path based on field values or business rules.
Routing by Field Values
The most straightforward form uses a field value to determine the destination. A single orders stream split on region sends EU records to an EU warehouse, US records to a US warehouse, and APAC records to an APAC warehouse. This is essential for GDPR and data residency compliance - pipeline-level routing ensures that data never touches infrastructure in the wrong region.
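The routing decision itself can be sketched as a lookup on the region field. The destination names and the fallback below are assumptions for illustration:

```python
# Map region values to destinations; names are hypothetical
ROUTES = {
    "eu": "eu_warehouse",
    "us": "us_warehouse",
    "apac": "apac_warehouse",
}
FALLBACK = "unrouted_review"  # unknown regions go somewhere visible, never silently dropped

def route(record: dict) -> str:
    """Pick a destination based on the record's region field."""
    return ROUTES.get(record.get("region"), FALLBACK)
```

The explicit fallback matters for compliance: a record with a missing or unexpected region should land in a quarantine destination for review, not default into some region's warehouse.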
Routing by Event Type
A single CDC stream from an events table might contain signups, purchases, page views, and error logs. Instead of sending all of them to one destination, you route purchase events to billing analytics, errors to the alerting platform, page views to clickstream analytics, and signups to the CRM. Each downstream system gets only the data it needs.
Routing by Geography
Beyond compliance, geographic routing improves performance. Sending APAC customer data to an APAC-region Snowflake instance means analysts in Singapore are not querying across the Pacific. The latency improvement for dashboards can be dramatic.
Where to Filter: Source, Pipeline, or Destination
Where you apply filtering has a significant impact on cost and performance.
Source-level filtering means the data is never captured in the first place. For CDC pipelines, this is the most efficient approach - no WAL decoding, no serialization, no network transfer. The downside is that the filters are tightly coupled to the database, and changing them means running DDL against the production instance.
Pipeline-level filtering happens after capture but before the destination. This is where Kafka Connect SMTs, Flink SQL, or platform-level rules (like those in Streamkap) operate. It is flexible and decoupled from the source, though you still pay for the initial capture.
Destination-level filtering sends everything downstream and filters with WHERE clauses in queries. It works, but for high-volume streams it can cost 5-10x more than filtering upstream.
The principle is clear: filter as early as possible.
Flink SQL Examples
Apache Flink’s SQL interface makes filtering and routing accessible without writing Java or Scala.
Row-Level Filtering with WHERE
-- Filter out test and staging records
INSERT INTO cleaned_orders
SELECT *
FROM raw_orders
WHERE environment = 'production'
AND is_test_account = false;
Content-Based Routing with Multiple INSERT Statements
-- Route EU orders to EU warehouse
INSERT INTO eu_orders
SELECT order_id, customer_id, amount, created_at
FROM raw_orders
WHERE region = 'eu';
-- Route US orders to US warehouse
INSERT INTO us_orders
SELECT order_id, customer_id, amount, created_at
FROM raw_orders
WHERE region = 'us';
-- Route high-value orders to compliance system
INSERT INTO compliance_review
SELECT order_id, customer_id, amount, created_at, payment_method
FROM raw_orders
WHERE amount > 10000;
When these statements are submitted together as a statement set (EXECUTE STATEMENT SET ... END in the Flink SQL client), Flink compiles them into a single job that reads raw_orders once and fans the data out to multiple sinks based on the WHERE conditions. Run individually, each INSERT becomes its own job with its own scan of the source.
Column Projection for PII Stripping
-- Analytics feed without PII
INSERT INTO analytics_orders
SELECT order_id, product_category, amount, region, created_at
FROM raw_orders
WHERE environment = 'production';
-- Deliberately omitting: customer_name, email, shipping_address
By selecting only the columns needed for analytics, you create a PII-free stream safe to share with broader teams.
PostgreSQL CDC Filtering
PostgreSQL publications let you push filtering all the way to the database level - the earliest possible point for CDC pipelines.
Table-Level Publication Filtering
-- Only capture changes from specific tables
CREATE PUBLICATION streamkap_pub FOR TABLE
orders,
customers,
products,
inventory;
Any table not listed is completely ignored by the CDC connector - no WAL decoding, no serialization, no network transfer.
Row-Level Publication Filtering (PostgreSQL 15+)
PostgreSQL 15 introduced row filters on publications:
-- Only capture production orders over $100
CREATE PUBLICATION filtered_orders FOR TABLE orders
WHERE (environment = 'production' AND total_amount > 100);
-- Only capture EU customers
CREATE PUBLICATION eu_customers FOR TABLE customers
WHERE (region = 'eu');
Records that do not match the filter never leave the database. This is the most resource-efficient filtering possible for CDC.
Caveat: if a publication publishes UPDATE or DELETE operations, every column referenced in the row filter must be part of the table's replica identity (for example, REPLICA IDENTITY FULL or an index covering those columns); otherwise UPDATE and DELETE statements on the table will fail. Also note that for UPDATE, the filter is evaluated against both the old and new row values, so a row can transition into or out of the published set.
Multi-Destination Routing
Real-world pipelines rarely have a single destination. Fan-out patterns let you send the same source data to multiple systems, each receiving a tailored view.
Topic Routing in Kafka
In Kafka-based architectures, routing is topic-based. Each destination consumes from a specific topic, and the pipeline routes records accordingly:
raw_orders (source topic)
├── orders.eu → EU Snowflake instance
├── orders.us → US Snowflake instance
├── orders.highvalue → Compliance review system
└── orders.analytics → Analytics warehouse (PII stripped)
A Flink job or Kafka Streams application sits between the source and destination topics, inspecting each record and routing it. Kafka Connect SMTs can also handle simpler routing using the RegexRouter or custom transforms.
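The routing layer in that diagram boils down to a function from record to destination topics - a fan-out, since one record can land in several. A hedged Python sketch using the topic names from the tree above (the $10,000 high-value threshold is illustrative):

```python
def destination_topics(record: dict) -> list[str]:
    """Return every topic a raw order record should be routed to."""
    topics = []
    region = record.get("region")
    if region in ("eu", "us"):
        topics.append(f"orders.{region}")       # regional warehouse feed
    if record.get("amount", 0) > 10000:
        topics.append("orders.highvalue")       # compliance review copy
    topics.append("orders.analytics")           # every record feeds analytics (PII stripped downstream)
    return topics
```

A high-value EU order therefore lands in three topics at once: its regional feed, the compliance topic, and the analytics feed.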
Fan-Out with Filtered Views
A more sophisticated pattern creates “materialized views” of the source data: a full replica to the data lake for archival, a PII-stripped analytics view to the warehouse, and a compliance view of high-value transactions to the audit system. Each view is a filtered, projected, routed subset of the same source stream.
Practical Example: E-Commerce CDC Pipeline
Here is a concrete example tying everything together. You have a PostgreSQL database powering an e-commerce platform. The pipeline needs to filter out test data, route orders by geography to region-specific warehouses, and strip PII from the analytics feed.
Step 1: Source-Level Filtering (PostgreSQL)
-- Publication excludes internal/test tables entirely
CREATE PUBLICATION ecommerce_pub FOR TABLE
orders, customers, products, order_items, inventory;
-- Row filter: only production data (PostgreSQL 15+)
CREATE PUBLICATION ecommerce_prod FOR TABLE orders
WHERE (environment = 'production');
Step 2: Pipeline-Level Routing and Filtering (Flink SQL)
-- EU orders with full detail to EU warehouse
INSERT INTO eu_warehouse.orders
SELECT * FROM cdc_orders
WHERE region = 'eu' AND environment = 'production';
-- US orders with full detail to US warehouse
INSERT INTO us_warehouse.orders
SELECT * FROM cdc_orders
WHERE region = 'us' AND environment = 'production';
-- PII-stripped analytics feed (all regions)
INSERT INTO analytics.orders
SELECT order_id, product_category, order_total, region,
payment_method, created_at, updated_at
FROM cdc_orders
WHERE environment = 'production';
Step 3: Result
The source database produces roughly 2 million order events per day. After filtering:
- Test data filtering removes approximately 15% of records (300K events/day).
- Column projection for the analytics feed reduces average record size by 60%.
- Geographic routing ensures each regional warehouse receives only its own data.
The downstream systems are smaller, faster, and cheaper to operate.
Cost Impact of Early Filtering
Consider a CDC pipeline producing 10 million records per day at an average size of 2 KB.
| Metric | No Filtering | With Filtering (40% reduction) |
|---|---|---|
| Daily data volume | 20 GB | 12 GB |
| Monthly Kafka storage (30-day retention) | 600 GB | 360 GB |
| Monthly data transfer to destination | 600 GB | 360 GB |
| Destination storage (1 year) | 7.2 TB | 4.3 TB |
At typical cloud pricing, the savings reach thousands of dollars per month for a single pipeline. Beyond storage, smaller datasets also mean faster destination queries, shorter batch processing windows, and lower Kafka broker costs.
Platforms like Streamkap make this straightforward by providing filtering and routing controls directly in the pipeline configuration UI - no Flink jobs or custom code required for common patterns.
Monitoring Filtered Streams
Filtering introduces a real risk: accidentally dropping data you need. Over-filtering can silently break downstream analytics, and without monitoring you may not notice for weeks.
Track Filter Rates
Every filter should have a metric tracking the ratio of dropped records to total records:
filter_rate = dropped_records / total_records
A sudden spike from 15% to 80% is a strong signal that something changed upstream - perhaps a new environment value or a renamed field. Set alerts for deviations of 10-20% sustained over an hour.
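A minimal sketch of that alert logic, with an assumed 15% baseline and an illustrative 15-point drift threshold:

```python
BASELINE_RATE = 0.15   # expected filter rate (assumption for this sketch)
ALERT_DELTA = 0.15     # alert if the rate drifts more than 15 points from baseline

def filter_rate(dropped: int, total: int) -> float:
    """Fraction of records removed by the filter."""
    return dropped / total if total else 0.0

def should_alert(dropped: int, total: int) -> bool:
    """True when the observed rate deviates too far from the baseline in either direction."""
    return abs(filter_rate(dropped, total) - BASELINE_RATE) > ALERT_DELTA
```

Note the alert fires in both directions: a rate that drops toward zero is just as suspicious as a spike, since it may mean the filter condition no longer matches anything.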
Validate Downstream Completeness
Run periodic completeness checks at the destination. Compare record counts or checksums against the source to verify filtering is working as intended. A practical approach is an “audit stream” that tracks record counts at each pipeline stage: source, post-filter, and destination.
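One way to sketch such an audit check, assuming counts have been collected at each stage and an expected filter rate is known:

```python
def completeness_report(source_count: int, post_filter_count: int,
                        destination_count: int,
                        expected_filter_rate: float = 0.15,
                        tolerance: float = 0.05) -> dict:
    """Compare stage counts: the filter should remove roughly the expected
    fraction, and nothing should be lost between filter and destination."""
    actual_rate = 1 - post_filter_count / source_count
    healthy = (abs(actual_rate - expected_filter_rate) <= tolerance
               and post_filter_count == destination_count)
    return {"actual_filter_rate": actual_rate, "healthy": healthy}
```

Run against the three counts from the audit stream, a mismatch between post-filter and destination counts points at delivery problems, while an off-baseline filter rate points at the filter itself.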
Dead-Letter Queues for Rejected Records
Instead of silently dropping filtered records, route them to a dead-letter topic. If a filter is misconfigured, you can replay the dead-letter queue to recover the data. The storage cost is minimal compared to the risk of permanent data loss.
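The pattern can be sketched as a filter that diverts rejects instead of discarding them. The dead-letter sink here is a plain list standing in for a dead-letter topic:

```python
def filter_with_dlq(records, keep, dead_letters):
    """Yield records that pass `keep`; capture rejects in `dead_letters` for replay."""
    for record in records:
        if keep(record):
            yield record
        else:
            dead_letters.append(record)

dlq = []
passed = list(filter_with_dlq(
    [{"env": "prod", "id": 1}, {"env": "test", "id": 2}],
    keep=lambda r: r["env"] == "prod",
    dead_letters=dlq,
))
# If the filter turns out to be misconfigured, replaying `dlq` recovers the data
```

In a real pipeline the `dead_letters` sink would be a producer writing to a dedicated topic, but the control flow is the same: no record is dropped without a recoverable copy.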
Streamkap provides built-in monitoring for pipeline throughput and can alert on anomalous drops in record volume, giving teams visibility into filtering behavior without custom monitoring infrastructure.
Stream filtering and routing are foundational patterns that directly impact cost, compliance, and system performance. Start with source-level filtering to eliminate data at the point of capture. Layer in pipeline-level routing for multi-destination fan-out and content-based splitting. Use column projection to strip unnecessary fields. And always monitor your filter rates to catch misconfigurations before they become data quality incidents.
The tooling (PostgreSQL publications, Flink SQL, Kafka topic routing, and managed platforms) is mature. The hardest part is not the technology. It is sitting down with your stakeholders, mapping out which data each destination actually needs, and having the discipline to avoid the “send everything everywhere” trap.