What Is Real-Time Data? The Engineer's Guide to Sub-Second Pipelines
Everything you need to know about real-time data — what it is, how it works, CDC vs polling, architecture patterns, and how to build sub-second pipelines.
If you’ve spent any time around data teams in the last few years, you’ve heard the phrase “real-time data” thrown around like confetti. Product managers want real-time dashboards. ML engineers want real-time feature stores. Executives want real-time everything. But what does “real-time” actually mean when you strip away the buzzwords?
This guide breaks it down from an engineering perspective. We’ll get specific about latency numbers, walk through the architecture of real-time pipelines, and explain why Change Data Capture (CDC) has become the default starting point for most teams building these systems.
What Real-Time Data Actually Means
Real-time data is information that is available for consumption within milliseconds to low single-digit seconds of its creation at the source. That’s it. The event happens in your database, your application, or your sensor — and within that tight window, a downstream system can see it and act on it.
The word “real-time” comes from systems engineering, where it originally described hard deadlines: a flight control system that must respond within 10ms or the plane crashes. In the data engineering world, we’re usually talking about soft real-time — there’s no catastrophic failure if latency spikes to 5 seconds instead of 500ms, but the value of the data degrades quickly with delay.
Here’s a concrete example. A customer changes their shipping address on an e-commerce site. That UPDATE hits the orders table in PostgreSQL at timestamp T. In a real-time system, the fulfillment service sees that change by T + 1s. In a batch system, the fulfillment service might not see it until T + 6h, after the nightly ETL run completes.
That gap — 1 second vs 6 hours — is where real-time data earns its keep.
Real-Time vs Near-Real-Time vs Batch: Actual Numbers
These three terms get used interchangeably, which causes confusion. Let’s pin down specific latency ranges:
| Processing Model | End-to-End Latency | How It Works | Typical Use |
|---|---|---|---|
| Real-time | < 1 second | Event-driven, continuous stream processing | Fraud detection, cache invalidation, alerting |
| Near-real-time | 1 second – 5 minutes | Micro-batches or short polling intervals | Operational dashboards, search index updates |
| Batch | Minutes to hours | Scheduled jobs that process accumulated data | Nightly reports, data warehouse loads, payroll |
The boundaries are fuzzy, and people argue about them. Some teams call anything under 1 minute “real-time.” Others draw the line at 100ms. For this guide, we’ll use the definitions above because they map to meaningfully different architectures.
Batch processing collects data in bounded chunks and processes them on a schedule. Think: an Airflow DAG that runs every hour to pull new rows from a source database, transform them, and load them into Snowflake.
Near-real-time usually means micro-batching — processing small windows of data every few seconds to a few minutes. Spark Structured Streaming with a 30-second trigger interval is a classic example. It’s faster than batch, but you’re still accumulating data before processing it.
Real-time (streaming) processes each event individually as it arrives, with no accumulation window. The pipeline is always running, and each record flows through within sub-second latency. This is the domain of Kafka consumers, Flink jobs, and CDC connectors.
The architecture you need depends entirely on which of these models fits your use case. If you’re loading data into a warehouse for analyst queries, near-real-time or even batch might be fine. If you’re syncing a cache that powers production API responses, you probably need true real-time.
How Real-Time Data Pipelines Work
A real-time data pipeline has five stages, one of them optional. Every implementation, from a hand-rolled Kafka setup to a managed platform, follows the same structure:
┌────────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────────────┐
│   Source   │───▶│   Capture    │───▶│  Transport   │───▶│ Destination │
│ (Database, │    │ (CDC, Event  │    │ (Kafka, msg  │    │ (Warehouse, │
│  API, App) │    │  Emitter)    │    │  queue)      │    │  Cache, DB) │
└────────────┘    └──────────────┘    └──────────────┘    └─────────────┘
                                             │
                                      ┌──────▼───────┐
                                      │   Process    │
                                      │  (Filter,    │
                                      │  Transform,  │
                                      │  Enrich)     │
                                      └──────────────┘
1. Source — Where data originates. This is typically an OLTP database (PostgreSQL, MySQL, MongoDB, DynamoDB), but it can also be an API, application event, IoT sensor, or message queue.
2. Capture — The mechanism that detects changes at the source and converts them into a stream of events. For databases, this is almost always Change Data Capture (CDC). For applications, it might be an event emitter that publishes to a topic whenever something happens.
3. Transport — The messaging layer that moves events from capture to consumption. Apache Kafka is the most common choice here, but Amazon Kinesis, Apache Pulsar, and Redpanda are also widely used. The transport layer handles ordering, durability, and fan-out to multiple consumers.
4. Process (optional) — Transformation, filtering, aggregation, or enrichment applied to events in-flight. This is where stream processing engines like Apache Flink or ksqlDB come in. Not every pipeline needs this stage — sometimes you just want to move data from A to B without touching it.
5. Destination — Where the processed data lands. Data warehouses (Snowflake, BigQuery, ClickHouse), caches (Redis, Elasticsearch), operational databases, or downstream applications.
The latency of your pipeline is the sum of latency across all these stages. CDC capture adds 50–200ms. Kafka adds 5–50ms per hop. A Flink transformation adds 10–100ms depending on complexity. Writing to the destination adds another 50–500ms depending on the system. Total end-to-end: typically 200ms to 2 seconds for a well-tuned pipeline.
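A quick way to sanity-check a design against these numbers is to sum the per-stage ranges. A minimal sketch (the figures are the illustrative ranges above, not measurements from any specific system):

```python
# Illustrative per-stage latency ranges in milliseconds, taken from the
# rough figures quoted above (not benchmarks).
STAGE_LATENCY_MS = {
    "cdc_capture": (50, 200),
    "kafka_hop": (5, 50),
    "flink_transform": (10, 100),
    "destination_write": (50, 500),
}

def latency_budget(stages):
    """Sum best-case and worst-case latency across the chosen stages."""
    best = sum(STAGE_LATENCY_MS[s][0] for s in stages)
    worst = sum(STAGE_LATENCY_MS[s][1] for s in stages)
    return best, worst

best, worst = latency_budget(
    ["cdc_capture", "kafka_hop", "flink_transform", "destination_write"]
)
print(f"End-to-end budget: {best}-{worst} ms")  # 115-850 ms
```

Dropping the processing stage (a straight CDC-to-destination pipeline) tightens the budget accordingly.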
CDC: The Foundation of Real-Time Data
If your source is a database — and for most teams, it is — then Change Data Capture is where real-time starts. CDC has become the default method for extracting data from operational databases because it solves two problems that polling-based approaches cannot.
Why Not Just Poll?
The naive approach to getting data out of a database is polling: run a SELECT * FROM orders WHERE updated_at > ? query every N seconds. This works, but it has three serious problems:
- Load on the source database. Every poll query hits the source. At high frequency (say, every second), you’re adding constant read load to a production database. At low frequency, you increase latency.
- Missed deletes. A SELECT can only see rows that exist. If a row gets deleted between polls, you’ll never know it happened. You’d need soft deletes or a separate audit mechanism.
- Missed intermediate states. If a row gets updated three times between polls, you only see the final state. For some use cases (audit logging, event sourcing), you need every intermediate change.
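The deletes-and-intermediate-states problem is easy to demonstrate. The sketch below polls an in-memory SQLite table using the updated_at pattern; the table name and schema are illustrative:

```python
import sqlite3

# In-memory table standing in for a production database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)")
conn.execute("INSERT INTO orders VALUES (1, 'pending', 100), (2, 'pending', 100)")

def poll(conn, since):
    """One polling pass: return rows modified after `since`."""
    return conn.execute(
        "SELECT id, status FROM orders WHERE updated_at > ?", (since,)
    ).fetchall()

# Between polls, row 1 is updated twice and row 2 is deleted.
conn.execute("UPDATE orders SET status = 'paid', updated_at = 150 WHERE id = 1")
conn.execute("UPDATE orders SET status = 'shipped', updated_at = 160 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")

changes = poll(conn, since=100)
# Only row 1's final state is visible; the intermediate 'paid' state
# and the delete of row 2 are both invisible to the poller.
print(changes)  # [(1, 'shipped')]
```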
How CDC Actually Works
CDC reads the database’s write-ahead log (WAL in PostgreSQL, binlog in MySQL, oplog in MongoDB, DynamoDB Streams in DynamoDB). This is the same log the database uses internally for replication and crash recovery. Every insert, update, and delete is recorded in this log before it’s applied to the actual table data.
A CDC connector (like Debezium, or the connectors built into Streamkap) attaches to this log as a logical replication client. From the database’s perspective, it looks like another read replica. The connector reads each log entry, parses it into a structured change event, and publishes it downstream.
Here’s what a CDC change event typically looks like:
{
"op": "u",
"before": {
"id": 42,
"status": "pending",
"updated_at": "2026-03-12T10:00:00Z"
},
"after": {
"id": 42,
"status": "shipped",
"updated_at": "2026-03-12T10:05:00Z"
},
"source": {
"table": "orders",
"db": "ecommerce",
"lsn": "0/1A2B3C4D"
},
"ts_ms": 1741776300000
}
The op field tells you the operation type: c for create, u for update, d for delete, r for snapshot read. The before and after fields give you the full row state on both sides of the change. And the source metadata includes the log sequence number (LSN), which lets you track exactly where you are in the stream.
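To make the event shape concrete, here’s a minimal consumer sketch that applies Debezium-style change events to an in-memory copy of the table. The helper function is ours, not a library API, and the rows are trimmed to a couple of fields:

```python
def apply_change(table, event):
    """Apply one CDC change event to an in-memory table keyed by primary key.

    Field names (op, before, after) follow the Debezium-style envelope
    shown above.
    """
    op = event["op"]
    if op in ("c", "r", "u"):          # create, snapshot read, update
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":                     # delete: only 'before' carries the row
        table.pop(event["before"]["id"], None)
    return table

table = {}
apply_change(table, {"op": "c", "after": {"id": 42, "status": "pending"}})
apply_change(table, {"op": "u",
                     "before": {"id": 42, "status": "pending"},
                     "after": {"id": 42, "status": "shipped"}})
print(table[42]["status"])  # shipped
```

A real consumer would also track the LSN from the source metadata so it can resume from the right position after a restart.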
CDC Advantages
- Zero impact on source queries. Reading the WAL doesn’t run SELECT queries against your tables. The overhead is comparable to adding a read replica.
- Captures every change. Inserts, updates, deletes, and every intermediate state. Nothing is missed between polls.
- Preserves ordering. Changes arrive in the exact order they were committed to the database.
- Sub-second latency. Changes are available in the log within milliseconds of commit. A good CDC connector picks them up within 100–200ms.
For a deeper dive into CDC mechanics, check out our CDC overview guide.
Key Patterns in Real-Time Data
Once you have a stream of change events flowing, there are several well-established patterns for what you do with them.
Event Streaming
The simplest pattern: produce events to a topic, and let multiple consumers read from it independently. Kafka’s publish-subscribe model is the textbook example. Your CDC connector writes to a Kafka topic, and then your warehouse loader, your cache updater, and your analytics service all consume from that same topic at their own pace.
This is a fan-out pattern. One source change triggers updates in multiple downstream systems, and each consumer manages its own offset (position in the stream). If the warehouse loader falls behind, the cache updater isn’t affected.
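The offset mechanics can be sketched in a few lines. This is a toy model, not Kafka’s actual client protocol, but it shows why one slow consumer doesn’t block another:

```python
class Topic:
    """Toy append-only log. Each consumer tracks its own read position."""
    def __init__(self):
        self.log = []

    def produce(self, event):
        self.log.append(event)

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0  # position in the log, owned by this consumer

    def poll(self):
        events = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return events

topic = Topic()
warehouse, cache = Consumer(topic), Consumer(topic)

topic.produce({"op": "u", "id": 42})
print(cache.poll())      # cache updater consumes immediately
topic.produce({"op": "c", "id": 43})
# The warehouse loader catches up at its own pace without affecting
# the cache updater's position:
print(warehouse.poll())  # both events
```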
Stream Processing
When you need to transform, aggregate, or enrich events before they reach their destination, you add a stream processing layer. Common operations include:
- Filtering: Drop events that don’t match certain criteria (e.g., only forward orders above $100)
- Transformation: Reshape events, rename fields, convert data types
- Enrichment: Join streaming events with reference data (e.g., join an order event with customer profile data)
- Aggregation: Compute running counts, sums, or averages over time windows (e.g., orders per minute by region)
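As a rough illustration, three of these operations (filtering, transformation, enrichment) can be modeled as plain Python generators chained over a stream; the field names, threshold, and reference table are made up for the example:

```python
# Reference data used by the enrichment step (would be a table or
# changelog stream in a real pipeline).
customers = {7: {"tier": "gold"}}

def filter_large(events, min_total=100):
    # Filtering: drop orders below the threshold
    return (e for e in events if e["total"] >= min_total)

def transform(events):
    # Transformation: reshape the event, renaming 'total' to 'amount_usd'
    return ({"id": e["id"], "customer_id": e["customer_id"],
             "amount_usd": e["total"]} for e in events)

def enrich(events):
    # Enrichment: join each event with customer reference data
    return ({**e, "tier": customers.get(e["customer_id"], {}).get("tier")}
            for e in events)

stream = [{"id": 1, "customer_id": 7, "total": 250},
          {"id": 2, "customer_id": 9, "total": 40}]
out = list(enrich(transform(filter_large(stream))))
print(out)  # [{'id': 1, 'customer_id': 7, 'amount_usd': 250, 'tier': 'gold'}]
```

Aggregation is the step that requires real state management (windows, checkpoints), which is where an engine like Flink earns its complexity.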
Apache Flink is the strongest option here for stateful stream processing. It handles event-time semantics, exactly-once processing, and large state with RocksDB-backed checkpoints. A simple Flink SQL job to aggregate orders by status in 1-minute windows looks like this:
SELECT
status,
COUNT(*) AS order_count,
TUMBLE_END(event_time, INTERVAL '1' MINUTE) AS window_end
FROM orders_stream
GROUP BY
status,
TUMBLE(event_time, INTERVAL '1' MINUTE);
If you’re evaluating Flink for your pipeline, we’ve written about stream processing patterns and Flink vs Spark in detail.
Materialized Views
A materialized view is a precomputed query result that stays up-to-date as the underlying data changes. In the batch world, you’d rebuild the view periodically with a REFRESH MATERIALIZED VIEW command or a scheduled dbt run. In the streaming world, the view updates continuously as new events arrive.
This pattern is especially powerful for read-heavy workloads. Instead of running an expensive JOIN across three tables every time a user loads a dashboard, you maintain a pre-joined, pre-aggregated table that updates in real time. The read path becomes a simple key lookup.
Flink, ksqlDB, and Materialize all support this pattern natively. You define the query once, and the engine maintains the result as a continuously updated table.
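The core idea can be sketched without any engine: apply each change event’s delta to a precomputed aggregate instead of re-running the query on read. This toy version maintains order counts by status from the CDC event shape shown earlier; real engines add persistence, fault tolerance, and event-time handling on top of the same principle:

```python
from collections import Counter

view = Counter()  # the "materialized" result: order count by status

def maintain(view, event):
    """Incrementally update the aggregate for one change event."""
    op = event["op"]
    if op in ("c", "r"):
        view[event["after"]["status"]] += 1
    elif op == "u":
        view[event["before"]["status"]] -= 1
        view[event["after"]["status"]] += 1
    elif op == "d":
        view[event["before"]["status"]] -= 1

maintain(view, {"op": "c", "after": {"status": "pending"}})
maintain(view, {"op": "u", "before": {"status": "pending"},
                "after": {"status": "shipped"}})
# Reads are now O(1) lookups instead of a GROUP BY over the table.
print(dict(view))
```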
Common Real-Time Architectures
Three architectural patterns dominate real-time data systems. Each makes different trade-offs around complexity, latency, and consistency.
Lambda Architecture
Lambda runs two parallel pipelines: a batch layer for accuracy and a speed layer for low latency. The batch layer reprocesses all historical data periodically (usually with Spark or similar), producing the “correct” result. The speed layer processes only recent events in real time, producing approximate results. A serving layer merges the two.
              ┌─────────────────────┐
              │     Batch Layer     │
Source ──┬───▶│  (Spark / Hadoop)   │──────┐
         │    └─────────────────────┘      │    ┌──────────┐
         │                                 ├───▶│ Serving  │
         │    ┌─────────────────────┐      │    │  Layer   │
         └───▶│     Speed Layer     │──────┘    └──────────┘
              │   (Flink / Kafka)   │
              └─────────────────────┘
Pros: You get both real-time speed and batch-level accuracy. If the speed layer has a bug, the batch layer corrects it on the next run.
Cons: You’re maintaining two separate codepaths that need to produce the same results. This is genuinely painful to operate. Every schema change, every business logic update, every bug fix has to be applied in two places.
Kappa Architecture
Kappa simplifies Lambda by eliminating the batch layer entirely. Everything runs through a single streaming pipeline. If you need to reprocess historical data (say, to fix a bug in your transformation logic), you replay the event log from the beginning.
Source ──── CDC ────▶ Kafka ────▶ Flink ────▶ Destination
                        │
                        └── (replay from offset 0 for reprocessing)
Pros: One codebase, one pipeline, one set of semantics. Much simpler to operate than Lambda.
Cons: Reprocessing by replaying the full log can be slow and expensive if you have months or years of data. You also need Kafka (or equivalent) to retain events long enough to replay, which means significant storage costs at scale.
For most teams building new real-time systems today, Kappa is the default choice. The operational simplicity outweighs the reprocessing trade-off, especially since modern tools like Flink support savepoints and incremental reprocessing.
CDC-to-Warehouse (Streaming ETL)
This is the most common pattern in practice, even if it doesn’t have a fancy name. You use CDC to capture changes from your operational database, stream them through Kafka (or directly), and load them into a cloud data warehouse like Snowflake, BigQuery, or Databricks.
PostgreSQL ──── CDC ────▶ Kafka ────▶ Snowflake
MongoDB    ──── CDC ────▶ Kafka ────▶ BigQuery
MySQL      ──── CDC ────▶ Kafka ────▶ Databricks
This replaces the traditional batch ETL pattern (Airflow + dbt + nightly loads) with a continuous pipeline. Your warehouse tables are always within seconds or minutes of the source, instead of hours behind.
This is the pattern that Streamkap is built around. We handle the CDC extraction, transport, and warehouse loading as a single managed pipeline. No Kafka cluster to provision, no connectors to configure — you point it at your source database and your destination, and data starts flowing.
For architecture comparisons, take a look at our guide on data pipeline architectures.
Real-Time Use Cases (With Specifics)
Let’s move beyond vague “real-time analytics” hand-waving and talk about specific use cases where real-time data makes a measurable difference.
Fraud Detection
A payment processor needs to evaluate transactions against fraud rules before the authorization response is sent back to the merchant. That response timeout is typically 2–3 seconds. If your fraud scoring system runs on batch data that’s 6 hours stale, you’re making decisions based on yesterday’s patterns. A real-time pipeline feeds transaction events to a Flink job that computes rolling aggregates (transactions per card per hour, velocity across merchants) and scores each transaction in under 200ms.
Operational Analytics
A logistics company tracks 50,000 active shipments. The operations team needs to see delays, exceptions, and capacity constraints as they develop — not in tomorrow’s morning report. CDC captures every status update from the TMS (Transportation Management System) and streams it to ClickHouse, where the dashboard queries run. End-to-end latency: ~3 seconds.
AI/ML Feature Serving
An e-commerce recommendation engine needs features like “items viewed in last 5 minutes” and “cart additions in current session.” These features are useless if they’re computed from a batch pipeline that ran 2 hours ago. A real-time feature store uses Kafka + Flink to maintain sliding window aggregates and serves them to the model at inference time with p99 latency under 50ms.
Cache Synchronization
Your application reads product catalog data from Redis to avoid hitting PostgreSQL on every request. But when a product price changes in PostgreSQL, how quickly does Redis reflect that? With polling, maybe 30 seconds at best (and you’re hammering Postgres with queries). With CDC, the price change flows from the PostgreSQL WAL to Redis in under a second, with zero additional load on the source database. For a deeper look at this pattern, see our guide on CDC-powered cache sync.
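The translation from change events to cache operations is mechanical. Here’s a sketch that returns command tuples rather than calling a real Redis client; the key naming scheme is illustrative:

```python
def cdc_to_redis_cmd(event, key_prefix="product:"):
    """Map one CDC change event to the Redis command that keeps the
    cache in sync. Returns a tuple; a real consumer would send it via
    a Redis client instead."""
    if event["op"] in ("c", "u", "r"):
        row = event["after"]
        return ("SET", f"{key_prefix}{row['id']}", row["price"])
    if event["op"] == "d":
        return ("DEL", f"{key_prefix}{event['before']['id']}")

cmd = cdc_to_redis_cmd({"op": "u",
                        "before": {"id": 7, "price": 19.99},
                        "after": {"id": 7, "price": 24.99}})
print(cmd)  # ('SET', 'product:7', 24.99)
```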
Search Index Updates
When a product description changes in your database, how long until it’s searchable in Elasticsearch? Batch reindexing might mean waiting for the next scheduled run. CDC streams the change to Elasticsearch within seconds, keeping your search results consistent with your source of truth. We’ve covered this pattern in our PostgreSQL to Elasticsearch CDC guide.
Building Your First Real-Time Pipeline
If you’re starting from scratch, here’s a practical path to get a real-time pipeline running. We’ll use the CDC-to-warehouse pattern since it’s the most common starting point.
Step 1: Prepare Your Source Database
For CDC to work, your database needs logical replication enabled. On PostgreSQL, this means setting:
-- postgresql.conf
wal_level = logical
max_replication_slots = 4
max_wal_senders = 4
On RDS or Cloud SQL, you’d set these through parameter groups. You also need a replication slot and publication:
-- Create a publication for the tables you want to stream
CREATE PUBLICATION my_publication FOR TABLE orders, customers, products;
For MySQL, ensure binlog_format = ROW and binlog_row_image = FULL.
Step 2: Choose Your CDC Connector
You have three main options:
Debezium (self-managed): Open-source, runs on Kafka Connect. Full control, but you’re managing the Kafka cluster, Connect workers, and connector lifecycle yourself. Good if you already have Kafka infrastructure.
Managed platforms (Streamkap, Fivetran, Airbyte): SaaS services that handle CDC extraction and delivery. You configure the source and destination through a UI or API. Streamkap is specifically built for real-time CDC with sub-second latency and exactly-once delivery.
Database-native CDC: Some databases offer built-in change streams (MongoDB, DynamoDB). These give you change events directly, but you still need something to consume and route them.
Step 3: Set Up the Transport Layer
If you’re self-managing, you need a Kafka cluster (or Redpanda, Pulsar, etc.) as the transport layer. For a production setup, plan for:
- At least 3 brokers for fault tolerance
- Topic retention sized for your replay needs (7 days is a reasonable default)
- Partitioning aligned with your ordering requirements (usually by primary key)
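The ordering requirement comes down to routing every event for a given key to the same partition. A sketch of key-based partitioning (Kafka’s default partitioner uses murmur2, but any stable hash gives the same per-key ordering property):

```python
import hashlib

def partition_for(key, num_partitions=6):
    """Map a primary key to a partition with a stable hash, so all
    changes to one row land in the same partition and stay ordered."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every change to order 42 maps to the same partition:
print(partition_for(42) == partition_for(42))  # True
```

The corollary: ordering is only guaranteed within a partition, so events for different keys may interleave across consumers.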
If you’re using a managed platform, this layer is abstracted away. Streamkap handles the message transport internally — you don’t provision or manage any messaging infrastructure.
Step 4: Configure Your Destination
Point your pipeline at the destination. For a Snowflake warehouse, you’ll need:
- A dedicated warehouse (XSMALL is fine to start)
- A database and schema for the CDC tables
- A user with write permissions
- The connector configured with your Snowflake account URL and credentials
Most CDC platforms create destination tables automatically, mirroring the source schema. They’ll also handle schema evolution — when you add a column to the source table, it appears in the destination within minutes.
Step 5: Monitor and Tune
Once data is flowing, watch these metrics:
- Replication lag: The time between a change being committed at the source and arriving at the destination. Should be under 10 seconds for most setups.
- Throughput: Events per second. Make sure your pipeline can handle peak load, not just average.
- Error rate: Failed deliveries, serialization errors, schema conflicts. These should be zero in steady state.
- WAL retention: If your consumer falls behind, the WAL grows. PostgreSQL will keep WAL segments until the replication slot confirms it has consumed them. Set max_slot_wal_keep_size to prevent unbounded growth.
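Replication lag is straightforward to compute when events carry a source commit timestamp, like the ts_ms field in the Debezium-style event shown earlier. A sketch (the 10-second threshold matches the guideline above):

```python
import time

def replication_lag_ms(event, now_ms=None):
    """Lag = wall-clock arrival time minus the source commit timestamp
    (`ts_ms` in the Debezium-style envelope)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - event["ts_ms"]

event = {"ts_ms": 1_741_776_300_000}
lag = replication_lag_ms(event, now_ms=1_741_776_300_450)
print(f"lag: {lag} ms")  # lag: 450 ms
if lag > 10_000:
    print("ALERT: replication lag over 10s")
```

In practice you’d emit this as a metric per table and alert on the p99, since a single slow event matters less than sustained lag.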
Tools and Platforms Compared
Here’s an honest comparison of the main tools you’ll encounter when building real-time data pipelines.
Apache Kafka
The default event streaming platform. Kafka handles high-throughput, fault-tolerant message transport between producers and consumers.
- Strengths: Battle-tested at massive scale (LinkedIn, Netflix, Uber), strong ecosystem, configurable retention
- Weaknesses: Operational overhead is significant — managing brokers, ZooKeeper (or KRaft), topic configs, consumer groups, monitoring. JVM tuning is non-trivial
- When to use: You have a dedicated data platform team and need multi-consumer fan-out at scale
Apache Flink
The strongest open-source stream processing engine. Flink handles stateful computations over unbounded data streams with exactly-once semantics.
- Strengths: True event-time processing, large state management, SQL interface, exactly-once guarantees
- Weaknesses: Steep learning curve, operationally demanding (checkpoint tuning, state backend sizing, TaskManager memory). Running it well requires deep expertise
- When to use: You need stateful transformations, windowed aggregations, or complex event processing on streaming data
Debezium
Open-source CDC platform that runs on Kafka Connect. Supports PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, and more.
- Strengths: Broad database support, active community, well-documented, free
- Weaknesses: Requires Kafka and Kafka Connect infrastructure. Connector configuration is YAML/JSON-heavy and error-prone. Monitoring and alerting requires additional tooling
- When to use: You already run Kafka and want CDC without adding a new vendor
Managed Platforms
Services like Streamkap, Fivetran, and Airbyte handle the infrastructure for you.
| Feature | Streamkap | Fivetran | Airbyte |
|---|---|---|---|
| Primary focus | Real-time CDC | Batch/micro-batch ELT | Batch ELT |
| Typical latency | Sub-second to seconds | 1 min – 6 hours | 5 min – 24 hours |
| CDC method | Log-based (WAL/binlog) | Mixed (log + query) | Mixed (log + query) |
| Stream processing | Built-in (Flink) | None | None |
| Delivery guarantee | Exactly-once | At-least-once | At-least-once |
The trade-off with managed platforms is control vs. operational burden. You give up some configurability in exchange for not having to page an engineer at 3am because a Kafka broker ran out of disk.
When Self-Managed Makes Sense
Go self-managed (Kafka + Debezium + Flink) when:
- You have a platform team with streaming expertise
- You need custom processing logic that doesn’t fit managed platform constraints
- You’re already running Kafka for other use cases
- You need to keep data entirely within your own infrastructure
When Managed Makes Sense
Go managed when:
- You want real-time data without hiring a streaming infrastructure team
- Your primary goal is CDC from databases to warehouses/lakehouses
- You’d rather spend engineering time on application logic than Kafka operations
- You need to be up and running in hours, not weeks
The Cost of Delay
Before you decide whether real-time data is worth the investment, think about what stale data is actually costing you today. Here are some questions worth asking:
- How many customer support tickets come in because the dashboard shows outdated information?
- How much revenue is lost to fraud that a 6-hour-old model can’t detect?
- How many operational decisions are made on data that’s already hours behind reality?
- How often does your ML model serve predictions based on stale features?
Real-time doesn’t have to mean rebuilding your entire data stack overnight. Most teams start with one high-value pipeline — usually the one where stale data causes the most visible pain — and expand from there.
The tooling has matured to the point where a single engineer can set up a production CDC pipeline in an afternoon, not a multi-month infrastructure project. The question isn’t whether you can do real-time. It’s whether you can afford not to.
Ready to build your first real-time pipeline? Streamkap handles CDC, streaming, and delivery to your warehouse or lakehouse — no Kafka cluster to manage. Start a free trial or see how it works.