
Engineering

February 25, 2026


CDC vs ETL: Key Differences and When to Use Each

A clear, in-depth comparison of Change Data Capture and traditional Extract-Transform-Load. Understand how they differ architecturally, how they affect source system performance, which delivers fresher data, and real-world scenarios where one outperforms the other.

TL;DR:

  • ETL extracts and transforms data in scheduled bulk pulls; CDC captures every individual change as it happens by reading the database transaction log.
  • ETL imposes query load on source systems at job execution time; CDC reads the transaction log with near-zero impact on source database performance.
  • CDC provides sub-second data freshness; ETL freshness is bounded by the batch interval.
  • CDC and ETL are complementary - most production architectures use CDC for real-time operational pipelines and ETL (or ELT) for complex analytical transformations in the data warehouse.

Change Data Capture and Extract-Transform-Load are both data movement techniques, but they operate on fundamentally different principles. Choosing between them - or understanding how to combine them - is one of the most important architectural decisions in building a reliable data platform.

This guide explains what distinguishes CDC from ETL at the architectural level, when each approach is the right tool for the job, how they affect the systems they connect, and how the two techniques work together in modern data architectures.


How Traditional ETL Works

ETL - Extract, Transform, Load - is the original approach to moving data between systems. A scheduled job (the extract step) queries the source database to pull a batch of data, applies transformations (cleaning, joining, reshaping), and loads the result into a destination system such as a data warehouse.

The extract is typically incremental - the job queries for records updated since the last run, often using an updated_at timestamp or an auto-incrementing ID. For example:

SELECT * FROM orders
WHERE updated_at > '2026-02-24 23:00:00'
ORDER BY updated_at;

The job runs on a schedule: every hour, every night, every Sunday. Between runs, new data accumulates in the source and waits for the next extraction.
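The high-watermark pattern behind that query can be sketched in a few lines of Python. This is a minimal illustration using an in-memory SQLite table as a stand-in for the production database; the table, column names, and timestamps are illustrative, not from any specific system.

```python
import sqlite3

# In-memory SQLite table standing in for the production orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 19.99, "2026-02-24 22:15:00"),  # before the watermark
        (2, 45.00, "2026-02-24 23:30:00"),  # after the watermark
        (3, 12.50, "2026-02-25 00:05:00"),  # after the watermark
    ],
)

def extract_since(conn, watermark):
    """Pull only rows changed since the last successful run."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # The next run's watermark is the latest timestamp seen in this batch.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

batch, watermark = extract_since(conn, "2026-02-24 23:00:00")
print(len(batch), watermark)  # 2 2026-02-25 00:05:00
```

A scheduler would persist the returned watermark between runs; everything committed between two runs simply waits in the source until the next extraction.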

The Core Limitations of ETL

Staleness is bounded by the batch interval. If your ETL job runs hourly, data in the destination is up to 60 minutes stale at any given moment. If the job runs nightly, data is up to 24 hours stale. For analytical reporting this is often acceptable; for operational systems it frequently is not.

Query-based extraction puts load on the source. The extraction query scans potentially large tables on the production database. At peak times or with large datasets, this can degrade production query performance - a problem that ETL schedulers mitigate by running jobs at off-peak hours, but cannot eliminate entirely.

Deletes are invisible to query-based ETL. A DELETE removes the row from the table. When the ETL job runs its next incremental query, there is nothing to extract - the deleted row is simply gone. Unless the application uses soft deletes (a deleted_at flag), hard deletes are silently dropped from the destination.

Schema changes break pipelines. When a developer adds or renames a column, the ETL query and transformation logic may fail silently or produce incorrect output until the pipeline is updated manually.


How CDC Works

Change Data Capture takes a fundamentally different approach. Instead of querying the source database, CDC reads the database’s internal transaction log - the same log the database uses for replication, crash recovery, and point-in-time restore.

Every change committed to the database - every INSERT, UPDATE, and DELETE - is recorded in the transaction log. CDC reads this log continuously and emits each change as a structured event, typically in a format that includes the operation type (insert/update/delete), the before-image (row state before the change), and the after-image (row state after the change).

These change events are streamed to a message broker (commonly Apache Kafka) or delivered directly to a destination system.
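A single change event typically looks like the sketch below - shaped loosely after the envelope formats used by log-based CDC tools such as Debezium, though the exact field names here are illustrative rather than any one tool's schema.

```python
# One CDC change event for an UPDATE: operation type, source metadata,
# and the row's before- and after-images. Field names are illustrative.
change_event = {
    "op": "update",                       # insert | update | delete
    "source": {"table": "orders", "lsn": 123456789},
    "before": {"id": 42, "status": "pending", "amount": 19.99},
    "after":  {"id": 42, "status": "shipped", "amount": 19.99},
    "ts_ms": 1772064000000,               # commit timestamp (ms)
}

def changed_columns(event):
    """Return the columns whose values differ between before and after."""
    before, after = event["before"] or {}, event["after"] or {}
    return sorted(k for k in after if before.get(k) != after.get(k))

print(changed_columns(change_event))  # ['status']
```

Because each event carries both images, downstream consumers can compute diffs, audit trails, or slowly changing dimensions without querying the source again.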

What the Transaction Log Provides

The transaction log is the authoritative record of everything that happened in the database, in the exact order it happened. Reading it gives CDC several inherent advantages:

Completeness - Every change is captured, including hard deletes. The log records a DELETE as explicitly as it records an INSERT.

Ordering - Changes are captured in the exact order they were committed, which matters when downstream consumers need to reconstruct state correctly.

Minimal source impact - Reading the transaction log is a sequential read of append-only log files. It does not run queries against production tables, does not consume significant database CPU, and does not interfere with application queries.

No application changes required - CDC works against the database directly. Applications write to the database as they always have; CDC captures what they wrote without any integration code in the application.


Key Architectural Differences

Where Data Is Captured

ETL captures data at the table level, via queries that the ETL tool runs against the source database. The source database must be available and responsive at job execution time.

CDC captures data at the transaction log level, independent of the table query interface. The log is always being written; CDC reads it asynchronously with no dependency on query performance.

When Data Is Captured

ETL captures data on a schedule. Changes between runs are invisible to the destination until the next job.

CDC captures data continuously. Changes appear in the destination within milliseconds to seconds of being committed in the source.

What Is Captured

ETL captures the current state of rows that match the extraction query. It cannot capture intermediate states (a row that was updated five times between runs will appear in the destination with only its final value) and cannot capture hard deletes.

CDC captures every state transition. If a row was updated five times between two ETL runs, CDC emits five separate change events. Hard deletes are captured as explicit DELETE events.
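The difference is easy to see in a toy simulation. Below, a simplified list of operations stands in for the transaction log: CDC sees every transition including the DELETE, while a query-based snapshot taken afterwards sees only rows that still exist.

```python
# Simplified stand-in for a transaction log: a row is inserted,
# updated twice, then hard-deleted between two ETL runs.
log = [
    ("insert", 7, {"status": "new"}),
    ("update", 7, {"status": "paid"}),
    ("update", 7, {"status": "shipped"}),
    ("delete", 7, None),
]

# CDC: every state transition is an explicit event, including the DELETE.
cdc_events = list(log)

# ETL: a snapshot query after the fact sees only rows that still exist.
table = {}
for op, key, row in log:
    if op == "delete":
        table.pop(key, None)
    else:
        table[key] = row
etl_snapshot = list(table.values())

print(len(cdc_events), len(etl_snapshot))  # 4 0
```

The ETL snapshot is empty - not only are the intermediate states lost, the row's entire existence is invisible to the destination.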

Source System Impact

ETL executes queries against the production database. Large table scans during peak hours can impact application query latency. Teams mitigate this by running ETL during off-peak windows, but the impact cannot be eliminated entirely.

CDC reads the transaction log, which is a sequential read of append-only files on disk or in memory. The database is already writing this log; CDC adds a reader. The performance impact is minimal - typically less than 1% additional I/O load on the source database.

Transformation Capability

ETL includes transformation as a first-class step. Tools like dbt, Spark, and Glue provide rich transformation primitives: SQL joins, window functions, aggregations, business logic encoded in Python or SQL.

CDC traditionally delivers raw change events without transformation. However, modern CDC platforms built on stream processors (Apache Flink, Kafka Streams) add in-pipeline transformation capabilities: column filtering, data type coercion, masking sensitive fields, routing changes to different destinations based on content, and simple aggregations. Complex multi-source analytical transformations remain better suited to ETL.
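The kinds of in-pipeline transformations listed above can be sketched as a simple function applied to each change event. The event shape, table names, and masking rule here are illustrative, not a specific platform's API.

```python
# In-pipeline transformation on CDC change events: mask a sensitive
# column and drop events from a table we do not want to replicate.
def transform(event):
    if event["source"]["table"] == "audit_log":   # filter by source table
        return None                               # drop the event entirely
    after = dict(event.get("after") or {})
    if "email" in after:                          # mask a sensitive field
        after["email"] = "***"
    return {**event, "after": after}

events = [
    {"op": "insert", "source": {"table": "users"},
     "after": {"id": 1, "email": "ada@example.com"}},
    {"op": "insert", "source": {"table": "audit_log"},
     "after": {"id": 9, "detail": "login"}},
]
out = [e for e in (transform(ev) for ev in events) if e is not None]
print(out[0]["after"]["email"], len(out))  # *** 1
```

In a real pipeline this function would run inside the stream processor, so sensitive data is masked before it ever reaches the destination.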


Performance and Freshness Comparison

| Dimension | Traditional ETL | Log-Based CDC |
| --- | --- | --- |
| Data freshness | Bounded by batch interval (minutes to hours) | Sub-second to seconds |
| Source query load | Significant during job execution | Minimal (log read only) |
| Captures deletes | No (without soft-delete workaround) | Yes |
| Captures intermediate states | No | Yes |
| Schema change handling | Manual update required | Automatic (best platforms) |
| Initial setup complexity | Low to moderate | Moderate (log config required) |
| Transformation capability | Rich (SQL, Python) | Basic to moderate (in-pipeline) |

When to Use ETL

ETL remains the right choice for a wide class of data problems:

Complex analytical transformations - When you need to join data from five different source systems, apply business logic that spans multiple tables, and produce aggregated models for BI tools, ETL (particularly the modern ELT variant using dbt) is the right tool. Stream processing engines can do joins and aggregations, but they are harder to develop and debug for complex multi-source analytical logic.

Data warehousing - Loading and transforming data in Snowflake, BigQuery, or Redshift using dbt is a mature, well-understood workflow with excellent tooling. Columnar warehouses are optimized for large batch writes, not streaming micro-inserts.

Historical backfills and migrations - Moving years of historical data from one system to another is a batch operation. ETL tools are designed for this.

Low-frequency sources - If a table is updated once a day, CDC adds infrastructure complexity with no meaningful latency benefit over a nightly ETL job.

Reporting and BI - Business intelligence dashboards that analysts consult once daily do not require sub-second data freshness. ETL is simpler and more cost-effective for these workloads.


When to Use CDC

CDC is the right choice when data freshness is a functional requirement, not just a preference:

Real-time analytics - Operational dashboards tracking live orders, active users, or system health require data that reflects what is happening now, not what happened last hour.

Fraud detection and risk signals - Financial fraud patterns emerge and vanish in seconds. A fraud model fed by hourly ETL is operating blind for most of its observation window. CDC feeds fraud signals into detection systems within seconds of the triggering transaction.

Cache and search index synchronization - When Redis, Elasticsearch, or Algolia must reflect the current state of a relational database, CDC is the most practical way to keep them consistent without hammering the source with polling queries or periodically rewriting them in bulk.

Database migration with zero downtime - CDC enables live database migrations where the new database stays in sync with the old one during cutover, eliminating maintenance windows.

Microservice data synchronization - In event-driven microservice architectures, CDC from a source-of-truth database provides the reliable change stream that other services can consume, without coupling services directly to each other’s APIs.

AI and ML feature pipelines - ML models in production require feature stores that reflect current data. CDC pipelines keep feature values fresh, eliminating training-serving skew.
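The cache-synchronization scenario above reduces to a consumer that applies each change event to the key-value store. A minimal sketch, using a Python dict as a stand-in for Redis and an illustrative event shape and key scheme:

```python
# Apply CDC change events to a cache; a dict stands in for Redis here.
cache = {}

def apply_change(event):
    row = event["after"] or event["before"]
    key = f"orders:{row['id']}"           # illustrative key scheme
    if event["op"] == "delete":
        cache.pop(key, None)              # DELETE events evict the entry
    else:
        cache[key] = event["after"]       # inserts/updates upsert the entry

apply_change({"op": "insert", "before": None,
              "after": {"id": 1, "status": "new"}})
apply_change({"op": "update", "before": {"id": 1, "status": "new"},
              "after": {"id": 1, "status": "paid"}})
apply_change({"op": "delete", "before": {"id": 1, "status": "paid"},
              "after": None})
print(cache)  # {} - the delete propagated, so no stale entry remains
```

Because deletes arrive as explicit events, the cache never serves a row that no longer exists in the source - the failure mode that query-based refresh cannot avoid.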


CDC and ETL as Complements

The most powerful data architectures use CDC and ETL together, each doing what it does best.

A common pattern: CDC pipelines capture changes from operational databases and deliver raw change events to a data warehouse in near-real time (Snowflake, Databricks, BigQuery all support streaming ingestion). Once the raw data is in the warehouse, dbt transformations produce cleaned, joined, and aggregated analytical models for BI and reporting.

In this architecture:

  • CDC handles the real-time replication layer with sub-second latency and minimal source impact
  • ETL (via dbt) handles the analytical transformation layer with rich SQL logic
  • The destination warehouse receives a continuous stream of fresh data from CDC and serves analytical queries on top of dbt-transformed models

This pattern gives teams the best of both worlds: operational systems get real-time data, analysts get clean models, and the source database is not burdened by heavy query loads.


Practical Considerations for Adopting CDC

Source Database Configuration

Log-based CDC requires enabling logical replication or CDC at the database level. For PostgreSQL this means setting wal_level = logical and configuring a replication slot. For MySQL, binlog_format = ROW must be set. For SQL Server, CDC must be enabled on specific tables via sys.sp_cdc_enable_table.
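For PostgreSQL, that one-time setup typically looks like the following; the slot, publication, and table names here are illustrative.

```sql
-- postgresql.conf (restart required after changing wal_level):
--   wal_level = logical

-- Create a replication slot for the CDC consumer to read from
SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput');

-- Create a publication covering the tables to capture
CREATE PUBLICATION cdc_pub FOR TABLE orders, customers;
```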

All major cloud database providers (AWS RDS, Aurora, Google Cloud SQL, Azure SQL) support these configurations. The setup is a one-time change and does not materially affect database performance after it is in place.

Replication Slots and Log Retention

PostgreSQL replication slots retain WAL files until the CDC consumer has read them. If a CDC pipeline goes offline for an extended period, the WAL can grow large and consume significant disk space. Monitoring replication slot lag is an important operational practice - managed CDC platforms like Streamkap handle this monitoring automatically and alert when slot lag grows beyond safe thresholds.
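On PostgreSQL, slot lag can be checked directly with a query like the one below, which reports how much WAL each slot is retaining for its consumer.

```sql
-- Bytes of WAL retained for each replication slot's consumer
SELECT slot_name,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```

Alerting when this value grows past a safe threshold (or when a slot's consumer goes inactive) prevents an offline pipeline from silently filling the source database's disk.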

Schema Evolution

One of the operational advantages of managed CDC platforms is automatic schema evolution handling. When a developer runs an ALTER TABLE on the source, the CDC platform detects the DDL event in the transaction log, updates its internal schema, and propagates the change to the destination without manual intervention. With ETL, a schema change typically requires updating the extraction query, the transformation logic, and the destination table definition - a manual process that can cause pipeline failures if not caught quickly.


Summary

CDC and ETL are complementary techniques, not competing ones. ETL has a permanent home in data architectures that require complex analytical transformations, multi-source joins, and data warehousing. CDC has a permanent home in architectures that require real-time data movement, operational system synchronization, and minimal source impact.

The key insight is that ETL queries data; CDC observes it. Querying is inherently delayed and imposes load. Observing the transaction log is continuous and nearly costless. For any use case where the question is “what is happening right now,” CDC is the right foundation. For any use case where the question is “what did last quarter look like,” ETL remains the right tool.

Most teams that adopt CDC do not replace their existing ETL pipelines - they add CDC alongside them, progressively shifting real-time and operational workloads to the streaming layer while retaining ETL for the analytical layer.