ETL Workflow Automation: From Manual Scripts to Real-Time Pipelines
How to automate ETL workflows — orchestration tools, CDC-based streaming, error handling patterns, and the shift from batch scripts to continuous pipelines.
Every data team starts the same way. Someone writes a Python script that pulls rows from a production database, transforms them, and loads them into a warehouse. It works. They schedule it with cron. A month later, there are forty scripts, three people who understand them, and a Slack channel dedicated to “pipeline is broken again.”
ETL workflows are the plumbing of every data operation. When they work, nobody notices. When they fail — and manual ETL fails constantly — the downstream effects spread fast: stale dashboards, broken models, wrong numbers in board decks.
This guide covers why manual ETL stops working, what modern automation looks like, and how to pick the right approach for your team.
Why Manual ETL Breaks at Scale
Manual ETL scripts are not inherently bad. For a single source, a single destination, and a tolerance for hourly data, a well-written script does the job. The problems appear as sources, destinations, and freshness requirements multiply.
Silent Failures
A cron job that fails at 3 AM does not page anyone unless you built alerting around it. Most teams discover failed ETL runs when someone asks why the dashboard numbers look wrong — hours or days after the failure. The script itself might exit with a non-zero code, but if nobody is watching the exit code, it might as well have succeeded.
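The fix is cheap: wrap the script so a non-zero exit code triggers an alert instead of vanishing. A minimal sketch (the alert here is just a stderr print standing in for a Slack or PagerDuty call):

```python
import subprocess
import sys

def run_with_alert(cmd):
    """Run an ETL script and surface failures instead of swallowing them.

    cmd is a placeholder command list, e.g. ["python", "extract_orders.py"].
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # In production this would post to Slack or PagerDuty; here we print.
        print(f"ETL job failed (exit {result.returncode}): {result.stderr.strip()}",
              file=sys.stderr)
    return result.returncode

# The exit code still propagates, so cron (or an orchestrator) sees the failure too.
code = run_with_alert([sys.executable, "-c", "raise SystemExit(3)"])
```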
Schema Drift
Production databases change. A column gets renamed, a new NOT NULL constraint appears, a field type changes from integer to string. Manual scripts that assume a fixed schema break without warning. The lucky case is a crash. The unlucky case is silent data corruption — the script continues running but loads garbage into the warehouse.
Scaling Bottlenecks
A script that queries a production database with SELECT * works fine for a table with 100,000 rows. At 10 million rows, it starts competing with production traffic. At 100 million rows, it either times out or degrades the application. Manual scripts rarely implement incremental extraction well, so teams end up choosing between full table scans (slow, expensive) and fragile watermark logic (fast, breaks when rows get updated out of order).
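The watermark pattern looks deceptively simple, which is why so many teams write it by hand. A sketch against an in-memory SQLite table (table and column names invented) shows both the appeal and the fragility: it pulls only rows past the last high-water mark, but misses deletes entirely and never picks up rows whose timestamp was backdated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2024-01-01T00:00:00"),
    (2, 20.0, "2024-01-02T00:00:00"),
    (3, 30.0, "2024-01-03T00:00:00"),
])

def extract_incremental(conn, watermark):
    """The classic watermark query: only rows changed since the last run.

    Fragile by design -- deletes leave no row behind to find, and a row
    whose updated_at lands behind the watermark (clock skew, bulk fixes)
    is silently skipped.
    """
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

rows, wm = extract_incremental(conn, "2024-01-01T12:00:00")  # picks up ids 2 and 3
```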
Dependency Tangles
ETL workflows have dependencies. Table B depends on Table A being loaded first. The aggregation job depends on both dimension tables being current. Manual orchestration — “run script A, wait, then run script B” — works until you have twenty jobs with overlapping dependencies. At that point, you are spending more time managing execution order than writing transforms.
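Computing a safe execution order from a dependency graph is exactly what orchestrators do for you. The core idea is a topological sort, which Python's standard library exposes directly (job names here are invented):

```python
from graphlib import TopologicalSorter

# Invented job names; each entry maps a job to the jobs it depends on.
deps = {
    "load_table_a": set(),
    "load_table_b": {"load_table_a"},
    "load_dim_customers": set(),
    "daily_aggregation": {"load_table_b", "load_dim_customers"},
}

# static_order() yields jobs in an order that respects every dependency.
order = list(TopologicalSorter(deps).static_order())
```

An orchestrator adds retries, parallelism, and alerting on top, but this is the scheduling core: with twenty interdependent jobs, you want this computed, not hand-maintained.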
Batch Orchestration: The First Layer of Automation
The first step away from cron jobs is a workflow orchestrator. These tools handle scheduling, dependency management, retries, and alerting — the operational overhead that manual scripts push onto the team.
Apache Airflow
Airflow is the default choice for most data teams. You define workflows as directed acyclic graphs (DAGs) in Python. Each node in the DAG is a task — an extraction, a transformation, a quality check — and Airflow manages execution order, retries, and logging.
What Airflow does well:
- Dependency management. If the source extraction fails, downstream transforms do not run. No half-loaded tables.
- Visibility. The web UI shows which tasks ran, which failed, how long each took, and the logs for every execution.
- Retries and backfills. Failed tasks can be retried automatically. Historical runs can be backfilled when logic changes.
- Extensibility. Operators exist for most databases, cloud services, and APIs.
Where Airflow struggles:
- Infrastructure overhead. Self-hosted Airflow requires managing a scheduler, a web server, a metadata database, and worker nodes. Managed versions (MWAA, Cloud Composer, Astronomer) reduce this but add cost.
- DAG complexity. Large Airflow installations end up with hundreds of DAGs and thousands of tasks. Debugging a failure in a 50-task DAG with dynamic task generation is not fun.
- Batch-only model. Airflow is fundamentally a scheduler. It triggers jobs on a schedule (every hour, every day) and is not designed to react to individual events as they happen.
Dagster
Dagster takes a different approach: instead of defining tasks and dependencies, you define data assets and the code that produces them. The orchestrator infers dependencies from asset relationships.
This is a meaningful improvement for teams where the primary question is “is this dataset fresh and correct?” rather than “did this script run?” Dagster’s asset-centric model makes it easier to reason about data lineage, freshness, and quality.
Dagster also handles local development better than Airflow. You can run and test assets locally without standing up an orchestration cluster, which shortens the development loop.
Prefect
Prefect occupies a middle ground — less opinionated than Dagster, less infrastructure-heavy than Airflow. It uses a decorator-based Python API where you annotate functions as tasks and flows. Scheduling, retries, and logging are handled by the Prefect server (cloud or self-hosted).
Prefect is a good fit for teams that want orchestration without restructuring their existing Python code. You can take an existing script, add a few decorators, and get retry logic, logging, and a dashboard.
The Limits of Batch Orchestration
All three tools share a fundamental constraint: they operate on schedules. Even if you run a DAG every five minutes, you are still batching. The pipeline extracts a snapshot, transforms it, and loads it. Between runs, data is stale.
For daily reporting and historical analysis, this is fine. For operational use cases — keeping a search index current, syncing a cache, feeding a dashboard that analysts check throughout the day — batch latency is a real limitation.
The other constraint is extraction itself. Orchestrators manage when jobs run, but they do not change how data is extracted. If your extraction method is a full table scan or a timestamp-based query, the orchestrator automates the scheduling of that query but does not eliminate its performance impact on the source database.
Streaming ETL with Change Data Capture
Change Data Capture (CDC) takes a fundamentally different approach to extraction. Instead of querying the database on a schedule, CDC reads the database’s own transaction log — the write-ahead log in PostgreSQL, the binlog in MySQL, the oplog in MongoDB. Every insert, update, and delete is captured as it happens, with no impact on the source database’s query performance.
This changes the ETL model from “pull snapshots on a schedule” to “stream changes continuously.” The pipeline does not ask “what does the table look like now?” It asks “what changed since last time?” — and “last time” might be 200 milliseconds ago.
What CDC-Based Streaming ETL Looks Like
A CDC pipeline has three stages, same as any ETL workflow:
- Extract: A CDC connector reads the source database’s transaction log and emits change events (insert, update, delete) as a stream.
- Transform: Events are filtered, mapped, or enriched in-stream. Common transforms include column renaming, type casting, field masking, and flattening nested documents.
- Load: Transformed events are written to the destination — a warehouse, a search index, a cache, another database — in near real time.
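An in-stream transform operates on one change event at a time. A sketch of the filter/mask/rename step, assuming an event shape loosely modeled on Debezium-style change events (the op code and "after" image are assumptions, not a specific platform's format):

```python
def transform(event):
    """Mask a sensitive field and rename a column, event by event.

    Event shape is assumed: "op" is i/u/d, "after" holds the new row image.
    """
    if event["op"] == "d":
        return event  # deletes carry no "after" image to transform
    row = dict(event["after"])
    row["customer_email"] = "***masked***"       # field masking
    row["order_total"] = row.pop("total", None)  # column rename
    return {**event, "after": row}

event = {"op": "u", "after": {"id": 7, "total": 99.5, "customer_email": "a@b.com"}}
out = transform(event)
```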
The key difference from batch ETL: there is no “run.” The pipeline is always on. New changes flow through within seconds. There is no scheduling to configure, no watermark logic to maintain, no full table scans to manage.
Advantages Over Batch Extraction
No source database impact. Reading the transaction log is a lightweight operation. The source database does not need to execute additional queries, so there is no performance degradation during extraction.
Automatic incremental logic. CDC captures every change by design. You do not need to implement watermark columns, track high-water marks, or handle late-arriving updates. Deletes are captured too — something that timestamp-based incremental extraction misses entirely.
Sub-second freshness. Changes are available in the destination within seconds of being committed to the source. For data replication use cases, this means the destination is effectively a live replica of the source.
Schema change handling. Good CDC platforms detect schema changes in the source and propagate them downstream — adding new columns, handling type changes — without manual intervention.
Hybrid Approaches: Orchestration + Streaming
Most real-world data platforms do not choose one model exclusively. They use streaming ETL for the operational data flow and batch orchestration for the processing that happens after data lands.
A common pattern:
- CDC streams move data from source databases (PostgreSQL, MySQL, MongoDB) to a warehouse in real time.
- dbt models transform the raw CDC data in the warehouse — joins, aggregations, business logic.
- Airflow or Dagster orchestrate the dbt runs on a schedule (hourly, every 15 minutes) and manage dependencies between models.
In this architecture, the extraction and loading are continuous (CDC), but the transformation is batch (dbt on a schedule). The warehouse always has fresh raw data, and the transformed tables are at most one orchestration interval behind.
This is a pragmatic split. Extraction is where most of the operational pain lives — the schema drift, the full table scans, the watermark bugs. Automating extraction with CDC eliminates the hardest problems. Transformation in the warehouse, where SQL is the interface and compute scales on demand, is well-served by batch orchestration.
Error Handling Patterns
Automated ETL workflows need automated error handling. The patterns differ between batch and streaming, but the principles are the same: detect failures fast, recover without data loss, and alert humans only when automated recovery is not possible.
Batch Error Handling
Retry with backoff. Transient failures — network timeouts, API rate limits, temporary locks — resolve themselves. Retrying after 30 seconds, then 60, then 120 catches most of these. Airflow, Dagster, and Prefect all support configurable retry policies.
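The pattern is simple enough to sketch as a plain decorator (the flaky function below simulates a transient failure that resolves on the third attempt; delays are shortened for illustration):

```python
import functools
import time

def retry(max_attempts=3, base_delay=0.01):
    """Retry transient failures, doubling the delay after each attempt."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the orchestrator alert
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorator

calls = {"n": 0}

@retry(max_attempts=3)
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient network blip")
    return "ok"

result = flaky_extract()  # succeeds on the third attempt
```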
Idempotent loads. If a load step fails partway through, the retry should not create duplicates. Use MERGE/UPSERT operations keyed on primary keys, or load into staging tables and swap atomically.
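SQLite's upsert syntax makes the idea concrete (table name invented; warehouse engines have equivalent MERGE statements): replaying the same batch after a partial failure leaves the table unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_orders (id INTEGER PRIMARY KEY, amount REAL)")

def idempotent_load(conn, rows):
    """Upsert keyed on the primary key: replaying a batch creates no duplicates."""
    conn.executemany(
        "INSERT INTO dim_orders (id, amount) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
        rows,
    )

batch = [(1, 10.0), (2, 20.0)]
idempotent_load(conn, batch)
idempotent_load(conn, batch)  # retry after a partial failure: still two rows
count = conn.execute("SELECT COUNT(*) FROM dim_orders").fetchone()[0]
```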
Circuit breakers. If a source system is consistently failing, stop hammering it. After N consecutive failures, pause the DAG and alert the team. This protects both the source system and the on-call engineer’s sanity.
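A minimal circuit breaker is a failure counter with a trip threshold. This sketch fails fast once the threshold is hit, without touching the struggling source again:

```python
class CircuitBreaker:
    """Stop calling a failing source after `threshold` consecutive failures."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.threshold:
            # Fail fast: pause the pipeline and let alerting take over.
            raise RuntimeError("circuit open: source paused, alert the team")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the counter
        return result

breaker = CircuitBreaker(threshold=2)

def failing_source():
    raise ConnectionError("source down")

for _ in range(2):
    try:
        breaker.call(failing_source)
    except ConnectionError:
        pass
# The circuit is now open; further calls fail fast.
```

A production version would also re-close the circuit after a cooldown period, which is omitted here.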
Streaming Error Handling
Dead letter queues. Events that cannot be processed — malformed data, schema violations, transformation errors — get routed to a dead letter topic instead of blocking the pipeline. Engineers review and reprocess dead-lettered events separately.
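The routing logic is a try/except around the transform: bad events go to a side channel, good events keep flowing. A sketch (the event shapes and a plain list standing in for a dead letter topic are illustrative):

```python
dead_letters = []  # stands in for a dead letter topic

def process(event):
    """A transform that rejects malformed events."""
    return {"id": int(event["id"]), "amount": float(event["amount"])}

def consume(events):
    """Route failures aside instead of blocking the whole stream."""
    loaded = []
    for event in events:
        try:
            loaded.append(process(event))
        except (KeyError, ValueError, TypeError):
            dead_letters.append(event)  # reviewed and reprocessed separately
    return loaded

loaded = consume([
    {"id": "1", "amount": "9.99"},
    {"id": "2"},                       # malformed: missing amount
    {"id": "oops", "amount": "1.0"},   # malformed: non-numeric id
])
```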
Offset management. Streaming pipelines track their position in the source stream (Kafka offset, replication slot LSN). If the pipeline restarts, it resumes from the last committed offset, not from the beginning. This gives you at-least-once delivery without replaying the entire history.
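The commit discipline matters more than the storage: persist the position only after the event is durably loaded, so a crash replays at most the in-flight event. A file-backed sketch (real pipelines commit to Kafka or a replication slot; the file is a stand-in):

```python
import json
import os
import tempfile

class OffsetStore:
    """Persist the last committed position so a restart resumes, not replays."""

    def __init__(self, path):
        self.path = path

    def load(self):
        try:
            with open(self.path) as f:
                return json.load(f)["offset"]
        except FileNotFoundError:
            return 0  # first run: start from the beginning

    def commit(self, offset):
        with open(self.path, "w") as f:
            json.dump({"offset": offset}, f)

path = os.path.join(tempfile.mkdtemp(), "offsets.json")
store = OffsetStore(path)
stream = ["ev0", "ev1", "ev2", "ev3"]

for i in range(store.load(), len(stream)):
    # ... process and load stream[i] ...
    store.commit(i + 1)  # commit only after the event is durably loaded

resumed_at = OffsetStore(path).load()  # a restarted pipeline picks up here
```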
Backpressure. When the destination cannot keep up — a warehouse under heavy query load, an API hitting rate limits — the pipeline should slow down rather than buffer unboundedly in memory. Good streaming platforms handle backpressure automatically, slowing reads from the source to match the destination’s write capacity.
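The simplest backpressure mechanism is a bounded buffer: when it fills, the reader blocks until the writer catches up. A sketch with a deliberately slow consumer standing in for a loaded warehouse:

```python
import queue
import threading
import time

buf = queue.Queue(maxsize=2)  # bounded buffer: the backpressure mechanism
loaded = []

def slow_destination():
    for _ in range(5):
        loaded.append(buf.get())
        time.sleep(0.01)  # simulate a warehouse under heavy query load
        buf.task_done()

writer = threading.Thread(target=slow_destination)
writer.start()
for event in range(5):
    buf.put(event)  # blocks when the buffer is full, throttling the reader
writer.join()
```

Memory stays bounded no matter how slow the destination gets; the source read rate degrades gracefully instead of the pipeline falling over.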
Automatic snapshots. If a CDC pipeline loses its replication slot or falls too far behind, it needs to re-snapshot the source table and then resume streaming from the current position. Manual re-snapshots are a common source of late-night pages. Automated snapshot recovery eliminates this.
Monitoring ETL Workflows
Automation without monitoring is just failing with confidence. Every ETL workflow — batch or streaming — needs three categories of monitoring.
Pipeline Health
- Batch: Task success/failure rates, run duration trends, SLA compliance (did the 6 AM dashboard refresh finish before 7 AM?)
- Streaming: Consumer lag (how far behind is the pipeline from the source?), throughput (events per second), error rates
Data Quality
Pipeline health tells you the code ran. Data quality tells you the output is correct. Basic checks include:
- Row count comparisons between source and destination
- Null rate monitoring on columns that should not be null
- Freshness checks — when was the last row written to the destination?
- Schema validation — do the destination columns match expectations?
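Two of these checks are a few lines of SQL each. A sketch against an in-memory SQLite table (names invented; in practice the thresholds would feed an alerting rule):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dest_orders (id INTEGER, email TEXT, loaded_at TEXT)")
conn.executemany("INSERT INTO dest_orders VALUES (?, ?, ?)", [
    (1, "a@example.com", "2024-06-01T00:00:00"),
    (2, None,            "2024-06-01T00:05:00"),
])

def null_rate(conn, table, column):
    """Fraction of rows where a should-not-be-null column is null."""
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) "
        f"FROM {table}"
    ).fetchone()
    return (nulls or 0) / total if total else 0.0

def last_loaded(conn, table, ts_column):
    """Freshness check: timestamp of the most recent row in the destination."""
    return conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()[0]

rate = null_rate(conn, "dest_orders", "email")       # alert if above a threshold
freshest = last_loaded(conn, "dest_orders", "loaded_at")
```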
Cost
ETL workflows consume compute, storage, and network bandwidth. Monitoring cost per pipeline helps catch runaway queries (a transformation that scans a 10TB table every hour when it should be incremental) and right-size infrastructure.
Choosing Your Automation Strategy
The right approach depends on your data freshness requirements, team size, and existing infrastructure.
If your data can be hours old: Batch orchestration with Airflow or Dagster is sufficient. Focus on idempotent loads, good retry logic, and alerting.
If you need data freshness under a minute: CDC-based streaming for extraction, with batch orchestration for downstream transforms. This is where most teams land — data engineers get real-time raw data without giving up the SQL-based transformation workflow they already know.
If you need sub-second freshness for operational systems: Full streaming ETL with in-stream transforms. No batch layer for the critical path. Batch orchestration handles only non-latency-sensitive workloads like ML training or monthly reports.
If you are a small team: Prioritize managed services over self-hosted. A managed CDC platform plus a lightweight orchestrator (Prefect, or even dbt Cloud’s built-in scheduler) gives you automation without the infrastructure burden.
The common mistake is over-engineering the first iteration. Start with the simplest automation that meets your freshness requirements. You can add complexity — more transforms, more destinations, more monitoring — incrementally. You cannot easily simplify a system that was built complex from day one.
Ready to replace your batch scripts with real-time pipelines? Streamkap automates CDC-based ETL from 60+ sources to your warehouse — no orchestrator required. Start a free trial or see how data engineers use Streamkap.