Infrastructure as Code for Data Pipelines: Terraform, Pulumi, and API-First Approaches
Learn how to manage real-time data pipelines with Terraform and Infrastructure as Code. Includes HCL examples, GitOps workflows, and platform comparisons.
If you have ever spent a Friday afternoon clicking through a web UI to recreate a data pipeline that someone accidentally deleted, you already understand the problem this article is here to solve. Managing data pipelines through point-and-click interfaces — what the industry has started calling “ClickOps” — might feel convenient when you have two or three pipelines. But the moment your organization scales to dozens or hundreds of CDC pipelines across multiple environments, that convenience turns into a liability.
The truth is, the rest of your engineering stack figured this out years ago. Application code lives in Git. Cloud infrastructure is defined in Terraform. Kubernetes manifests are reviewed in pull requests. But data pipelines? For many teams, they still live in a UI somewhere, configured by hand, documented in someone’s head, and completely unreproducible if something goes wrong.
It is time to change that. In this guide, we will walk through how Infrastructure as Code (IaC) principles apply to data pipelines, show real Terraform and YAML examples for defining CDC pipelines as code, and compare IaC support across the major data integration platforms. Whether you already use Terraform for your cloud resources or are just starting to think about managing data pipelines as code, this article will give you a practical roadmap.
What Is Infrastructure as Code for Data Pipelines?
Infrastructure as Code is the practice of managing and provisioning infrastructure through machine-readable configuration files rather than through manual processes. When applied to cloud resources, this usually means tools like Terraform, Pulumi, or CloudFormation defining your servers, networks, and databases as code. The same principle works beautifully for data infrastructure.
For data pipelines specifically, IaC means defining your entire data integration layer — sources, destinations, transformations, and pipeline configurations — as version-controlled code. Instead of logging into a dashboard and clicking “Create Pipeline,” you write a configuration file that describes exactly what the pipeline should look like, commit it to Git, and let your CI/CD system handle the deployment.
This is a meaningful shift in how data teams operate. When your pipeline definitions live in code, they inherit all the benefits that software engineering teams have relied on for decades:
- Version control: Every change to a pipeline is tracked, attributable, and reversible.
- Code review: Pipeline changes go through pull requests, just like application code.
- Reproducibility: Spin up an identical copy of your production pipelines in a staging environment with a single command.
- Automation: Deploy pipeline changes through the same CI/CD system you use for everything else.
- Documentation: The code itself serves as living documentation of your data infrastructure.
If your team already manages its cloud infrastructure with Terraform, extending that same workflow to your data pipelines is a natural next step. You are not learning a new process; you are applying a proven one to a new domain.
Why Data Pipelines Need IaC
Let us be specific about the problems that IaC solves for data teams. These are not theoretical concerns — they are the daily frustrations that slow down data engineers and introduce risk into data-critical operations.
Reproducibility and Environment Parity
One of the most common pain points in data engineering is the drift between development and production environments. A pipeline that works perfectly in dev breaks in production because someone configured the production version slightly differently three months ago. Without IaC, there is no reliable way to ensure that your staging pipelines are exact replicas of production.
With IaC, environment parity becomes trivial. You define your pipeline once, parameterize the environment-specific values (hostnames, credentials, database names), and deploy the same configuration to every environment. The pipeline in staging is provably identical to the one in production, minus the environment-specific variables.
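To make this concrete, here is a minimal sketch of the "define once, parameterize per environment" idea in Python. The field names and environment values are illustrative, not a real platform schema:

```python
# One shared pipeline definition; only environment-specific values differ.
BASE_PIPELINE = {
    "name": "pg-to-snowflake-cdc",
    "source": {"type": "postgresql", "database": "app_db", "port": 5432},
    "destination": {"type": "snowflake", "schema": "RAW_CDC"},
}

# Environment-specific values (hostnames, database names) live separately.
ENVIRONMENTS = {
    "staging": {"pg_host": "staging.db.example.com", "snowflake_db": "ANALYTICS_STG"},
    "production": {"pg_host": "prod.db.example.com", "snowflake_db": "ANALYTICS"},
}

def render(env: str) -> dict:
    """Merge the shared definition with one environment's values."""
    values = ENVIRONMENTS[env]
    return {
        **BASE_PIPELINE,
        "source": {**BASE_PIPELINE["source"], "hostname": values["pg_host"]},
        "destination": {**BASE_PIPELINE["destination"], "database": values["snowflake_db"]},
    }

staging = render("staging")
production = render("production")

# Everything except the parameterized values is provably identical.
assert staging["source"]["database"] == production["source"]["database"]
assert staging["source"]["hostname"] != production["source"]["hostname"]
```

The same structure applies whether the rendered output is HCL variables, a YAML file, or an API payload: the shared definition is the single source of truth, and environments differ only in their injected values.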
Disaster Recovery
What happens when a team member accidentally deletes a critical pipeline? Or when a platform outage corrupts your pipeline configurations? Without IaC, recovery means painstakingly recreating everything from memory, screenshots, and half-complete wiki pages. With IaC, recovery is a terraform apply away. Your entire data infrastructure can be rebuilt from code in minutes, not days.
Audit Trails and Compliance
For teams operating under regulatory requirements — SOC 2, HIPAA, GDPR — the ability to demonstrate who changed what and when is not optional. Git provides a complete, immutable audit trail of every pipeline change. Every modification goes through a pull request with an approver, a timestamp, and a description of why the change was made. This is the kind of governance story that makes auditors happy and keeps your data governance practices airtight.
Team Collaboration and Knowledge Sharing
When pipeline configurations live in a UI, knowledge is siloed. Only the person who built the pipeline knows how it is configured, and when they go on vacation or leave the company, that knowledge walks out the door. IaC democratizes pipeline knowledge. Anyone on the team can read the configuration, understand how a pipeline is set up, and propose changes through a pull request.
Scaling Operations
Managing five pipelines through a UI is manageable. Managing fifty is tedious. Managing five hundred is impossible. IaC lets you templatize common patterns, use loops and conditionals to generate configurations, and manage large fleets of pipelines with the same effort it takes to manage a handful. This is where the ability to manage data pipelines with Terraform really shines — your data infrastructure scales with your organization, not against it.
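The templating idea is easy to sketch. Assuming a fleet of PostgreSQL databases that all follow the same CDC pattern (the names and fields below are illustrative), a few lines of code generate every pipeline definition:

```python
# Generating a fleet of pipeline definitions from a short list of databases.
DATABASES = ["orders", "users", "payments", "inventory"]

def pipeline_for(db: str) -> dict:
    """One templated CDC pipeline definition per source database."""
    return {
        "name": f"{db}-to-snowflake-cdc",
        "source": {"type": "postgresql", "database": db},
        "destination": {"type": "snowflake", "schema": f"{db.upper()}_RAW"},
    }

fleet = [pipeline_for(db) for db in DATABASES]

# Four pipelines from one template; onboarding a fifth is a one-line change.
assert len(fleet) == 4
assert fleet[0]["name"] == "orders-to-snowflake-cdc"
```

Terraform achieves the same leverage declaratively with `for_each` and modules; the point is that the marginal cost of pipeline number five hundred is the same as pipeline number five.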
Terraform for Data Pipelines
If you are not already familiar with Terraform, here is a quick primer. Terraform is an open-source IaC tool created by HashiCorp that lets you define infrastructure in a declarative configuration language called HCL (HashiCorp Configuration Language). You describe the desired state of your infrastructure, and Terraform figures out the steps to make it happen.
Terraform’s power comes from its provider ecosystem. A provider is a plugin that teaches Terraform how to manage a specific type of resource. There are providers for AWS, GCP, Azure, Kubernetes, and — increasingly — for data platforms. When a data integration platform offers a Terraform provider, it means you can define your pipelines, sources, and destinations as HCL code and manage them through standard Terraform workflows.
Here is why that matters for data teams:
- Declarative: You describe what you want, not how to get there. Terraform handles creating, updating, and deleting resources to match your desired state.
- Plan before apply: terraform plan shows you exactly what will change before you make any modifications. No surprises.
- State management: Terraform tracks the current state of your infrastructure, so it knows the difference between what exists and what you want to exist.
- Ecosystem integration: Terraform works alongside your existing cloud infrastructure definitions. Your data pipelines can live in the same repository as your VPCs, databases, and Kubernetes clusters.
Example: Defining a CDC Pipeline with Terraform
Let us get practical. Here is what it looks like to define a complete Change Data Capture pipeline using Terraform — from source database to analytics warehouse. This example uses Streamkap’s Terraform provider to create a PostgreSQL-to-Snowflake CDC pipeline.
First, configure the provider:
terraform {
required_providers {
streamkap = {
source = "streamkap/streamkap"
version = "~> 2.0"
}
}
}
provider "streamkap" {
# Credentials loaded from environment variables:
# STREAMKAP_CLIENT_ID and STREAMKAP_CLIENT_SECRET
}
Next, define your source — a PostgreSQL database that Streamkap will capture changes from:
variable "pg_host" {
description = "PostgreSQL hostname"
type = string
sensitive = false
}
variable "pg_password" {
description = "PostgreSQL password"
type = string
sensitive = true
}
resource "streamkap_source" "postgres_main" {
name = "production-postgres"
type = "postgresql"
hostname = var.pg_host
port = 5432
database = "app_db"
username = "streamkap_cdc"
password = var.pg_password
schema_include_list = "public"
snapshot_mode = "initial"
signal_data_collection = "public.streamkap_signal"
ssh_enabled = true
ssh_host = "bastion.example.com"
ssh_port = 22
ssh_user = "streamkap"
}
Now define your destination — a Snowflake data warehouse where the CDC data will land:
variable "snowflake_url" {
description = "Snowflake account URL"
type = string
}
variable "snowflake_private_key" {
description = "Snowflake private key for authentication"
type = string
sensitive = true
}
resource "streamkap_destination" "snowflake_warehouse" {
name = "analytics-snowflake"
type = "snowflake"
snowflake_url = var.snowflake_url
snowflake_user = "STREAMKAP_LOADER"
snowflake_private_key = var.snowflake_private_key
snowflake_database = "ANALYTICS"
snowflake_schema = "RAW_CDC"
snowflake_warehouse = "LOADING_WH"
snowflake_role = "STREAMKAP_ROLE"
ingestion_mode = "append"
dedupe_enabled = true
batch_size = 10000
}
Finally, wire them together with a pipeline resource:
resource "streamkap_pipeline" "pg_to_snowflake" {
name = "pg-to-snowflake-cdc"
source_id = streamkap_source.postgres_main.id
destination_id = streamkap_destination.snowflake_warehouse.id
snapshot_new_tables = true
transforms = [
{
type = "com.streamkap.transforms.ColumnFilter"
config = {
column_exclude_list = "public.users.ssn,public.users.password_hash"
}
}
]
}
That is a complete, production-ready CDC pipeline defined in about 80 lines of HCL. Let us talk about what you get from this approach versus configuring the same pipeline through a UI:
- Reviewable: A teammate can look at this code in a pull request and immediately understand what the pipeline does, what tables it captures, which columns it excludes, and how it connects to both systems.
- Reproducible: Copy this file, swap the variables for a staging environment, and you have an identical pipeline for testing.
- Auditable: Git tracks who added the SSN column exclusion and when. That matters for compliance.
- Testable: You can validate this configuration in CI before it ever touches production.
GitOps Workflows for Data Infrastructure
Defining pipelines as code is just the beginning. The real power comes when you integrate those definitions into a GitOps workflow — where Git is the single source of truth for your data infrastructure, and changes flow from pull requests through CI/CD to deployment.
Here is what a mature GitOps workflow for data pipelines looks like:
The Pull Request Lifecycle
- Branch: A data engineer creates a feature branch to add a new pipeline or modify an existing one.
- Code: They write or update the Terraform HCL (or YAML configuration) for the pipeline.
- Push and PR: They push the branch and open a pull request.
- Automated validation: CI runs terraform validate and terraform plan, posting the results as a comment on the PR.
- Review: A teammate reviews the plan output, checking for unintended changes or security concerns.
- Merge: Once approved, the PR merges to the main branch.
- Deploy: CI/CD runs terraform apply automatically, deploying the pipeline changes.
This is the same workflow your platform engineering team uses for cloud infrastructure. Data pipelines are no longer a special case — they are part of the same governed, automated deployment process.
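The automated validation step can be a short script in your CI job. Here is a hedged sketch in Python that shells out to Terraform; it assumes Terraform is on PATH in your CI image and uses only documented CLI flags (`-chdir`, `-input=false`, `-out`):

```python
# CI gate: run `terraform validate` and `terraform plan` for one environment
# directory and fail the build if either step fails. Command construction is
# split out so it is easy to inspect and test.
import subprocess

def validate_command(env_dir: str) -> list[str]:
    return ["terraform", f"-chdir={env_dir}", "validate"]

def plan_command(env_dir: str, out_file: str = "tfplan") -> list[str]:
    """Non-interactive plan, saved to a file so apply can reuse it."""
    return ["terraform", f"-chdir={env_dir}", "plan", "-input=false", f"-out={out_file}"]

def run_checks(env_dir: str) -> bool:
    """Return True only if both validate and plan succeed."""
    for cmd in (validate_command(env_dir), plan_command(env_dir)):
        if subprocess.run(cmd).returncode != 0:
            return False  # non-zero exit fails the PR check
    return True

# In CI: sys.exit(0 if run_checks("environments/staging") else 1)
```

Posting the plan output as a PR comment is then a matter of capturing stdout and calling your Git host's API; many teams use off-the-shelf actions for that step instead.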
YAML-Based Pipeline Configuration
Not every team wants or needs Terraform. For simpler workflows, YAML-based pipeline configurations offer a lightweight alternative that is still Git-friendly and machine-readable. Here is an example of a pipeline definition in YAML:
# pipeline-config/pg-to-snowflake.yaml
pipeline:
name: "pg-to-snowflake-cdc"
description: "Real-time CDC from production PostgreSQL to Snowflake analytics"
environment: production
source:
type: postgresql
name: "production-postgres"
connection:
hostname: "${PG_HOST}"
port: 5432
database: "app_db"
username: "streamkap_cdc"
password: "${PG_PASSWORD}"
cdc:
schema_include_list:
- "public"
snapshot_mode: "initial"
signal_collection: "public.streamkap_signal"
destination:
type: snowflake
name: "analytics-snowflake"
connection:
account_url: "${SNOWFLAKE_URL}"
user: "STREAMKAP_LOADER"
private_key: "${SNOWFLAKE_PRIVATE_KEY}"
database: "ANALYTICS"
schema: "RAW_CDC"
warehouse: "LOADING_WH"
role: "STREAMKAP_ROLE"
settings:
ingestion_mode: "append"
dedupe_enabled: true
batch_size: 10000
transforms:
- type: column_filter
config:
exclude:
- "public.users.ssn"
- "public.users.password_hash"
monitoring:
alerts:
- type: latency_threshold
threshold_ms: 5000
notify: "#data-alerts"
- type: pipeline_failure
notify: "#data-oncall"
YAML configurations like this pair naturally with GitOps tools. You can store them alongside your application code, diff them in pull requests, and deploy them through scripts that call the Streamkap REST API.
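The `${VAR}` placeholders in the example above need to be resolved before the config is sent anywhere. A minimal deployment script can do that substitution from environment variables with only the standard library; this sketch fails loudly on missing values rather than silently deploying a broken config:

```python
# Resolve ${NAME} placeholders in a pipeline config from environment
# variables. The placeholder syntax matches the YAML example above.
import os
import re

PLACEHOLDER = re.compile(r"\$\{([A-Z0-9_]+)\}")

def resolve(text: str, env=None) -> str:
    """Substitute ${NAME} placeholders, raising on any missing value."""
    env = os.environ if env is None else env
    def lookup(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"missing environment variable: {name}")
        return env[name]
    return PLACEHOLDER.sub(lookup, text)

raw = 'hostname: "${PG_HOST}"\npassword: "${PG_PASSWORD}"'
resolved = resolve(raw, {"PG_HOST": "db.example.com", "PG_PASSWORD": "s3cret"})
assert 'hostname: "db.example.com"' in resolved
```

From there, the resolved document can be parsed with a YAML library and posted to the deployment API as part of the CI job.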
GitHub Sync for Transforms
One particularly powerful pattern is syncing your transformation logic directly from a GitHub repository. Streamkap supports GitHub Sync for transforms, which means your SQL or Flink-based transformation code lives in Git and automatically deploys when you merge changes. This eliminates the gap between where your transform logic is authored and where it runs — no manual copy-paste from an IDE to a web UI.
Beyond Terraform: Pulumi, REST API, and CLI
Terraform is the most widely adopted IaC tool, but it is not the only option. Different teams have different preferences, and a good data platform should support multiple approaches to automation.
Pulumi
If your team prefers writing infrastructure code in a general-purpose programming language rather than a domain-specific one like HCL, Pulumi is the alternative to look at. Pulumi lets you define infrastructure using Python, TypeScript, Go, or C#, which means you get the full power of loops, conditionals, and abstractions without learning a new language.
Here is what the same PostgreSQL source definition looks like in Pulumi with Python:
import pulumi
import streamkap

config = pulumi.Config()
postgres_source = streamkap.Source(
"postgres_main",
name="production-postgres",
type="postgresql",
hostname=config.require("pg_host"),
port=5432,
database="app_db",
username="streamkap_cdc",
password=config.require_secret("pg_password"),
schema_include_list="public",
snapshot_mode="initial",
)
For teams already writing Python or TypeScript daily, Pulumi can feel more natural than HCL. The trade-off is that Pulumi has a smaller community and ecosystem compared to Terraform, so you may find fewer examples and third-party modules.
REST API
Sometimes you need more flexibility than a declarative IaC tool provides. Maybe you are building a self-service portal where internal teams can provision their own pipelines, or you are integrating pipeline management into a larger orchestration system. This is where a comprehensive REST API becomes essential.
Streamkap’s REST API gives you full programmatic control over every aspect of your data pipelines. You can create, update, delete, and monitor sources, destinations, and pipelines through standard HTTP endpoints. Here is a quick example using curl:
# Create a new source
curl -X POST https://api.streamkap.com/api/sources \
-H "Authorization: Bearer ${STREAMKAP_API_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"name": "production-postgres",
"connector": "postgresql",
"config": {
"hostname": "db.example.com",
"port": 5432,
"database": "app_db",
"username": "streamkap_cdc",
"schema.include.list": "public"
}
}'
The REST API is the foundation that both the Terraform provider and CLI are built on. If you can call an API, you can automate your data pipelines, regardless of which language or toolchain you prefer.
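For teams scripting in Python rather than shell, the same call looks like this with only the standard library. This is a sketch mirroring the curl example above; error handling and retries are omitted for brevity:

```python
# Build and (optionally) send the "create a source" request from Python.
import json
import os
import urllib.request

def create_source_request(token: str, payload: dict) -> urllib.request.Request:
    """Build the POST request; separated out so it is easy to inspect."""
    return urllib.request.Request(
        "https://api.streamkap.com/api/sources",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

payload = {
    "name": "production-postgres",
    "connector": "postgresql",
    "config": {"hostname": "db.example.com", "port": 5432, "database": "app_db"},
}
req = create_source_request(os.environ.get("STREAMKAP_API_TOKEN", ""), payload)

# To actually send it:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```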
CLI
For day-to-day operations and scripting, Streamkap also offers a command-line interface. The CLI is ideal for quick one-off tasks, shell scripts, and integration with tools that are not Terraform-aware. Think of it as the middle ground between the full power of the REST API and the convenience of a web UI.
# List all pipelines
streamkap pipelines list
# Get pipeline status
streamkap pipelines status pg-to-snowflake-cdc
# Pause a pipeline
streamkap pipelines pause pg-to-snowflake-cdc
# Resume with a fresh snapshot
streamkap pipelines resume pg-to-snowflake-cdc --snapshot
The key takeaway here is that there is no single “right” approach. The best data integration platforms with Terraform support also offer REST APIs, CLIs, and other automation interfaces, so you can pick the right tool for each situation.
Comparing IaC Support Across Platforms
Not all data integration platforms treat IaC as a first-class concern. Some offer comprehensive Terraform providers and APIs, while others still expect you to do everything through a web UI. Here is how the major platforms stack up:
| Capability | Streamkap | Fivetran | Airbyte | Confluent |
|---|---|---|---|---|
| Terraform Provider | Yes (all plans) | Yes (Enterprise) | Community-maintained | Yes (Confluent Cloud) |
| Pulumi Provider | Yes | No | No | Community-maintained |
| REST API | Full API (all plans) | Yes (limited in lower tiers) | Yes | Yes |
| CLI Tool | Yes | Limited | Yes (octavia-cli) | Yes (confluent CLI) |
| YAML Pipeline Config | Yes | No | Yes (octavia) | Partial |
| GitHub Sync | Yes (transforms) | No | No | No |
| GitOps Workflow Support | Native | Limited | Community tooling | Partial |
| Environment Separation | Services (Scale+) | Separate accounts | Separate instances | Environment resources |
| API/IaC Cost | Included in all plans | Premium tier feature | Free (self-hosted) | Included |
A few things stand out in this comparison. First, API and IaC access should not be a premium feature. If you are paying for a data integration platform, you should be able to automate it without paying extra. Streamkap includes its Terraform provider, REST API, and CLI in every plan, from Starter at $600/month all the way through Enterprise.
Second, the depth of IaC support matters. Having a Terraform provider is a good start, but full GitOps support — including YAML configs, GitHub Sync for transforms, and environment separation — is what enables truly automated data operations.
Third, community-maintained providers are a risk. When a Terraform provider is not officially maintained by the platform vendor, it tends to lag behind new features, have inconsistent quality, and lack support when things break.
Best Practices for IaC Data Pipelines
Once you have decided to manage your data pipelines with Terraform (or any IaC tool), there are some practices that will save you from common headaches down the road.
Separate State by Environment
Never share Terraform state between environments. Use separate state files (or Terraform workspaces, though many teams prefer separate state files) for development, staging, and production. This prevents a terraform apply in dev from accidentally modifying a production pipeline.
# environments/production/backend.tf
terraform {
backend "s3" {
bucket = "mycompany-terraform-state"
key = "data-pipelines/production/terraform.tfstate"
region = "us-east-1"
}
}
# environments/staging/backend.tf
terraform {
backend "s3" {
bucket = "mycompany-terraform-state"
key = "data-pipelines/staging/terraform.tfstate"
region = "us-east-1"
}
}
Streamkap’s Services feature (available on Scale and Enterprise plans) complements this pattern by providing logical separation within the platform itself, so your dev and production pipelines are isolated at the infrastructure level, not just in your Terraform code.
Never Hardcode Secrets
This should go without saying, but database passwords, API keys, and private keys should never appear in your Terraform code or be committed to Git. Use variables with the sensitive flag, and inject values from a secrets manager at runtime.
variable "pg_password" {
description = "PostgreSQL password"
type = string
sensitive = true # Prevents value from appearing in logs/plan output
}
# In CI/CD, set via environment variable:
# export TF_VAR_pg_password=$(vault kv get -field=password secret/data/postgres)
Popular options for secrets management include HashiCorp Vault, AWS Secrets Manager, and GitHub Actions encrypted secrets. The important thing is that secrets never touch your codebase.
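As a safety net, some teams add a pre-commit check that flags likely hardcoded secrets before they ever reach Git. The sketch below uses illustrative patterns; a real setup would lean on a dedicated scanner such as gitleaks or trufflehog:

```python
# Flag likely hardcoded secrets in Terraform files: a sensitive-looking key
# assigned a quoted literal (rather than a variable or placeholder reference).
import re

SUSPICIOUS = re.compile(
    r'(password|secret|private_key|api_key)\s*=\s*"(?!\$\{|var\.)[^"]+"',
    re.IGNORECASE,
)

def find_hardcoded_secrets(hcl_text: str) -> list[str]:
    """Return the offending lines; an empty list means the file looks clean."""
    return [line.strip() for line in hcl_text.splitlines() if SUSPICIOUS.search(line)]

assert find_hardcoded_secrets("password = var.pg_password") == []
assert find_hardcoded_secrets('password = "hunter2"') == ['password = "hunter2"']
```

Wire it into a pre-commit hook or a CI step, and a leaked credential becomes a failed check instead of a security incident.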
Use Modules for Reusable Patterns
If you find yourself defining similar pipelines repeatedly — say, multiple PostgreSQL databases all going to Snowflake — extract the common pattern into a Terraform module. This reduces duplication and ensures consistency.
# modules/pg-to-snowflake/main.tf
variable "source_name" {}
variable "pg_host" {}
variable "pg_database" {}
variable "snowflake_schema" {}
variable "snowflake_destination_id" {}
resource "streamkap_source" "postgres" {
name = var.source_name
type = "postgresql"
hostname = var.pg_host
database = var.pg_database
# ... common configuration
}
resource "streamkap_pipeline" "cdc" {
name = "${var.source_name}-to-snowflake"
source_id = streamkap_source.postgres.id
destination_id = var.snowflake_destination_id
# ... common pipeline configuration
}
# environments/production/main.tf
module "orders_pipeline" {
source = "../../modules/pg-to-snowflake"
source_name = "orders-db"
pg_host = "orders.db.example.com"
pg_database = "orders"
snowflake_schema = "ORDERS_RAW"
snowflake_destination_id = streamkap_destination.snowflake_warehouse.id
}
module "users_pipeline" {
source = "../../modules/pg-to-snowflake"
source_name = "users-db"
pg_host = "users.db.example.com"
pg_database = "users"
snowflake_schema = "USERS_RAW"
snowflake_destination_id = streamkap_destination.snowflake_warehouse.id
}
Now you can spin up a new CDC pipeline by adding a single module block. That is the kind of operational leverage that makes IaC worth the initial investment.
Implement CI/CD Validation
Every pull request that modifies pipeline configurations should trigger automated validation. At minimum, run terraform validate and terraform plan in CI. For extra confidence, consider adding:
- Policy checks: Use tools like Open Policy Agent (OPA) or Sentinel to enforce rules like “all pipelines must exclude PII columns” or “production pipelines must have alerting configured.”
- Drift detection: Periodically run terraform plan to detect any manual changes made outside of Terraform. If someone clicks through the UI and modifies a pipeline, your CI should flag the drift.
- Cost estimation: Tools like Infracost can estimate the cost impact of pipeline changes before they are deployed.
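Drift detection maps cleanly onto Terraform's documented `-detailed-exitcode` flag, which makes `terraform plan` exit 0 when there are no changes, 2 when changes are pending (drift, if nobody has edited the code), and 1 on error. A scheduled job can turn that into an alert; this sketch assumes Terraform is on PATH:

```python
# Scheduled drift check built on `terraform plan -detailed-exitcode`.
import subprocess

def interpret(code: int) -> str:
    """Map the documented -detailed-exitcode values to a verdict."""
    return {0: "in-sync", 1: "error", 2: "drift-detected"}.get(code, "unknown")

def check_drift(env_dir: str) -> str:
    result = subprocess.run(
        ["terraform", f"-chdir={env_dir}", "plan", "-detailed-exitcode", "-input=false"],
        capture_output=True,
    )
    return interpret(result.returncode)

# In CI, page or post to Slack when the verdict is "drift-detected".
assert interpret(0) == "in-sync"
assert interpret(2) == "drift-detected"
```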
Plan for Import and Migration
If you already have pipelines running in a platform and you want to bring them under Terraform management, you will need to import existing resources into Terraform state. Most Terraform providers support the terraform import command, which lets you bring existing resources under management without recreating them.
# Import an existing source into Terraform state
terraform import streamkap_source.postgres_main src_abc123
# Import an existing pipeline
terraform import streamkap_pipeline.pg_to_snowflake pipe_xyz789
This is a critical step when adopting IaC for an existing data platform. You do not want to tear down and recreate production pipelines just to bring them under Terraform management.
Tag and Organize Resources
As your fleet of pipelines grows, consistent naming and tagging becomes essential. Establish a naming convention early and stick to it.
locals {
environment = "production"
team = "data-engineering"
project = "analytics"
}
resource "streamkap_pipeline" "pg_to_snowflake" {
name = "${local.environment}-${local.project}-pg-to-snowflake"
# ...
}
Clear naming conventions make it easier to search, filter, and audit your pipelines — both in Terraform state and in the Streamkap platform UI.
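A naming convention is only useful if it is enforced. A small CI check can validate every pipeline name against the pattern; the allowed environment prefixes below are an assumption for illustration:

```python
# Enforce a "{environment}-{project}-{description}" pipeline naming
# convention, matching the locals pattern shown above.
import re

NAME_PATTERN = re.compile(r"^(production|staging|dev)-[a-z0-9]+(-[a-z0-9]+)+$")

def valid_name(name: str) -> bool:
    """True when the name follows the convention."""
    return NAME_PATTERN.fullmatch(name) is not None

assert valid_name("production-analytics-pg-to-snowflake")
assert not valid_name("MyPipeline")  # no environment prefix, wrong case
```

Run this over your Terraform plan output (or a list pulled from the API) and a misnamed pipeline fails the build instead of quietly polluting your fleet.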
Bringing It All Together
The shift from ClickOps to Infrastructure as Code for data pipelines is not just a technical preference — it is an operational necessity for teams that want to move fast without breaking things. When your pipeline definitions live in Git, reviewed in pull requests, and deployed through CI/CD, you get the same reliability and velocity that the rest of your engineering organization already enjoys.
Here is a practical summary of what we covered:
- IaC eliminates configuration drift between environments and makes disaster recovery a non-event.
- Terraform providers let you define sources, destinations, and pipelines as declarative HCL code that is reviewed, versioned, and deployed automatically.
- YAML configurations and GitHub Sync offer lighter-weight alternatives for teams that do not need the full Terraform experience.
- REST APIs and CLIs provide the flexibility to integrate pipeline management into any workflow or tool.
- Best practices like environment separation, secrets management, reusable modules, and CI/CD validation keep your IaC data pipelines production-ready as you scale.
The platforms that get this right are the ones that treat IaC as a first-class feature, not an afterthought. That means including API and Terraform access in every plan, supporting multiple automation approaches, and enabling environment separation for real-world multi-stage deployments.
Get Started with IaC Data Pipelines
Streamkap includes its Terraform provider, Pulumi provider, REST API, and CLI in every pricing plan — Starter, Scale, and Enterprise. You do not need to upgrade or pay extra to automate your data pipelines as code.
If you are ready to stop clicking and start coding your data infrastructure:
- Start your free 30-day trial — no credit card required.
- Explore the connector catalog to see which sources and destinations are available.
- Check out the platform overview for a deeper look at Streamkap’s architecture, including managed Kafka, Flink-based transformations, and sub-250ms CDC latency.
Your application infrastructure is already defined as code. Your cloud resources are already in Terraform. It is time your data pipelines caught up.