Streaming Pipeline Cost Optimization: Getting More for Less
A practical guide to reducing the cost of real-time streaming pipelines. Covers infrastructure sizing, partition tuning, compression, tiered storage, managed vs self-hosted cost tradeoffs, and monitoring spend.
Running a real-time streaming pipeline is not free. Kafka brokers, connectors, stream processors, and destination sinks all consume compute, storage, and network bandwidth. For small pipelines, these costs are negligible. But as you scale to dozens of topics, hundreds of partitions, and terabytes of daily throughput, the monthly bill starts demanding attention.
The good news is that most streaming pipelines have significant cost optimization opportunities hiding in plain sight. This guide covers the practical techniques that make the biggest difference, from infrastructure sizing to compression to the managed-vs-self-hosted decision.
Infrastructure Sizing: Stop Over-Provisioning
The most common cost mistake in streaming is over-provisioning Kafka brokers. Teams estimate peak throughput, add a generous safety margin, pick a large instance type, and never revisit the decision. Six months later, those brokers are running at 15% CPU utilization while burning through budget.
Measure Before You Size
Before choosing instance types, collect actual metrics from your pipeline. If you are already running, look at the last 30 days of data for:
- CPU utilization. Kafka is surprisingly CPU-light for most workloads. Unless you are doing heavy compression or TLS termination on the broker, you rarely need more than 4-8 cores per broker.
- Network throughput. This is often the real bottleneck. Each message is received by the broker (ingress), replicated to follower brokers (internal traffic), and served to consumers (egress). The total network load is roughly ingress * (1 + (replication_factor - 1) + consumer_count). Size your instances for network bandwidth first.
- Disk throughput and IOPS. Kafka is a sequential-write workload, which is forgiving on disk. But if you are running on EBS volumes in AWS, check your IOPS consumption. You might be paying for io2 volumes when gp3 would suffice, or you might be hitting IOPS limits on gp2 volumes and paying the latency penalty.
- Memory. Kafka uses heap memory for the JVM (6-8 GB is typical) and the rest for the OS page cache. More page cache means more data served from memory rather than disk. 32 GB of total RAM per broker is a common sweet spot, but measure your cache hit rate to verify.
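The network arithmetic above is worth making explicit, since it drives instance selection. A back-of-the-envelope sketch, not a sizing tool; the ingress rate, replication factor, and consumer count below are illustrative assumptions:

```python
def broker_network_load_mbps(ingress_mbps: float,
                             replication_factor: int,
                             consumer_count: int) -> float:
    """Rough total network load generated by a topic's traffic.

    ingress:     data arriving at partition leaders (counted once)
    replication: each message is copied to replication_factor - 1 followers
    egress:      each consumer group reads every message once
    """
    replication = ingress_mbps * (replication_factor - 1)
    egress = ingress_mbps * consumer_count
    return ingress_mbps + replication + egress

# Example: 100 MB/s in, replication factor 3, two consumer groups.
# 100 ingress + 200 replication + 200 egress = 500 MB/s total.
load = broker_network_load_mbps(100, 3, 2)
print(f"{load:.0f} MB/s total network load")
```

Run this with your own p95 ingress numbers; the point is that egress and replication multiply ingress, so a topic that looks modest on the producer side can still saturate broker NICs.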
Right-Size Your Instance Types
Cloud providers offer a bewildering array of instance types. For Kafka brokers, you almost always want a network-optimized or storage-optimized instance, not a general-purpose one.
On AWS, for example:
| Workload | Good Fit | Why |
|---|---|---|
| Moderate throughput (< 200 MB/s per broker) | m6i.xlarge or m6i.2xlarge | Balanced compute and network |
| High throughput (200-500 MB/s per broker) | r6i.2xlarge or r6i.4xlarge | More memory for page cache |
| Very high throughput (> 500 MB/s per broker) | i3en.xlarge or i3en.2xlarge | NVMe local storage, high network |
The pattern is similar on GCP and Azure. The key is to pick the smallest instance that comfortably handles your measured workload, with enough headroom for traffic spikes. A 30-40% buffer above your p95 utilization is reasonable. Anything more is wasted spend.
Scale Horizontally, Not Vertically
If you need more capacity, add brokers rather than upsizing existing ones. Kafka is designed to scale horizontally. Adding a broker and rebalancing partitions across the cluster is cleaner and often cheaper than jumping to a larger instance type. Four m6i.xlarge brokers frequently cost less than two m6i.4xlarge brokers while providing better fault tolerance.
Partition Count Tuning
Partitions are Kafka’s unit of parallelism. More partitions let you run more consumer instances in parallel, which increases throughput. But each partition has a cost, and over-partitioning is one of the most common sources of waste.
The Cost of Each Partition
Each partition on a broker consumes:
- Memory. Index files and log segment metadata are cached in memory. With thousands of partitions per broker, this adds up.
- File handles. Each partition has multiple open file handles for its log segments. Linux defaults to 1024 open files per process, which is nowhere near enough for a busy broker. You need to increase this, and even after increasing it, each handle has a small cost.
- Replication bandwidth. Every partition is replicated to replication_factor - 1 other brokers. More partitions mean more replication traffic.
- Leader election time. When a broker fails, the controller elects new leaders for all the partitions that broker was leading. With 10,000 partitions, this can take minutes, during which those partitions are unavailable.
Guidelines for Partition Count
A good starting point is to target 1 MB/s of throughput per partition. If your topic receives 10 MB/s of data, 10-12 partitions is reasonable. If your topic receives 100 KB/s, you probably need only 1-3 partitions.
The other factor is consumer parallelism. If you have 8 consumer instances in a consumer group, you need at least 8 partitions for the topic, or some consumers will sit idle. But you do not need 100 partitions “just in case.” Adding partitions later is possible (though it does break key-based ordering guarantees for existing keys), so start conservatively.
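The two guidelines above (roughly 1 MB/s per partition, and at least one partition per consumer in the largest group) can be sketched as a small helper. The numbers in the examples are the hypothetical topics from this section:

```python
import math

def suggested_partitions(throughput_mbps: float,
                         consumer_count: int,
                         target_mbps_per_partition: float = 1.0) -> int:
    """Start from ~1 MB/s per partition, but never fewer partitions
    than there are consumers in the largest consumer group."""
    for_throughput = math.ceil(throughput_mbps / target_mbps_per_partition)
    return max(for_throughput, consumer_count, 1)

print(suggested_partitions(10, 8))    # 10 MB/s topic, 8 consumers
print(suggested_partitions(0.1, 2))   # 100 KB/s topic, 2 consumers
```

Treat the result as a floor to grow from, not a ceiling: adding partitions later is possible, while reducing them requires a topic migration.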
Audit your existing topics. It is common to find topics with 50 or 100 partitions that receive a trickle of data. These are prime candidates for consolidation. Reducing partition count on existing topics requires creating a new topic with fewer partitions and migrating consumers, which is why getting the count right from the start matters.
Compression: Cheap and Effective
Enabling compression on your Kafka producers is one of the highest-impact, lowest-effort optimizations you can make. Compressed messages use less network bandwidth, less disk space, and less replication traffic. The only cost is CPU cycles on the producer and consumer, and for modern codecs, that cost is minimal.
Choosing a Codec
| Codec | Compression Ratio | CPU Cost | Best For |
|---|---|---|---|
| GZIP | High (70-80% reduction) | High | Cold storage, archival topics |
| Snappy | Moderate (50-60% reduction) | Low | General-purpose, latency-sensitive |
| LZ4 | Moderate (50-60% reduction) | Very low | High-throughput, latency-sensitive |
| ZSTD | High (65-75% reduction) | Moderate | Best balance of ratio and speed |
For most production workloads, ZSTD or LZ4 are the right choices. ZSTD gives you better compression ratios with acceptable CPU overhead, which translates directly into lower storage and network costs. LZ4 is better when you are CPU-constrained or need the absolute lowest producer latency.
Set compression at the producer level:
compression.type=zstd
The broker can be configured to preserve the producer’s compression or recompress, but in most cases you want the broker to accept the compressed batches as-is to avoid spending broker CPU on recompression.
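Since Kafka compresses whole producer batches, batching settings interact with the codec: larger batches generally compress better. A producer configuration sketch; the linger.ms and batch.size values are illustrative starting points, not recommendations:

```
# Compress batches on the producer; brokers store them as-is.
compression.type=zstd
# Larger batches improve compression ratio at a small latency cost.
linger.ms=20
batch.size=131072
```

If producer latency matters more than ratio, shrink linger.ms or switch the codec to lz4 rather than disabling batching.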
Measure the Impact
After enabling compression, monitor:
- Broker disk usage. You should see a meaningful reduction in daily growth rate.
- Network throughput. Both ingress and replication traffic should drop.
- Producer and consumer CPU. A small increase is expected. If it is more than 5-10%, consider switching to a lighter codec.
For JSON and Avro payloads, compression ratios of 60-80% are common. That means your 100 GB/day topic might use only 20-40 GB/day of disk after compression. Over a year, the storage savings alone can be significant.
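You can get a feel for your own ratios before touching production by compressing a sample of payloads offline. A minimal sketch using Python's standard library (zlib approximates GZIP; Snappy, LZ4, and ZSTD need third-party bindings); the sample record is invented:

```python
import json
import zlib

# Invented record resembling a typical JSON event payload.
record = {"user_id": 12345, "event": "page_view",
          "url": "/pricing", "ts": "2024-01-15T10:32:00Z"}
batch = ("\n".join(json.dumps(record) for _ in range(1000))).encode()

compressed = zlib.compress(batch, level=6)
ratio = 1 - len(compressed) / len(batch)
print(f"raw: {len(batch)} bytes, compressed: {len(compressed)} bytes, "
      f"saved: {ratio:.0%}")
```

One caveat: identical repeated records compress unrealistically well. Feed it a few thousand distinct events sampled from the real topic to get a number you can plan around.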
Retention Policy and Tiered Storage
Kafka’s default retention is 7 days (retention.ms=604800000). For many topics, this is either too long or too short. Getting retention right is a direct lever on storage cost.
Setting Retention Based on Consumer Needs
Ask yourself: how far back does any consumer actually need to read? If your fastest consumer keeps up in real time and your slowest consumer is a batch job that runs every 6 hours, you need at most 12-24 hours of retention (with some buffer for failures and reprocessing).
Reducing retention from 7 days to 1 day on a high-volume topic cuts your storage requirement by roughly 85%. For a topic producing 50 GB/day, that is the difference between 350 GB and 50 GB on disk.
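The arithmetic, with retention expressed in milliseconds the way Kafka configures it (retention.ms); the numbers match the 50 GB/day example above and cover a single replica:

```python
DAY_MS = 24 * 60 * 60 * 1000  # 86_400_000 ms per day

def disk_footprint_gb(daily_gb: float, retention_ms: int) -> float:
    """Steady-state disk usage for one replica of a topic."""
    return daily_gb * retention_ms / DAY_MS

before = disk_footprint_gb(50, 7 * DAY_MS)  # default 7-day retention
after = disk_footprint_gb(50, 1 * DAY_MS)   # reduced to 1 day
print(before, after)  # 350.0 50.0
```

Multiply by your replication factor to get the cluster-wide figure, since every replica stores the full retention window.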
For topics where some consumers need recent data in real time but you also want to keep a long-term archive, do not extend Kafka retention to 30 or 90 days. That is what object storage is for.
Tiered Storage
Tiered storage is a feature (available in Confluent Platform, Apache Kafka 3.6+, and Redpanda) that automatically moves older log segments from broker-local disk to object storage (S3, GCS, Azure Blob).
The economics are straightforward:
| Storage Type | Approximate Cost (AWS, per GB/month) |
|---|---|
| EBS gp3 | $0.08 |
| EBS io2 | $0.125 |
| S3 Standard | $0.023 |
| S3 Infrequent Access | $0.0125 |
Moving cold data from EBS to S3 reduces your storage cost by 3-6x. The tradeoff is that reading old data from S3 is slower than reading from local disk, but consumers that need old data typically tolerate higher latency.
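A sketch of the blended cost once most data ages out to object storage, using the per-GB prices from the table above; the 10 TB total and the 10% hot fraction are assumptions for illustration:

```python
GP3_PER_GB = 0.08   # EBS gp3, $/GB-month (from the table above)
S3_PER_GB = 0.023   # S3 Standard, $/GB-month

def monthly_storage_cost(total_gb: float, hot_fraction: float) -> float:
    """Blended cost when hot_fraction stays on broker-local disk (gp3)
    and the remainder is tiered out to S3."""
    hot = total_gb * hot_fraction
    cold = total_gb - hot
    return hot * GP3_PER_GB + cold * S3_PER_GB

all_local = monthly_storage_cost(10_000, 1.0)  # 10 TB entirely on EBS
tiered = monthly_storage_cost(10_000, 0.1)     # 10% hot, 90% tiered
print(f"${all_local:.0f}/mo vs ${tiered:.0f}/mo")
```

The blended savings are smaller than the raw 3-6x per-GB gap because the hot tier still sits on EBS, but they grow as retention gets longer and the cold fraction dominates.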
If you are running a managed platform like Streamkap, tiered storage is handled for you. You get the cost benefit without needing to configure broker storage policies, object storage buckets, or lifecycle rules.
Managed vs. Self-Hosted: The Real Cost Comparison
The monthly cloud bill for your Kafka brokers is the most visible cost, but it is not the largest. The largest cost is almost always the engineering time required to keep the cluster healthy.
Hidden Costs of Self-Hosted Kafka
Here is what the cloud bill does not show:
- Upgrades. Kafka releases security patches and bug fixes regularly. Rolling upgrades across a multi-broker cluster require planning, testing, and execution. Budget 4-8 hours of engineering time per upgrade, and you should be upgrading at least quarterly.
- Capacity planning. Predicting when you need to add brokers, expand disk, or rebalance partitions requires ongoing monitoring and analysis. Getting this wrong means either wasted resources or an emergency scramble.
- On-call. Kafka clusters need 24/7 monitoring. A broker going down at 3 AM triggers a page. The engineer on call investigates, restarts the broker, waits for partition reassignment, and verifies data integrity. That is 2-4 hours of disrupted sleep.
- Connector management. If you are running Kafka Connect for CDC, you need to manage connector configurations, handle task failures, monitor consumer lag, and troubleshoot deserialization errors. Each connector is another thing that can break.
- Security and compliance. TLS certificates, SASL authentication, ACLs, encryption at rest, audit logging. Each of these requires setup and ongoing maintenance.
A Rough Cost Model
Let us compare a self-hosted Kafka cluster against a managed platform for a moderate workload: 50 MB/s sustained throughput, 20 topics, 3-day retention.
Self-hosted (AWS):
| Item | Monthly Cost |
|---|---|
| 3x r6i.2xlarge brokers (on-demand) | $2,700 |
| 3x 1 TB gp3 EBS volumes | $240 |
| 3x m6i.large for Kafka Connect | $700 |
| Data transfer (cross-AZ replication) | $500 |
| Engineering time (0.5 FTE at $180k/year) | $7,500 |
| Total | ~$11,640 |
That 0.5 FTE is conservative. It accounts for a senior engineer spending about half their time on Kafka operations: upgrades, monitoring, capacity planning, on-call, and connector management. For many teams, the actual time commitment is higher.
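The comparison is easy to keep honest with a small model. The line items mirror the table above; plug in your own numbers, especially the FTE fraction, which is the term teams most often omit:

```python
def self_hosted_monthly(brokers: float, storage: float, connect: float,
                        transfer: float, fte_fraction: float,
                        fte_salary_annual: float) -> float:
    """Total monthly cost including the engineering time that
    never shows up on the cloud bill."""
    engineering = fte_fraction * fte_salary_annual / 12
    return brokers + storage + connect + transfer + engineering

total = self_hosted_monthly(
    brokers=2_700, storage=240, connect=700,
    transfer=500, fte_fraction=0.5, fte_salary_annual=180_000)
print(f"self-hosted: ${total:,.0f}/month")  # self-hosted: $11,640/month
```

Note that infrastructure is only about a third of the total here; the engineering line dominates, which is why headcount assumptions swing this comparison more than instance pricing does.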
Managed platform (e.g., Streamkap):
A managed CDC and streaming platform typically charges based on data volume or connector count. For this workload, monthly costs are generally in the $2,000-$5,000 range, depending on the specific platform and tier.
The managed option often comes in at 30-60% less than self-hosted once you factor in engineering time. And the comparison gets more favorable for the managed option as your pipeline grows, because the engineering time for self-hosted scales with cluster complexity while managed pricing scales more linearly with data volume.
Monitoring and Controlling Spend
Cost optimization is not a one-time project. It requires ongoing visibility into where your money goes and the ability to catch runaway costs before they hit the monthly bill.
Key Metrics for Cost Visibility
Set up dashboards that track:
- Broker disk usage by topic. Identify which topics consume the most storage. Often a small number of high-volume topics account for 80% of disk usage.
- Partition count by topic. Flag topics with high partition counts but low throughput.
- Consumer lag by consumer group. Persistent lag means you might need more consumer instances (additional cost) or your processing logic needs optimization (engineering time).
- Network throughput by broker. Uneven traffic across brokers indicates a partition imbalance that wastes capacity on underutilized brokers.
- Compression ratio by topic. Topics with low compression ratios might benefit from a different codec or payload format.
Cost Alerts
Set alerts for:
- Disk usage exceeding 70% on any broker. At this point, you need to either add storage, reduce retention, or add brokers. Acting early avoids emergency scaling at premium on-demand prices.
- Topic creation. Every new topic consumes resources. Require team leads to approve new topics and specify a retention policy and partition count based on expected throughput.
- Unused consumer groups. Consumer groups that are registered but not actively consuming waste partition assignment metadata and can slow down rebalances. Clean these up regularly.
Regular Cost Reviews
Schedule a monthly or quarterly review of your streaming infrastructure costs. Walk through:
- Top 10 topics by storage. Are retention settings still appropriate? Can any topics be compressed more aggressively?
- Broker utilization. Are any brokers consistently below 30% CPU or network? Consider consolidating or downsizing.
- Connector health. Are any connectors frequently restarting? Unstable connectors cause duplicate processing and wasted compute.
- Engineering time spent on operations. Track this honestly. If your team is spending more than a few hours per week on Kafka operations, the case for a managed platform gets stronger.
Putting It All Together
Here is a prioritized checklist for optimizing your streaming pipeline costs, ordered by impact and effort:
- Enable compression (ZSTD or LZ4) on all producers. High impact, low effort. Do this first.
- Audit retention policies. Reduce retention on topics where consumers do not need 7 days of history. High impact, low effort.
- Audit partition counts. Identify over-partitioned topics and consolidate where possible. Medium impact, medium effort.
- Right-size broker instances. Collect utilization metrics and downsize if appropriate. High impact, medium effort.
- Enable tiered storage. Move cold data to object storage. High impact for long-retention topics, medium effort.
- Evaluate managed platforms. Calculate your true total cost of ownership including engineering time. If self-hosted TCO exceeds what a managed platform like Streamkap charges, the switch pays for itself.
- Set up cost monitoring dashboards and alerts. Prevent cost creep by catching problems early. Medium impact, low ongoing effort.
Streaming pipelines do not have to be expensive. The teams that spend the least per GB of throughput are the ones that measure continuously, size based on evidence rather than guesswork, and make deliberate tradeoffs between cost, latency, and durability. Start with the low-hanging fruit, measure the results, and iterate.