The Hidden Costs of Self-Managed Kafka: What They Don't Tell You
Running your own Kafka clusters sounds simple until it isn't. Learn about the real operational costs, common failures, and staffing requirements of self-managed Apache Kafka.
Every Kafka tutorial starts the same way. You download the binary, run kafka-server-start.sh, produce a message, consume it, and think: “That wasn’t so bad.” Then you try to run it in production. That’s where the tutorial ends and the pain begins.
There’s a massive gap between “I got Kafka working on my laptop” and “Kafka is running reliably in production at 3am on a Saturday.” Most teams discover this gap the hard way - usually during an outage, usually under pressure, and usually after they’ve already committed to the architecture. This article is the guide I wish someone had handed me before I spent my first year operating Kafka clusters.
Broker Operations - The Never-Ending Tuning Session
Standing up a Kafka broker is straightforward. Keeping it healthy is a full-time job.
Start with JVM tuning. Kafka runs on the JVM, which means you’re now in the business of garbage collection tuning whether you like it or not. The default GC settings will work fine until they don’t - and when they don’t, you’ll see brokers pause for seconds at a time while the garbage collector runs a full stop-the-world collection. Most production clusters end up on G1GC with carefully tuned heap sizes (usually 6-8GB - bigger is not better here). You’ll spend time tuning -XX:MaxGCPauseMillis, -XX:InitiatingHeapOccupancyPercent, and a half-dozen other flags that you never expected to care about.
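For illustration, here's what those settings typically look like in the environment that launches the broker. The values below are common starting points drawn from widely shared production baselines, not a prescription - you'd validate them against your own GC logs:

```shell
# Heap: fixed 6GB. Kafka leans on the OS page cache for reads,
# so a bigger heap usually hurts more than it helps.
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"

# G1GC with a tight pause target -- illustrative baseline values.
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC \
  -XX:MaxGCPauseMillis=20 \
  -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:+ExplicitGCInvokesConcurrent"
```

These environment variables are picked up by Kafka's startup scripts, so they apply without editing the scripts themselves.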
Then there’s OS-level tuning. Kafka depends heavily on the Linux page cache for read performance. That means you need to understand vm.dirty_ratio, vm.dirty_background_ratio, and how your filesystem handles flushing to disk. You’ll need to bump ulimit -n for file descriptors because Kafka opens a lot of files - one per log segment per partition per replica. A broker handling 500 partitions with 7-day retention can easily have tens of thousands of open file handles. Network buffers (net.core.rmem_max, net.core.wmem_max) need tuning too, because the defaults are sized for a web server, not a message broker pushing gigabytes per second.
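A sketch of the OS-level knobs involved - every value here is an illustrative starting point, not a recommendation for your workload:

```ini
# /etc/sysctl.d/99-kafka.conf -- illustrative starting points
vm.dirty_background_ratio = 5     # start background flushing to disk earlier
vm.dirty_ratio = 60               # allow a large dirty page cache before blocking writers
vm.swappiness = 1                 # strongly discourage swapping broker memory
net.core.rmem_max = 2097152       # raise the socket receive buffer ceiling
net.core.wmem_max = 2097152       # raise the socket send buffer ceiling

# And in /etc/security/limits.conf, raise the file descriptor
# limit for the broker's user (value illustrative):
#   kafka  -  nofile  100000
```

None of this is exotic, but every line is something you now own, document, and re-verify after every base image update.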
Disk I/O is where things get really interesting. Kafka’s throughput is fundamentally bounded by disk write speed. You’ll wrestle with the choice between SSDs and HDDs, RAID configurations, filesystem selection (XFS tends to win), and mount options like noatime. And all of this is per broker - you’re maintaining this configuration across every node in the cluster.
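Those decisions typically land as a single fstab line per data disk - the device path and mount point here are placeholders:

```
# /etc/fstab -- one line per Kafka data disk (device path illustrative)
/dev/nvme1n1  /var/lib/kafka/data  xfs  noatime  0  0
```

One line, but it encodes the SSD-versus-HDD choice, the filesystem choice, and the mount-option tuning above it.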
ZooKeeper - The Dependency You Didn’t Sign Up For
For years, Kafka required Apache ZooKeeper for cluster coordination - leader election, configuration management, and tracking which brokers are alive. This means running Kafka actually means running two distributed systems. ZooKeeper is its own operational beast: it needs an odd number of nodes (usually 3 or 5), its own monitoring, its own disk management (transaction logs must be on separate disks from snapshots), and its own failure modes.
ZooKeeper session timeouts are a classic source of Kafka outages. If a broker can’t reach ZooKeeper within the session timeout window - due to a GC pause, network blip, or ZK being overloaded - the broker gets marked as dead. Partitions get reassigned. Consumers rebalance. Your pager goes off. And the broker was fine the whole time.
The Kafka community recognized this problem and introduced KRaft mode - Kafka without ZooKeeper, using an internal Raft-based consensus protocol. Great idea. Rough migration. Moving from ZooKeeper to KRaft in a running production cluster is a multi-step, multi-day procedure that requires careful planning, testing, and rollback preparation. If you’re running an older Kafka version, you might not even have KRaft support. If you’re running a newer version, you’re dealing with a feature that’s still maturing. Either way, you’re spending engineering time on cluster metadata infrastructure instead of building your product.
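For reference, a minimal KRaft configuration is only a handful of lines - node IDs and hostnames below are placeholders - but migrating an existing cluster's metadata onto it is the hard part:

```properties
# server.properties -- minimal KRaft (no ZooKeeper) sketch
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@kafka-1:9093,2@kafka-2:9093,3@kafka-3:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER
log.dirs=/var/lib/kafka/data
```

The simplicity of the config is exactly the point of KRaft; the operational cost is the journey from a ZooKeeper-backed cluster to this state.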
Partition Management - The Balancing Act That Never Ends
Partitions are Kafka's unit of parallelism, and choosing them well is an art form - with real consequences when you get it wrong.
Choose too few partitions and you bottleneck throughput. Choose too many and you increase broker memory usage, slow down leader elections, and make rebalancing operations take forever. The “right” number depends on your throughput requirements, consumer count, retention period, and message size - and it changes as your workload grows.
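A common back-of-the-envelope approach is to size partition counts so that neither the producer side nor the consumer side becomes the bottleneck. The sketch below uses made-up per-partition throughput numbers purely for illustration - in practice you'd benchmark your own:

```python
import math

def estimate_partitions(target_mb_s: float,
                        producer_mb_s_per_partition: float,
                        consumer_mb_s_per_partition: float) -> int:
    """Rough partition count: enough partitions that neither the
    producer nor the consumer side caps overall throughput."""
    need_for_producers = math.ceil(target_mb_s / producer_mb_s_per_partition)
    need_for_consumers = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(need_for_producers, need_for_consumers)

# e.g. 100 MB/s target, 10 MB/s per partition produced,
# 5 MB/s per partition consumed (all illustrative numbers)
print(estimate_partitions(100, 10, 5))  # -> 20
```

This gets you a starting number, not a final answer - it ignores retention, message size, key distribution, and growth, which is why the "right" count keeps moving.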
Then there’s partition distribution. Kafka tries to spread partitions evenly across brokers, but over time - as you add topics, remove topics, add brokers, or lose brokers - things drift. You end up with “hot” brokers handling disproportionate load while others sit nearly idle. Fixing this requires partition reassignment, which means moving data between brokers over the network. On a cluster with terabytes of data, a reassignment operation can take hours or days and saturate your network bandwidth in the process.
Tools like Cruise Control exist to help automate rebalancing, but they’re another system to deploy, configure, monitor, and debug. The tooling helps, but it doesn’t eliminate the problem.
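The built-in tool for manual reassignment is kafka-reassign-partitions.sh, and throttling the move is the difference between a controlled operation and a self-inflicted outage. A sketch - the file name, broker address, and throttle value are illustrative, and the --bootstrap-server form assumes a reasonably recent Kafka:

```shell
# Generate and review a plan first, then execute with a replication throttle
kafka-reassign-partitions.sh --bootstrap-server kafka-1:9092 \
  --reassignment-json-file plan.json --execute \
  --throttle 50000000   # cap reassignment traffic at ~50 MB/s

# Always verify afterwards -- this step also removes the throttle
kafka-reassign-partitions.sh --bootstrap-server kafka-1:9092 \
  --reassignment-json-file plan.json --verify
```

Forgetting the --verify step is a classic mistake: the throttle stays in place and quietly caps normal replication traffic.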
Disk Management - The Number One Operational Issue
If I had to pick one thing that causes the most Kafka outages, it’s disk. Full stop.
Kafka writes every message to disk. Messages accumulate based on your retention policy - time-based, size-based, or both. If you’re not watching disk usage per topic, per partition, and per broker, you will eventually fill a disk. When a disk fills up on a Kafka broker, the broker can’t write new messages. Producers start getting errors. If replication is writing to that broker, replicas fall behind. Under-replicated partitions appear. If enough brokers hit this state, you’re in a cascading failure.
The fix seems simple: set retention policies and monitor disk usage. In practice, it’s much harder. A single topic can have wildly different write rates depending on upstream behavior. A marketing campaign that doubles your event volume can eat through your disk headroom in hours. A schema change that increases message size by 30% quietly accelerates disk consumption. A compacted topic that suddenly gets a flood of unique keys can grow much faster than expected.
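Retention overrides are set per topic with kafka-configs.sh. A sketch - the topic name and limits are illustrative, and note that retention.bytes applies per partition, not per topic:

```shell
# Per-topic override: keep messages at most 24 hours, and cap each
# partition at ~50GB, whichever limit is hit first (values illustrative)
kafka-configs.sh --bootstrap-server kafka-1:9092 --alter \
  --entity-type topics --entity-name clickstream-events \
  --add-config retention.ms=86400000,retention.bytes=53687091200
```

The per-partition semantics of retention.bytes trip people up constantly: doubling a topic's partition count silently doubles its maximum disk footprint.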
Tiered storage - offloading old segments to object storage like S3 - is the long-term answer, but it adds yet another layer of configuration, monitoring, and failure modes. You’re now operating a two-tier storage system with its own consistency guarantees and performance characteristics.
Security - The Checkbox That Takes Months
Kafka supports TLS for encryption in transit, SASL for authentication, and ACLs for authorization. Checking those boxes on a compliance questionnaire is easy. Actually implementing them is another story.
TLS certificate rotation on a multi-broker cluster requires rolling restarts. If your certificates expire before you rotate them - and yes, this happens more often than anyone admits - your cluster stops accepting connections. SASL/SCRAM configuration means managing credentials for every client, every service, and every environment. ACLs require defining per-topic, per-consumer-group permissions for every application that touches Kafka. In a microservices architecture with dozens of services, this becomes a management headache fast.
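A single ACL grant looks innocuous enough - the sketch below uses placeholder principal, topic, and group names - but multiply it by every service-topic pair in a microservices estate and the management burden becomes obvious:

```shell
# Grant one service read access to one topic and its consumer group
# (principal, topic, and group names are illustrative)
kafka-acls.sh --bootstrap-server kafka-1:9092 --add \
  --allow-principal User:checkout-service \
  --operation Read --topic orders \
  --group checkout-consumers
```

Now repeat for writes, for every environment, and keep it all in sync as services are added and retired.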
Encryption at rest is often handled at the disk level (LUKS, EBS encryption), but that’s still your responsibility to configure, verify, and audit. Audit logging for who accessed what topic and when requires additional configuration and log management infrastructure.
Every security feature adds operational overhead. Every rolling restart for certificate rotation is a window where things can go wrong. And the penalty for skipping any of it is a failed compliance audit or, worse, a data breach.
Upgrades - The Fear That Keeps You on Old Versions
Kafka upgrades should be straightforward: rolling restart with the new binary, update inter-broker protocol versions, done. In practice, upgrade fear is real, and it keeps a shocking number of clusters running versions that are years out of date.
The fear isn’t irrational. A rolling upgrade touches every broker in the cluster. Each restart temporarily reduces cluster capacity. If something goes wrong mid-upgrade - an incompatible configuration, a bug in the new version, a client that doesn’t handle the new protocol - you’re in a partial-upgrade state that’s hard to reason about and harder to roll back.
Inter-broker protocol versions and log message format versions add another layer of complexity. You have to upgrade the binaries first, then bump the protocol versions in a separate rolling restart. Skip a step or do them out of order, and you get undefined behavior. Client compatibility is yet another concern - older clients may not speak the new protocol, and you can’t always upgrade clients and brokers simultaneously.
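The two-phase dance looks roughly like this in server.properties - version numbers are illustrative:

```properties
# Phase 1: roll out the new binaries across all brokers while
# pinning the old inter-broker protocol version
inter.broker.protocol.version=3.4

# Phase 2: only after every broker runs the new binary and the
# cluster is stable, bump the version and do a second rolling restart
# inter.broker.protocol.version=3.5
```

The danger is that nothing forces you to do phase 2 correctly - the cluster runs fine in the intermediate state until something depends on the new protocol.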
The result? Teams put off upgrades. They run versions with known bugs and missing security patches because the risk of upgrading feels higher than the risk of staying put. This is a failure mode in itself - a slow drift toward an increasingly fragile system.
Monitoring and Alerting - The Metrics Firehose
Kafka exposes metrics through JMX, and “a lot of metrics” is an understatement. A single broker can expose thousands of JMX MBeans covering everything from request latency to log flush rates to network handler idle percentage.
Figuring out which metrics actually matter is half the battle. At minimum, you need to track:
- Under-replicated partitions - the single most important indicator of cluster health
- ISR shrink/expand rate - tells you if replicas are struggling to keep up
- Consumer group lag - are your consumers falling behind?
- Request handler idle percentage - is the broker overwhelmed?
- Log flush latency - is your disk keeping up with writes?
- Network handler idle percentage - are you hitting network thread limits?
Setting up the monitoring stack itself is a project: JMX exporter to expose metrics in Prometheus format, Prometheus to scrape and store them, Grafana for dashboards, and alerting rules that page you at 3am when something is actually wrong (and not just noisy). Tools like Burrow help with consumer lag monitoring. Cruise Control helps with cluster balancing. Each one is another service to deploy and maintain.
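As one concrete example, an alerting rule for the most important metric on that list might look like this - assuming a Prometheus JMX exporter setup, and noting that the exact metric name depends on your exporter's mapping configuration:

```yaml
# alert-rules.yml -- metric name depends on your JMX exporter mapping
groups:
  - name: kafka
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Cluster has had under-replicated partitions for 5+ minutes"
```

Every metric on the list above needs a rule like this, each with its own threshold debate.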
Getting the alert thresholds right takes iteration. Too sensitive and you get alert fatigue. Too loose and you miss real problems. The tuning process alone can take months.
Incident Stories - When Theory Meets Reality
These are the kinds of scenarios that make self-managed Kafka teams earn their salaries.
The Disk Full Cascade. A team set retention to 7 days across all topics. One topic started receiving 10x normal volume due to an upstream bug. Disk filled on one broker overnight. Replication to that broker started failing. Under-replicated partitions triggered reassignment. Reassignment increased disk usage on other brokers. Within two hours, two more brokers hit disk limits. The on-call engineer's pager went off at 2:47am. Recovery took six hours of manually deleting log segments, throttling replication, and rebalancing partitions.
The Partition Reassignment During Peak. An engineer kicked off a partition reassignment on a Friday afternoon (never do this) to even out a hot broker. The reassignment started moving terabytes of data across the internal network. Broker-to-broker bandwidth spiked. Producer latency doubled. Consumer lag started climbing. The team had to throttle the reassignment to a crawl, turning a 2-hour operation into a 3-day operation that ran through the weekend.
The ZooKeeper Split-Brain. A network partition isolated one ZooKeeper node from the other two. The isolated node still had an active session with one Kafka broker. That broker thought it was still the leader for some partitions. The other brokers elected new leaders. Two brokers believed they were leader for the same partitions. Producers wrote to both. Data diverged. Reconciling the duplicate writes took a week of manual analysis.
The Consumer Rebalancing Storm. A deployment pipeline rolled out a new version of a consumer application. The rolling deployment caused consumers to join and leave the group rapidly. Each join/leave triggered a group rebalance. Each rebalance paused consumption for all consumers in the group. The group spent 45 minutes in a continuous rebalance cycle, processing zero messages while the backlog grew to millions.
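Static group membership (KIP-345, available since Kafka 2.3) exists precisely to blunt this failure mode: a consumer that restarts within the session timeout rejoins under the same identity without triggering a rebalance. A sketch of the relevant consumer settings, with illustrative values:

```properties
# consumer.properties -- static membership sketch
group.id=order-processors
group.instance.id=order-processor-pod-1   # must be unique and stable per instance
session.timeout.ms=45000                  # long enough to cover a rolling restart
```

Of course, knowing this feature exists, wiring stable instance IDs into your deployment tooling, and tuning the timeout are all - once again - your team's job.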
The True Cost Calculation
Here’s the math most teams don’t do before choosing self-managed Kafka.
Infrastructure costs for a production-grade cluster (3 brokers, 3 ZooKeeper nodes, monitoring stack) in AWS: $5,000-$12,000/month depending on instance sizes, storage, and data transfer. Add a staging environment and you’re doubling that.
Engineering costs are where it gets real. You need at least 1-2 engineers with deep Kafka expertise spending meaningful time on operations. At fully loaded costs of $200K-$300K per engineer per year, that’s $200K-$600K in labor. These are senior engineers who could otherwise be building features your customers are asking for.
On-call costs are harder to quantify but very real. Night and weekend pages, the cognitive load of being tethered to your phone, the burnout that comes with it. Teams that don’t account for on-call costs in their TCO analysis are kidding themselves.
Opportunity cost is the biggest hidden expense. Every hour your senior engineers spend debugging a Kafka partition reassignment is an hour they’re not spending on the product. Over a year, this adds up to months of lost product development.
Add it all up: $60K-$144K/year in infrastructure plus $200K-$600K/year in engineering time plus on-call burden plus opportunity cost. A conservative total cost of ownership lands somewhere between $300K and $750K per year.
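That math is easy to sanity-check. The sketch below just reproduces the article's figures - plug in your own numbers, and remember it deliberately excludes on-call and opportunity cost, which push the real total higher:

```python
def kafka_tco_per_year(infra_per_month: float,
                       num_engineers: float,
                       cost_per_engineer: float,
                       ops_time_fraction: float) -> float:
    """Rough self-managed Kafka TCO: infrastructure plus the slice of
    engineering time spent on operations. On-call burden and opportunity
    cost are real but hard to quantify, so they're left out here."""
    infra = infra_per_month * 12
    labor = num_engineers * cost_per_engineer * ops_time_fraction
    return infra + labor

# Low end from the article: $5K/month infra, 1 engineer at $200K on ops
print(kafka_tco_per_year(5_000, 1, 200_000, 1.0))   # -> 260000.0
# High end: $12K/month infra, 2 engineers at $300K
print(kafka_tco_per_year(12_000, 2, 300_000, 1.0))  # -> 744000.0
```

Even before adding the unquantified costs, the low end lands within rounding distance of the $300K floor above.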
When Self-Managed Actually Makes Sense
It’s not all doom and gloom. There are legitimate reasons to run your own Kafka.
If you're operating at massive scale - hundreds of brokers, petabyte-scale throughput - the economics can shift in favor of self-managed. At that scale, you likely already have a dedicated platform team with deep Kafka expertise. The per-message cost of managed services at those volumes may exceed what you'd spend running it yourself.
If you have strict compliance or data residency requirements that no managed provider can meet, self-managed may be your only option. Some regulated industries need complete control over where data lives and who can access it.
If you already have a mature platform engineering team that operates other distributed systems (Kubernetes, databases, etc.), adding Kafka to their portfolio is incremental rather than net-new. The operational muscle memory transfers.
But be honest with yourself about whether these conditions actually apply to your organization. Most teams that choose self-managed Kafka do so because they underestimate the operational burden, not because they’ve done the TCO analysis and concluded it makes economic sense.
The Managed Alternative
Managed Kafka removes the operational surface area described in this article. No broker provisioning. No JVM tuning. No disk monitoring. No ZooKeeper maintenance. No upgrade coordination. No 3am pages because a disk filled up.
Streamkap runs Kafka as part of its streaming data platform. The infrastructure is monitored, scaled, and maintained by a team that does nothing but operate streaming infrastructure. Brokers are patched, certificates are rotated, partitions are balanced, and disks are managed - all without you thinking about it.
The trade-off is straightforward: you pay a managed service fee and get your engineering hours back. For most teams, the managed service fee is a fraction of what they’d spend on engineering time to operate Kafka themselves. The engineers you free up go back to building the product that actually makes your company money.
If you’ve read this far and recognized your own team in any of these scenarios, it might be time to stop fighting infrastructure and start building on top of it. Talk to us about managed Kafka and get your weekends back.