Engineering

February 26, 2026

13 min read

Kafka on Kubernetes: Real-World Lessons

Running Kafka on Kubernetes sounds like a good idea until you hit storage, networking, and operational challenges. Here's what teams learn the hard way and how to avoid the common pitfalls.

TL;DR: Kafka on Kubernetes works, but it is not simple. The main challenges are persistent storage (EBS latency, PV reclaim policies, storage class tuning), broker discovery (headless services, advertised listeners, load balancer costs), rolling upgrades (pod disruption budgets, ISR awareness), and monitoring (JMX export, resource limits). Strimzi is the most mature operator. Before committing, honestly assess whether your team has the Kubernetes expertise to debug Kafka issues at the intersection of distributed systems and container orchestration.

Running Kafka on Kubernetes is one of those ideas that sounds perfectly logical in an architecture review. Your team already runs everything else on K8s. Why not Kafka too? The answer is that it can work, but Kafka pushes Kubernetes harder than most workloads, and the failure modes sit at the intersection of two complex distributed systems. This guide covers what teams actually encounter when they make the move, drawn from production deployments and the mistakes that informed them.

Why Teams Move Kafka to Kubernetes

The motivations are usually some combination of the following:

Standardization. If your platform team manages everything through Kubernetes, having Kafka on dedicated VMs or a separate managed service creates operational friction. Different deployment pipelines, different monitoring stacks, different access control patterns. Putting Kafka on K8s means one set of tools for everything.

Infrastructure as code. Kubernetes resources are declarative YAML. Your entire Kafka cluster configuration (broker count, storage, networking, topic defaults) lives in version-controlled manifests. Changes go through pull requests. Rollbacks are kubectl apply of a previous commit.

Bin-packing and cost. On dedicated VMs, Kafka brokers often run at 20-30% CPU utilization with bursts during rebalancing. On Kubernetes, those idle resources can be used by other workloads. The savings are real, especially at scale.

Dynamic scaling. Adding brokers means applying an updated manifest rather than provisioning new VMs, installing Kafka, configuring it, and joining the cluster manually.

These are legitimate benefits. The problem is that Kafka is stateful, disk-heavy, and network-sensitive. It is not a twelve-factor app that you can kill and restart anywhere. Each broker has a unique identity, owns specific topic partitions, stores data on disk that must survive restarts, and advertises a specific network address that clients connect to directly. Every one of those properties creates friction with Kubernetes defaults.

Choosing an Operator

You should not deploy Kafka on Kubernetes without an operator. The manual effort of managing StatefulSets, persistent volumes, configuration, and rolling upgrades for Kafka is substantial and error-prone. Three options dominate the space.

Strimzi

Strimzi is a CNCF Sandbox project and the most widely adopted open-source Kafka operator. It defines custom resources for Kafka clusters, topics, users, connectors, and mirror makers.

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: production-cluster
spec:
  kafka:
    version: 3.7.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
      - name: external
        port: 9094
        type: loadbalancer
        tls: true
    storage:
      type: persistent-claim
      size: 500Gi
      class: kafka-storage
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
    resources:
      requests:
        memory: 8Gi
        cpu: "2"
      limits:
        memory: 8Gi
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 100Gi
      class: kafka-storage

Strimzi handles rolling upgrades, certificate management, rack awareness, and Cruise Control integration. It is the default choice for teams that do not have a Confluent license.

Confluent for Kubernetes (CFK)

CFK is the official operator from Confluent. It deploys the full Confluent Platform (Schema Registry, Connect, ksqlDB, Control Center) alongside Kafka. If you are already paying for Confluent, CFK gives you tighter integration and commercial support. For teams running community Apache Kafka, CFK adds unnecessary complexity.

Bitnami Helm Charts

Bitnami provides a Helm chart that deploys Kafka as a StatefulSet. It is simpler than an operator and works for development or small deployments. What it lacks is the operational automation: it will not do ISR-aware rolling upgrades, automatic certificate rotation, or topic management through custom resources. Most teams that start with Bitnami charts migrate to Strimzi once they hit production.

Storage: The First Hard Problem

Kafka brokers write every message to disk. Throughput and latency depend directly on storage performance. On bare metal or VMs, you control the disk hardware. On Kubernetes, you are working through the persistent volume abstraction, which adds a layer of indirection that needs careful configuration.

Storage Class Configuration

The default storage class on most cloud providers is not tuned for Kafka workloads. On AWS, the default is usually gp2, which has a baseline of only 3 IOPS per GiB and bursts to at most 3,000 IOPS for volumes under 1 TiB, with higher latency than gp3. On GCP, the default pd-standard uses spinning disks.

Create a dedicated storage class for Kafka:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kafka-storage
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"
  throughput: "400"
  encrypted: "true"
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

Key settings to get right:

  • reclaimPolicy: Retain is non-negotiable. The default Delete policy destroys the EBS volume when the PVC is deleted. If a StatefulSet is accidentally deleted or recreated, your data is gone. With Retain, the volume persists and can be reattached.
  • volumeBindingMode: WaitForFirstConsumer ensures the volume is created in the same availability zone as the pod. Without this, you can end up with a volume in us-east-1a and a pod scheduled in us-east-1b, which means the pod cannot mount its volume and stays in Pending forever.
  • allowVolumeExpansion: true lets you resize volumes later. Some CSI drivers require a pod restart to pick up the expanded volume, so test this before you need it in production.

Performance Expectations

Network-attached storage (EBS, pd-ssd) introduces latency that you do not see on local disks. Typical numbers:

Storage Type             Write Latency (p99)    Sequential Throughput
Local NVMe SSD           0.1-0.5 ms             1-3 GB/s
AWS gp3 (6K IOPS)        1-3 ms                 400 MB/s
AWS io2 (provisioned)    0.5-1 ms               up to 4 GB/s
GCP pd-ssd               1-2 ms                 up to 1.2 GB/s

For most Kafka workloads, gp3 with provisioned IOPS is the sweet spot between cost and performance. If you need sub-millisecond latency, local SSDs are the only option, but they introduce a tradeoff: local storage is ephemeral. If the node fails, the data on that disk is lost. This is safe only if your replication factor is 3 or higher and your min.insync.replicas gives you room to lose a broker.
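Topic-level settings determine how much broker (or local-disk) loss you can absorb. A sketch of creating a topic with that safety margin, assuming a broker reachable at localhost:9092 and an illustrative topic name:

kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic events \
  --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2

With a replication factor of 3 and min.insync.replicas=2, one broker can disappear without halting producers that use acks=all.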

Volume Sizing

Size your volumes for the maximum data you will store, not the average. Kafka retention is time-based or size-based, and disk usage peaks during rebalancing when partitions are being replicated to new brokers. A good rule of thumb: provision 2x your expected steady-state storage to handle rebalancing and bursts.
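As a worked example of the 2x rule, with assumed numbers:

cluster ingress:          30 MB/s (assumed)
replication factor:       3    -> 90 MB/s written cluster-wide
retention:                24 h -> 90 MB/s x 86,400 s ≈ 7.8 TB on disk
per broker (3 brokers):   ≈ 2.6 TB steady state
provisioned (2x rule):    ≈ 5.2 TB per broker volume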

Expanding PVs after the fact is possible but disruptive on some providers. On AWS with the EBS CSI driver, volume expansion happens online. On other providers, the pod may need to be restarted. Do not count on live expansion working without testing it in your specific environment first.
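With Strimzi, the cleanest path is to change the size in the Kafka custom resource and let the operator drive the PVC resize. A sketch, assuming the cluster from the earlier manifest:

kubectl patch kafka production-cluster --type merge \
  -p '{"spec":{"kafka":{"storage":{"size":"1Ti"}}}}'

# watch the PVCs pick up the new size
kubectl get pvc -l strimzi.io/name=production-cluster-kafka -w

This only works if the storage class has allowVolumeExpansion: true, and volumes can only grow, never shrink.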

Networking: The Second Hard Problem

Kafka’s networking model predates Kubernetes by years. Clients connect to any broker, receive metadata about the cluster, and then connect directly to the broker that leads the partition they want to produce to or consume from. This direct-to-broker communication is fundamental to Kafka’s design and creates the biggest headache on Kubernetes.

Internal Access

For clients running inside the same Kubernetes cluster, the standard approach is a headless service. Strimzi creates this automatically:

apiVersion: v1
kind: Service
metadata:
  name: production-cluster-kafka-brokers
spec:
  clusterIP: None
  selector:
    strimzi.io/name: production-cluster-kafka
  ports:
    - name: tcp-clients
      port: 9092

Each broker gets a stable DNS name following the pattern <cluster>-kafka-<ordinal>.<cluster>-kafka-brokers.<namespace>.svc.cluster.local. Kafka’s advertised.listeners configuration is set to these DNS names so that when a client receives metadata, the broker addresses resolve correctly inside the cluster.
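Concretely, each broker's advertised listener ends up pointing at its per-pod DNS name. An illustrative server.properties fragment for broker 0 of the cluster above, assuming a namespace called kafka (Strimzi generates the equivalent automatically; you would only write this by hand in a manual StatefulSet deployment):

advertised.listeners=PLAIN://production-cluster-kafka-0.production-cluster-kafka-brokers.kafka.svc.cluster.local:9092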

This works well. The problems start when you need access from outside the cluster.

External Access

External clients (applications running outside Kubernetes, monitoring tools, other clusters) need to reach individual brokers at routable addresses. Three patterns exist, and none is ideal:

LoadBalancer per broker. Each broker gets its own cloud load balancer. On AWS, this means one NLB per broker. A 3-broker cluster creates 3 load balancers, each costing roughly $16/month plus data transfer. A 9-broker cluster creates 9 load balancers. The cost scales linearly and the per-broker load balancer feels wasteful since each one handles traffic for a single pod.

NodePort. Each broker exposes a unique port on every node in the cluster. Clients connect to <any-node-ip>:<node-port>. This avoids load balancer costs but exposes high-numbered ports on your nodes, which may conflict with security policies. NodePorts also change if the service is recreated, breaking client configurations.

Ingress with TCP support. Some Ingress controllers (like ingress-nginx with TCP ConfigMap or Envoy-based controllers) can route TCP traffic by port or SNI. This is the most elegant solution but requires an Ingress controller that supports non-HTTP protocols, which not all do. Configuration is more involved.
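For ingress-nginx specifically, TCP routing is configured through a ConfigMap referenced by the controller's --tcp-services-configmap flag. A sketch that exposes only the bootstrap service (service names assume the Strimzi cluster above, in a kafka namespace); each broker would still need its own port or an SNI-routed hostname:

apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  # external port 9094 -> namespace/service:port inside the cluster
  "9094": "kafka/production-cluster-kafka-external-bootstrap:9094"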

The most common failure teams hit is a mismatch between advertised.listeners and the actual address clients can reach. The broker tells the client “connect to me at X” but X does not resolve or is not reachable from the client’s network. Symptoms: the client connects to the bootstrap server successfully, receives metadata, then hangs trying to connect to individual brokers. If you see this pattern, check advertised.listeners first.
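A quick way to see exactly what the brokers are advertising is to request cluster metadata from where the failing client runs and read the addresses back. A sketch using kcat (formerly kafkacat), with an illustrative bootstrap address:

kcat -b bootstrap.kafka.example.com:9094 -L

The output lists every broker with the host and port that clients will be told to dial. If those addresses do not resolve or are not reachable from the client's network, the advertised.listeners configuration is wrong for that network path.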

Strimzi handles advertised listener configuration automatically for each listener type, which is one of the strongest reasons to use an operator rather than manual configuration.

Operational Realities

Day-to-day operations are where Kafka on Kubernetes either proves its value or becomes a constant source of pain.

Rolling Upgrades

Upgrading Kafka brokers requires restarting each one sequentially while maintaining cluster availability. Strimzi automates this process and makes it ISR-aware: it will not restart a broker if doing so would cause an under-replicated partition. But you need to configure the safety mechanisms:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      strimzi.io/name: production-cluster-kafka

The PodDisruptionBudget ensures that Kubernetes itself (node draining, cluster autoscaler) will not evict more than one broker at a time. Without this, a node drain could kill multiple brokers simultaneously, causing unavailability or even data loss despite a replication factor of 3.

For controlled shutdowns, set controlled.shutdown.enable=true (it is true by default) and give it enough time:

config:
  controlled.shutdown.max.retries: 3
  controlled.shutdown.retry.backoff.ms: 5000

This gives Kafka time to transfer partition leadership away from the broker that is shutting down before it actually stops, reducing the impact on producing and consuming clients.
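Leadership does not move back the instant a restarted broker rejoins; by default Kafka periodically restores preferred leaders on its own (auto.leader.rebalance.enable=true). If you want to rebalance leadership immediately after a rolling restart, the stock tooling can trigger it. A sketch, assuming a broker reachable at localhost:9092:

kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type preferred --all-topic-partitions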

Resource Limits

CPU throttling is one of the most insidious Kafka problems on Kubernetes. When a pod hits its CPU limit, the kernel throttles it. For Kafka, throttling means slower request handling, which increases produce and fetch latency, which causes clients to time out, which triggers rebalancing, which creates more CPU load. It is a cascading failure pattern.

The safest approach is to set requests equal to limits for both CPU and memory, giving Kafka a guaranteed QoS class:

resources:
  requests:
    memory: 8Gi
    cpu: "4"
  limits:
    memory: 8Gi
    cpu: "4"

Setting requests lower than limits (burstable QoS) saves money but risks throttling under load. For Kafka brokers, the guaranteed QoS class is worth the cost.

Memory is more predictable. Kafka uses the OS page cache heavily, so the broker’s JVM heap is typically only 4-6 GB. The remaining memory in the pod’s allocation goes to page cache. A broker with 8Gi limit running a 4 GB heap has 4 GB of page cache, which significantly improves read performance for consumers that are near the tail of the log.
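In Strimzi, the heap is pinned through jvmOptions on the Kafka resource, separately from the pod's memory request. A sketch consistent with the 8Gi pod above:

kafka:
  jvmOptions:
    "-Xms": "4g"
    "-Xmx": "4g"

Setting -Xms equal to -Xmx avoids heap resizing pauses and makes the page cache share of the pod's memory predictable.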

JMX Monitoring

Kafka exposes metrics through JMX, which does not play nicely with Kubernetes monitoring (Prometheus). The standard pattern is a JMX exporter sidecar or init container. Strimzi supports this natively:

kafka:
  metricsConfig:
    type: jmxPrometheusExporter
    valueFrom:
      configMapKeyRef:
        name: kafka-metrics
        key: kafka-metrics-config.yml

Key metrics to track:

  • kafka_server_BrokerTopicMetrics_MessagesInPerSec (throughput)
  • kafka_server_ReplicaManager_UnderReplicatedPartitions (replication health)
  • kafka_network_RequestMetrics_TotalTimeMs (request latency by type)
  • kafka_server_KafkaRequestHandlerPool_RequestHandlerAvgIdlePercent (saturation)
  • kafka_log_LogFlushRateAndTimeMs (storage performance)

Set alerts on UnderReplicatedPartitions > 0 and RequestHandlerAvgIdlePercent < 0.3. These are your early warnings that something is going wrong.
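If you run the Prometheus Operator, those alerts can be expressed as a PrometheusRule. A sketch; the exact metric names depend on your JMX exporter relabeling rules, so treat these as illustrative:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
spec:
  groups:
    - name: kafka
      rules:
        - alert: KafkaUnderReplicatedPartitions
          expr: sum(kafka_server_ReplicaManager_UnderReplicatedPartitions) > 0
          for: 5m
          labels:
            severity: critical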

Log Retention and Disk Pressure

Kafka does not know about Kubernetes disk pressure eviction. If a broker’s persistent volume fills up (because retention was set too long or a burst of traffic exceeded expectations), Kafka stops accepting writes on the affected broker. Meanwhile, Kubernetes may or may not detect this as a problem depending on your monitoring.

Set up alerting on volume utilization at 70% and 85% thresholds. The kubelet’s eviction thresholds are for the node’s root filesystem, not for mounted PVs, so Kubernetes itself will not evict the pod when a PV fills up.
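The kubelet does export per-PVC usage metrics, so the volume alert can live alongside the Kafka rules. A sketch of one rule, with the PVC name pattern assumed from a Strimzi deployment:

- alert: KafkaVolumeFillingUp
  expr: |
    kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"data-production-cluster-kafka-.*"}
      / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data-production-cluster-kafka-.*"} > 0.70
  for: 15m
  labels:
    severity: warning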

Common Failure Modes

Knowing the failure modes in advance saves hours of debugging.

Pod eviction during log compaction. Log compaction is memory-intensive. If a broker’s memory limit is set too tight, compaction threads push the JVM into aggressive garbage collection or the process gets OOMKilled. The pod restarts, compaction starts again, and the cycle repeats. Fix: increase memory limits or tune log.cleaner.dedupe.buffer.size and log.cleaner.threads.
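A sketch of those tuning knobs in the Kafka config block; the values are illustrative starting points, not recommendations for every workload:

config:
  # default is 128 MB; a larger dedupe buffer means fewer cleaner passes
  log.cleaner.dedupe.buffer.size: 268435456
  # default is 1; extra threads spread compaction load
  log.cleaner.threads: 2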

OOMKilled from page cache pressure. This is counterintuitive. The Linux kernel counts page cache against a container’s memory cgroup. A broker doing heavy sequential reads (consumer catchup) can fill the page cache, and if the memory limit is too close to heap + expected page cache usage, the OOM killer fires. Fix: add headroom above heap + expected page cache. If your heap is 4 GB, set the limit to at least 8 GB.

Restart storms from aggressive liveness probes. If liveness probe timeouts are too short, a broker that is merely slow (a GC pause, a disk flush, rebalancing load) gets killed and restarted, which makes the problem worse. Set liveness probe timeouts generously:

livenessProbe:
  initialDelaySeconds: 60
  timeoutSeconds: 10
  periodSeconds: 30
  failureThreshold: 5

A total failure detection time of 150 seconds (30s period x 5 failures) is appropriate for Kafka. Brokers are designed to tolerate temporary unavailability through replication.

PV stuck in Released state. When a PVC is deleted but the PV has reclaimPolicy: Retain, the PV enters a Released state and cannot be automatically rebound to a new PVC. If you recreate the StatefulSet, the new PVCs create new PVs instead of reusing the old ones. To reattach, you need to manually edit the PV to remove the claimRef field:

kubectl patch pv <pv-name> -p '{"spec":{"claimRef": null}}'

Then recreate the PVC with the same name as the original, and it will bind to the existing PV.
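When recreating the claim, pinning it to the old volume with volumeName makes the rebind explicit. A sketch, with the PVC name following Strimzi's data-<cluster>-kafka-<ordinal> pattern and <pv-name> left as a placeholder:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-production-cluster-kafka-0
spec:
  storageClassName: kafka-storage
  volumeName: <pv-name>
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi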

DNS resolution delays during scaling. When adding new brokers, the new pods get DNS entries through the headless service. DNS caching in the cluster (CoreDNS) can cause a delay before other pods and clients resolve the new broker addresses. This is usually 30 seconds or less, but during that window, metadata requests may return stale broker lists. Clients handle this through retries, but it can cause transient errors in applications with aggressive timeout settings.
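On the client side, a few producer settings determine how gracefully that window is absorbed. An illustrative properties fragment; the values are starting points, not universal recommendations:

# refresh cluster metadata more often than the 5-minute default
metadata.max.age.ms=60000
# back off between retries instead of hammering a stale broker list
retry.backoff.ms=500
# give the whole send, including retries, time to outlast a DNS delay
delivery.timeout.ms=120000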

When Kafka on Kubernetes Makes Sense (and When It Does Not)

Kafka on Kubernetes is a reasonable choice when several conditions are true at the same time: your team already operates stateful workloads on Kubernetes (databases, Elasticsearch, or similar), you have engineers who can debug both Kafka and Kubernetes issues, and you are willing to invest in proper operator configuration, storage tuning, and monitoring.

It is a poor fit when Kafka would be your first or only stateful workload on Kubernetes. The operational complexity is high, and you will be learning Kubernetes storage, networking, and scheduling semantics through the lens of Kafka production incidents. That is an expensive way to learn.

It is also a poor fit when Kafka is primarily transport infrastructure for another purpose, such as change data capture. If your Kafka cluster exists to move database changes to a data warehouse or analytics platform, the operational cost of running Kafka (on Kubernetes or anywhere else) is overhead that does not add business value. Managed services and CDC platforms exist specifically to absorb that complexity so your team can focus on what the data is used for rather than how it moves.

The honest assessment comes down to one question: when something breaks at 2 AM, does your on-call engineer have the skills to diagnose whether the problem is Kafka, Kubernetes, the storage layer, or the networking layer? If the answer is yes, Kafka on Kubernetes can serve you well. If the answer is “we’ll figure it out,” consider whether the standardization benefits are worth the operational risk.