
Engineering

February 25, 2026

10 min read

Data Partitioning Strategies for Streaming Pipelines

How to design partitioning strategies in streaming systems for downstream query performance. Covers Kafka topic partitioning, warehouse partitioning, partition key selection, repartitioning in Flink, and hot partition mitigation.

TL;DR:

  • Kafka partitioning controls parallelism and ordering; warehouse partitioning controls query pruning and storage layout. They serve different purposes and often need different keys.
  • Choosing a high-cardinality, evenly distributed partition key prevents hot partitions that bottleneck your entire pipeline.
  • Repartitioning mid-stream with Flink lets you reorganize data for downstream systems without touching your source topics.
  • Monitor partition lag asymmetry as an early warning for hot partitions before they cause consumer group rebalances.

Partitioning is one of those topics that seems straightforward until you realize the same word means different things at different layers of your stack. In Kafka, a partition is a unit of parallelism and ordering. In Snowflake, it is a micro-partition that determines how data is physically clustered for query pruning. In Hive or Iceberg, it is a directory-level split by column values. Getting partitioning right at each layer determines whether your streaming pipeline runs smoothly or grinds to a halt under skewed load and slow queries.

This article covers how partitioning works across the streaming stack, how to choose keys that serve both real-time processing and downstream analytics, and how to fix the problems that arise when your initial partitioning strategy meets the reality of production data.

Kafka Partitioning: Parallelism and Ordering

A Kafka topic is divided into partitions, and each partition is an ordered, append-only log. Producers assign each record to a partition, and consumers in a consumer group are each assigned a subset of partitions. This is the fundamental mechanism for both parallelism (more partitions = more consumers processing simultaneously) and ordering guarantees (records within a partition are processed in order).

How Records Get Assigned to Partitions

When a producer sends a record, the partition is determined by one of three methods:

  1. Explicit partition assignment: The producer specifies the partition number directly. Rarely used in practice.
  2. Key-based hashing: If the record has a key, Kafka hashes it (using murmur2 by default) and takes the result modulo the partition count. Records with the same key always go to the same partition.
  3. Round-robin or sticky: If there is no key, Kafka distributes records across partitions. The sticky partitioner (default since Kafka 2.4) fills one batch before moving to the next, improving batching efficiency.

For CDC pipelines, the partition key is typically the primary key of the source table. This means all changes to a given row land in the same partition, preserving the order of INSERT, UPDATE, and DELETE operations for that row. This is important because out-of-order CDC events cause incorrect state materialization downstream.

Source Table (orders)         Kafka Topic (orders-cdc)
┌─────────┬────────┐          ┌───────────────────────┐
│ id: 101 │ INSERT │   ──→    │ Partition 3 (key=101) │
│ id: 101 │ UPDATE │   ──→    │ Partition 3 (key=101) │
│ id: 202 │ INSERT │   ──→    │ Partition 7 (key=202) │
│ id: 101 │ DELETE │   ──→    │ Partition 3 (key=101) │
└─────────┴────────┘          └───────────────────────┘
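The per-key routing in the diagram can be sketched with a toy model. The polynomial hash below is an illustrative stand-in, not Kafka's actual murmur2, so the concrete partition numbers will differ; the point is only that the key → partition mapping is deterministic:

```python
def assign_partition(key: bytes, num_partitions: int) -> int:
    """Deterministic key -> partition mapping (illustrative stand-in for murmur2)."""
    h = 0
    for b in key:
        h = (h * 31 + b) & 0x7FFFFFFF  # simple polynomial hash, kept non-negative
    return h % num_partitions

# Every change event for row id=101 maps to the same partition,
# so its INSERT -> UPDATE -> DELETE sequence stays ordered.
events = [("101", "INSERT"), ("101", "UPDATE"), ("202", "INSERT"), ("101", "DELETE")]
placements = [(k, op, assign_partition(k.encode(), 12)) for k, op in events]
```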

When you set up CDC with Streamkap, the primary key is used as the partition key by default. This guarantees per-row ordering without any additional configuration. If your use case requires a different partitioning scheme, you can override the key at the connector level.

Choosing the Right Partition Count

The number of partitions determines your maximum consumer parallelism. If a topic has 12 partitions, at most 12 consumers in a single consumer group can process data simultaneously. Beyond 12 consumers, the extras sit idle.

Rules of thumb:

  • Start with 2-3x your expected consumer count to give room for scaling.
  • Target 1-10 MB/s per partition for balanced throughput. If a single partition exceeds this, you probably need more partitions.
  • Keep total partitions per cluster under 10,000-20,000. Each partition adds metadata overhead on the Kafka controller. High partition counts increase leader election time during broker failures.

Do not over-partition. 1,000 partitions for a topic that processes 100 records per second wastes memory, increases rebalance time, and complicates monitoring.
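As a rough sizing sketch combining those rules of thumb (the 5 MB/s target and 2x headroom below are assumed values within the ranges above, not fixed recommendations):

```python
import math

def suggest_partitions(peak_mb_per_s: float,
                       expected_consumers: int,
                       mb_per_s_per_partition: float = 5.0,  # assumed, within 1-10 MB/s
                       headroom: float = 2.0) -> int:        # assumed, within 2-3x
    """Take the larger of the throughput-driven and parallelism-driven counts."""
    by_throughput = math.ceil(peak_mb_per_s / mb_per_s_per_partition)
    by_parallelism = math.ceil(expected_consumers * headroom)
    return max(by_throughput, by_parallelism)
```

Under these assumptions, a topic peaking at 60 MB/s with 6 planned consumers gets max(12, 12) = 12 partitions.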

Warehouse Partitioning: Query Performance

Once data lands in your warehouse or lakehouse, partitioning serves a completely different purpose: query pruning. A well-partitioned table lets the query engine skip reading irrelevant data files entirely.

Time-Based Partitioning

The most common warehouse partitioning strategy is by time (date or hour). Most analytical queries include a time range filter, so partitioning by date means the engine only reads files for the relevant dates.

In Snowflake, micro-partitioning happens automatically based on data ingestion order. Since streaming pipelines insert data roughly in time order, Snowflake’s natural clustering aligns well with time-based queries without explicit partitioning. However, if you bulk-load historical data alongside streaming data, the natural clustering gets disrupted, and you may need to define a clustering key explicitly.

For Iceberg tables (used with Spark, Trino, or Flink), you define the partition spec directly:

CREATE TABLE orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_total DECIMAL(10, 2),
    event_time TIMESTAMP
) PARTITIONED BY (days(event_time));

The Kafka Key vs. Warehouse Partition Mismatch

Here is where it gets interesting. Your Kafka topic is partitioned by customer_id (for per-customer ordering), but your warehouse table is partitioned by event_date (for time-range queries). These are different keys solving different problems, and that is perfectly fine.

The mistake is assuming they should match. I have seen teams partition their Kafka topics by date to “align with the warehouse,” destroying per-entity ordering in the process. Do not do this. Let each layer partition for its own needs.

If you need your streaming data to arrive at the warehouse already organized by date, that is the job of the ingestion layer. Streamkap handles this naturally: CDC data flows through Kafka partitioned by primary key (preserving ordering), and Streamkap’s Snowflake and BigQuery connectors write data in time-ordered batches that align with the warehouse’s preferred clustering.

Partition Key Selection Deep Dive

Choosing a partition key is one of the most consequential early decisions in pipeline design. A bad key choice echoes through every system that processes the data.

Criteria for a Good Partition Key

High cardinality. The key should have many distinct values. Partitioning a global e-commerce topic by country_code gives you 200-ish distinct values. If one country generates 60% of your traffic, one partition handles 60% of the work. Partitioning by order_id gives you millions of distinct values, distributed nearly uniformly.

Even distribution. Cardinality alone is not enough. If 10% of your customer IDs generate 90% of your events (think: a SaaS platform where one enterprise tenant dwarfs all others), customer_id partitioning creates hot partitions. You need to know your data distribution before choosing a key.

Stable over time. The key should not change for a given entity. If you partition by a field that gets updated (like status), the same entity’s events will scatter across partitions, breaking ordering.

Meaningful for downstream processing. If your Flink job needs to group by customer_id for sessionization, partitioning by customer_id means the data arrives pre-grouped, avoiding an expensive shuffle (repartition) in Flink.
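Before committing to a key, you can sanity-check the even-distribution criterion by simulating partition loads over a sample of real key values. This sketch uses crc32 as a stable stand-in for Kafka's hash:

```python
import zlib
from collections import Counter

def max_to_avg_load(keys: list[str], num_partitions: int) -> float:
    """Simulate partition loads for a candidate key: ~1.0 is even, >>1 is hot."""
    loads = Counter(zlib.crc32(k.encode()) % num_partitions for k in keys)
    avg = len(keys) / num_partitions
    return max(loads.values()) / avg

# A dominant tenant concentrates load on one partition:
skewed = ["ACME_CORP"] * 900 + [f"cust-{i}" for i in range(100)]
even = [f"order-{i}" for i in range(1000)]
```

Over 12 partitions, the skewed sample scores far above 1 while the even sample stays close to 1, flagging the skew before it reaches production.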

Composite Keys

Sometimes a single field does not satisfy all criteria. A composite key combines multiple fields:

partition_key = hash(tenant_id + "_" + entity_id)

This is common in multi-tenant systems. Partitioning by entity_id alone might create hot partitions if entity IDs are sequential and the hash function produces clusters. Adding tenant_id as a prefix distributes the load more evenly.
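A concrete version of that pseudocode might look like this. md5 is used here only as a convenient stable hash for illustration; in a real pipeline Kafka would hash the resulting key bytes with murmur2:

```python
import hashlib

def composite_key(tenant_id: str, entity_id: str) -> bytes:
    # Prefixing the tenant breaks up runs of sequential entity IDs.
    return f"{tenant_id}_{entity_id}".encode("utf-8")

def assign_partition(key: bytes, num_partitions: int) -> int:
    # Stable hash of the composite key, modded by partition count.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Sequential entity IDs from one tenant no longer cluster together, while the same (tenant, entity) pair still maps to a stable partition.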

Repartitioning in Flink

Sometimes your Kafka partition key is wrong for a downstream consumer. The source topic is partitioned by order_id, but your Flink job needs to compute per-customer aggregates, so Flink has to repartition (shuffle) the data by customer_id.

Explicit KeyBy

In the Flink DataStream API, keyBy triggers a network shuffle that repartitions data by the specified key:

DataStream<CustomerTotals> perCustomer = env
    .addSource(kafkaSource)
    .keyBy(order -> order.getCustomerId())   // network shuffle by customer
    .process(new CustomerAggregation());     // KeyedProcessFunction emitting CustomerTotals

In Flink SQL, a GROUP BY or a join condition implicitly triggers repartitioning:

SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(order_total) AS total_spent
FROM orders_stream
GROUP BY customer_id;

Repartitioning is expensive. It involves serializing each record, sending it over the network to the correct task manager, and deserializing it. For high-throughput pipelines, this shuffle dominates your Flink job’s network I/O.

Writing to a Repartitioned Topic

If multiple downstream consumers need data partitioned differently, the efficient pattern is to repartition once and write to a new Kafka topic:

// The partition in the new topic is determined by the Kafka record key,
// so OrderSerializer must set customer_id as the key for each record.
// (A keyBy before the sink changes Flink's internal partitioning, not Kafka's.)
KafkaSink<Order> sink = KafkaSink.<Order>builder()
    .setBootstrapServers("kafka:9092")
    .setRecordSerializer(new OrderSerializer("orders-by-customer"))
    .build();

orders.sinkTo(sink);

Now the orders-by-customer topic is partitioned by customer ID, and any consumer reading from it gets pre-shuffled data. This avoids every consumer independently repartitioning the same data.

Hot Partitions: Detection and Mitigation

A hot partition is the streaming equivalent of a database hot row: one partition receives a disproportionate share of traffic, becoming a bottleneck for the entire consumer group.

Detecting Hot Partitions

Monitor these metrics per partition:

  • Consumer lag per partition: If partition 7 has 10 million records of lag while other partitions have 100,000, partition 7 is hot.
  • Bytes-in rate per partition: Available via Kafka JMX metrics or your monitoring platform.
  • Consumer processing time per partition: If one consumer in the group takes 10x longer to process its assigned partitions, check whether those partitions are receiving more data.

Set up alerts on lag asymmetry. A simple check: if max(partition_lag) / avg(partition_lag) > 5, investigate.
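That check is a few lines of code. The lag numbers below are invented; in practice they would come from your monitoring platform or kafka-consumer-groups --describe:

```python
def lag_asymmetry(partition_lags: dict[int, int]) -> float:
    """max/avg lag across partitions; > 5 is the investigate threshold above."""
    avg = sum(partition_lags.values()) / len(partition_lags)
    return max(partition_lags.values()) / avg if avg else 0.0

# Seven healthy partitions and one hot one (hypothetical numbers):
lags = {p: 100_000 for p in range(8)}
lags[7] = 10_000_000
```

Here lag_asymmetry(lags) comes out around 7.5, well over the threshold, even though the total lag might still look unremarkable in an aggregate dashboard.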

Mitigation Strategies

Add a salt to the key. If customer_id = "ACME_CORP" generates 50% of your traffic, hash customer_id + random_salt(0..N) to spread ACME’s events across N+1 partitions. The trade-off is that you lose per-entity ordering for that customer. If ordering matters, use a deterministic salt like customer_id + (sequence_number % N), which preserves ordering within each of the N salt buckets.
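A sketch of the deterministic-salt variant (the "#" separator and N=4 are arbitrary illustrative choices):

```python
def salted_key(customer_id: str, sequence_number: int, num_salts: int = 4) -> str:
    # Same (customer, seq % N) bucket -> same key -> same partition,
    # so ordering survives within each of the N buckets.
    return f"{customer_id}#{sequence_number % num_salts}"

# One hot customer now spreads over num_salts distinct keys:
keys = {salted_key("ACME_CORP", seq) for seq in range(1000)}
```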

Increase partition count. More partitions means the hash function has more buckets to distribute across. This helps when the skew is moderate (2-3x), but does not fix extreme skew (100x) since the same key still hashes to the same partition.

Separate hot entities. Route the high-volume entity to its own dedicated topic with its own consumer group. This isolates the hot traffic so it does not affect processing of other entities. You can configure this routing at the producer level or use Streamkap’s topic routing rules to split traffic based on field values.

Repartition by a different key. If the key is inherently skewed (like country_code), switch to a higher-cardinality key and use Flink to re-aggregate by the original key downstream. This spreads the Kafka-level load while still producing the grouping you need for analytics.

Partition Evolution

Your partitioning strategy will need to change as your data grows. Kafka makes it easy to increase partition count (just call kafka-topics --alter), but there is a catch: existing keys may hash to different partitions after the change. New records for an entity can land on a different partition than that entity’s earlier records, so per-entity ordering cannot be assumed across the transition.
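The remapping is easy to demonstrate with a modulo-based assignment (crc32 again as an illustrative stand-in for Kafka's hash):

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # hash mod partition count: changing the count changes the mapping
    return zlib.crc32(key.encode()) % num_partitions

keys = [str(i) for i in range(1_000)]
moved = sum(1 for k in keys if partition_for(k, 12) != partition_for(k, 16))
```

With this stand-in hash, roughly three quarters of the keys land on a different partition after growing the topic from 12 to 16 partitions, which is why per-entity ordering breaks at the boundary.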

For warehouse tables using Iceberg, partition evolution is a first-class feature. You can change from daily to hourly partitioning without rewriting existing data:

ALTER TABLE orders SET PARTITION SPEC (hours(event_time));

New data uses the hourly spec while old data retains the daily spec. Iceberg handles this transparently at query time.

Plan for partition evolution from day one. Document your current partitioning strategy, the assumptions behind it, and the conditions under which it should change (e.g., “if any single partition exceeds 50 MB/s sustained, repartition the topic”). This turns a future emergency into a planned operation.

Tying It Together

Good partitioning is about matching the data organization to the access pattern at each layer. Kafka partitions optimize for parallelism and ordering. Warehouse partitions optimize for query pruning. Flink repartitioning bridges the gap when the two do not align.

The practical workflow for a Streamkap-based pipeline looks like this:

  1. Source CDC: Streamkap partitions by primary key automatically, preserving per-row ordering.
  2. Stream processing: Flink repartitions by whatever key your aggregation or join requires.
  3. Sink to warehouse: Streamkap writes in time-ordered batches that align with the warehouse’s natural clustering.
  4. Monitor: Track per-partition lag to catch skew early, before it becomes an outage.

The key takeaway is that partitioning is not a one-time decision. It is a layered strategy that evolves with your data volume, query patterns, and processing topology. Design each layer’s partitioning independently, monitor for skew, and plan for evolution.