Migrating from Spark Structured Streaming to Apache Flink
A practical guide to migrating from Spark Structured Streaming to Apache Flink. Covers API differences, state migration challenges, checkpoint incompatibility, and a parallel running strategy.
If you have been running Spark Structured Streaming in production, you already know it works. Your jobs process data, your dashboards update, and your team has built muscle memory around the Spark ecosystem. So why would you consider moving to Apache Flink?
The short answer: latency. Spark Structured Streaming processes data in micro-batches, so end-to-end latency is bounded below by the trigger interval plus batch scheduling overhead: hundreds of milliseconds at best in practice, and often several seconds. Flink processes records one at a time as they arrive. For workloads where seconds matter, like fraud detection, inventory updates, or real-time personalization, that difference is significant.
This guide walks through the practical steps of migrating from Spark Structured Streaming to Flink, including the gotchas that documentation tends to gloss over.
Understanding the Execution Model Differences
Before writing a single line of Flink code, you need to internalize how differently the two engines think about streaming.
Micro-Batch vs Record-at-a-Time
Spark Structured Streaming treats a stream as a series of small batch jobs. Every trigger interval, it reads a chunk of new data from the source, processes it as a DataFrame, and writes the results. This design is elegant because it reuses the entire Spark SQL engine, but it introduces a floor on your latency.
Flink, on the other hand, builds a dataflow graph at job submission time. Records flow through operators continuously. There is no trigger interval, no micro-batch boundary, and no periodic scheduling overhead. When a record arrives at the source, it immediately begins flowing through your pipeline.
This matters for how you think about your code:
- Spark: Your logic runs “once per batch.” You can reason about it like a batch job that happens to repeat.
- Flink: Your logic runs “once per record.” Operators maintain long-lived state between records.
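To quantify the difference, here is a tiny illustrative sketch (plain Python, no Spark or Flink required; the function name is made up) of the latency floor a micro-batch trigger imposes:

```python
import math

def micro_batch_emit_time(arrival, trigger_interval):
    """A record arriving at `arrival` seconds is not processed until
    the next trigger boundary, so its processing starts there."""
    return math.ceil(arrival / trigger_interval) * trigger_interval

# With a 1-second trigger, a record arriving at t=0.1 waits until t=1.0
# before processing begins; a record-at-a-time engine starts at t=0.1.
print(micro_batch_emit_time(0.1, 1.0))  # 1.0
```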
Time Semantics
Both engines support event time processing, but the implementations differ. In Spark, watermarks are computed per micro-batch based on the maximum event time seen in that batch. In Flink, watermarks are generated continuously and propagated through the operator graph.
If your Spark job uses withWatermark(), you will need to define a WatermarkStrategy in Flink. The concept is the same, but the configuration is different:
// Spark watermark
df.withWatermark("eventTime", "10 seconds")
// Flink watermark equivalent
WatermarkStrategy
    .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
    .withTimestampAssigner((event, timestamp) -> event.getEventTime());
The Flink version gives you more control. You can implement custom watermark generators, handle idle partitions, and align watermarks across parallel subtasks.
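Both snippets configure the same bounded-out-of-orderness idea, which reduces to simple arithmetic: the watermark trails the maximum event time seen by the configured delay. A framework-free sketch (Python for illustration) of how the two engines advance it differently:

```python
DELAY = 10  # seconds of allowed out-of-orderness, as in the snippets above

def bounded_watermark(max_event_time, delay=DELAY):
    # Both engines compute: watermark = max event time seen - delay
    return max_event_time - delay

batch = [100, 107, 103]  # event times arriving in one micro-batch

# Spark-style: one watermark update per micro-batch, after it completes
spark_wm = bounded_watermark(max(batch))

# Flink-style: the watermark can advance after every record
flink_wms, max_seen = [], float("-inf")
for t in batch:
    max_seen = max(max_seen, t)
    flink_wms.append(bounded_watermark(max_seen))

print(spark_wm, flink_wms)  # 97 [90, 97, 97]
```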
Mapping Spark APIs to Flink
The good news: many Spark Structured Streaming operations have direct Flink equivalents. The bad news: the ones that do not are usually the stateful ones, and those are the hardest to migrate.
DataFrame Operations to Flink SQL
If your Spark jobs are mostly SQL-like transformations (filters, joins, aggregations), Flink SQL is your best friend. The syntax is nearly identical:
-- Spark SQL (Structured Streaming uses the window() function)
SELECT user_id, COUNT(*) AS event_count
FROM events
WHERE event_type = 'click'
GROUP BY user_id, window(event_time, '5 minutes')
-- Flink SQL (same shape, different window function)
SELECT user_id, COUNT(*) AS event_count
FROM events
WHERE event_type = 'click'
GROUP BY user_id, TUMBLE(event_time, INTERVAL '5' MINUTE)
Flink SQL supports tumbling windows, sliding windows, session windows, and cumulative windows. If you are using Spark’s groupBy().agg() pattern, the translation is usually straightforward.
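The TUMBLE call above is just bucketing arithmetic. As a framework-free sketch (Python for illustration), a tumbling window assignment looks like this:

```python
def tumble_window(event_time, size):
    """Assign an event timestamp to its tumbling window [start, end)."""
    start = event_time - (event_time % size)
    return (start, start + size)

# 5-minute (300-second) windows
print(tumble_window(721, 300))  # (600, 900)
print(tumble_window(299, 300))  # (0, 300)
```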
Stateful Operations
This is where things get harder. In Spark, mapGroupsWithState and flatMapGroupsWithState let you maintain custom state per key. In Flink, the equivalents are KeyedProcessFunction and KeyedCoProcessFunction.
Here is a simplified comparison for a session tracking use case:
// Spark: flatMapGroupsWithState
def updateSessionState(
    key: String,
    events: Iterator[Event],
    state: GroupState[SessionInfo]
): Iterator[SessionOutput] = {
  val current = state.getOption.getOrElse(SessionInfo.empty)
  val updated = events.foldLeft(current)(_.update(_))
  state.update(updated)
  if (updated.isExpired) {
    state.remove()
    Iterator(updated.toOutput)
  } else Iterator.empty
}
// Flink: KeyedProcessFunction
public class SessionFunction
    extends KeyedProcessFunction<String, Event, SessionOutput> {

  private ValueState<SessionInfo> sessionState;

  @Override
  public void open(Configuration params) {
    sessionState = getRuntimeContext().getState(
        new ValueStateDescriptor<>("session", SessionInfo.class));
  }

  @Override
  public void processElement(Event event, Context ctx, Collector<SessionOutput> out)
      throws Exception {
    SessionInfo current = sessionState.value();
    if (current == null) current = SessionInfo.empty();
    current.update(event);
    sessionState.update(current);
    // Register a timer for session expiration
    ctx.timerService().registerEventTimeTimer(
        event.getEventTime() + SESSION_TIMEOUT);
  }

  @Override
  public void onTimer(long timestamp, OnTimerContext ctx, Collector<SessionOutput> out)
      throws Exception {
    SessionInfo session = sessionState.value();
    if (session != null && session.isExpired(timestamp)) {
      out.collect(session.toOutput());
      sessionState.clear();
    }
  }
}
The Flink version is more verbose, but it gives you explicit control over timers, which is something Spark’s GroupState handles implicitly through timeouts.
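Stripped of both APIs, the session logic the two versions share is small. This framework-free sketch (Python for illustration; the 30-minute timeout is an assumed value) shows the state machine you are actually migrating:

```python
SESSION_TIMEOUT = 30 * 60  # seconds; an assumed timeout value

class Session:
    """Toy stand-in for SessionInfo: count events, track last activity."""
    def __init__(self):
        self.count = 0
        self.last_seen = None

    def update(self, event_time):
        self.count += 1
        self.last_seen = event_time

    def is_expired(self, now):
        return now - self.last_seen >= SESSION_TIMEOUT

# Both versions reduce to: fold events into state, emit when expired.
session = Session()
for t in [0, 60, 120]:
    session.update(t)

print(session.count)                              # 3
print(session.is_expired(180))                    # False: still active
print(session.is_expired(120 + SESSION_TIMEOUT))  # True: timer would fire
```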
Source and Sink Connectors
Both engines have Kafka connectors, but they are configured differently:
// Spark Kafka source
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
// Flink Kafka source
KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers("broker:9092")
    .setTopics("events")
    .setGroupId("flink-consumer")
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build();
For sinks, Flink’s KafkaSink supports exactly-once semantics through Kafka transactions and two-phase commit. Note that Spark’s built-in kafka sink provides at-least-once delivery even with a checkpointLocation, so if you rely on downstream deduplication today, Flink’s EXACTLY_ONCE guarantee may let you drop it, at the cost of transactional commit latency.
The Checkpoint Problem
This is the single biggest technical challenge of the migration. Spark checkpoints and Flink savepoints are not compatible. Period. There is no conversion tool, no adapter, no workaround that lets you take a Spark checkpoint directory and feed it to a Flink job.
Why They Are Incompatible
Spark stores checkpoint data as a combination of offset metadata (which source positions have been committed) and state store snapshots (the contents of your GroupState or aggregation buffers). These use Spark’s internal serialization format and are tied to the specific Spark version and state store implementation.
Flink stores savepoints as snapshots of its distributed state backends (typically RocksDB or the heap-based backend). The format includes operator IDs, key groups, and serialized state, all specific to the Flink runtime.
State Migration Strategy
If your Spark job is stateless (pure transformations, filters, projections), you are in luck. Just start the Flink job from the latest Kafka offsets, and you are done.
If your job is stateful, you have a few options:
Option 1: Accept a brief inconsistency window. Start the Flink job from current offsets and let it rebuild state from incoming data. For use cases like windowed aggregations, you will have incomplete windows for a short period.
Option 2: Export state to an intermediate store. Before switching, have your Spark job write its current state to a Kafka topic or database table. Because Flink keyed state can only be written while processing a record for the matching key (not in open()), the Flink job bootstraps lazily, loading the exported state the first time it sees each key:
// Inside processElement() of the KeyedProcessFunction
if (isBootstrapping && sessionState.value() == null) {
    // loadFromDatabase() is a hypothetical helper that reads the exported state
    SessionInfo restored = loadFromDatabase(ctx.getCurrentKey());
    if (restored != null) {
        sessionState.update(restored);
    }
}
For large state, Flink's State Processor API can instead build a savepoint offline from the exported data, so the job starts fully warmed.
Option 3: Replay from source. If your Kafka retention is long enough, start the Flink job from the earliest offset and let it rebuild state by processing the full history. This works well for compact topics.
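Option 3 only works if your logic is deterministic: replaying the same history must rebuild exactly the state the old job held. A framework-free sketch (Python for illustration, with a toy per-key count standing in for your real update function):

```python
def update(state, key):
    # Toy per-key count standing in for your real state update logic;
    # the point is that it is deterministic.
    state[key] = state.get(key, 0) + 1
    return state

history = ["a", "b", "a", "c", "a"]  # full retained Kafka history

incremental = {}  # state the old job built up over time
for k in history:
    incremental = update(incremental, k)

replayed = {}     # state a fresh job rebuilds by replaying from earliest
for k in history:
    replayed = update(replayed, k)

print(incremental == replayed)  # True: replay reproduces the state
```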
The Parallel Running Strategy
Do not flip a switch from Spark to Flink overnight. Instead, run both systems in parallel. This is the safest approach and the one I recommend for any production migration.
Setting Up Dual Processing
The idea is simple: both your Spark job and your Flink job read from the same Kafka topics. Each writes to separate output destinations. You then compare the outputs to verify correctness.
                 ┌──────────────┐      ┌──────────────┐
                 │  Spark Job   │─────▶│ output_spark │
┌──────────┐     │  (existing)  │      └──────────────┘
│  Kafka   │────▶├──────────────┤
│  Topics  │     │  Flink Job   │      ┌──────────────┐
└──────────┘     │  (new)       │─────▶│ output_flink │
                 └──────────────┘      └──────────────┘
Make sure each system uses its own consumer group so they track offsets independently.
Validation Checklist
During the parallel run, check the following:
- Row counts: Do both systems produce the same number of output records over the same time window?
- Value correctness: For a sample of keys, do the aggregated values match?
- Latency: Measure end-to-end latency for both systems. The Flink job should consistently be faster.
- Resource usage: Compare CPU, memory, and network consumption. Flink may use less or more depending on your parallelism settings.
- Failure recovery: Kill a task manager or executor and observe recovery behavior.
Run the parallel setup for at least two weeks before cutting over. If your workload has monthly patterns, consider running longer.
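The row-count and value checks above can be automated with a small comparison script. This sketch (Python for illustration; the data shapes and names are assumptions) compares the two output tables:

```python
def validate(spark_out, flink_out, sample_keys):
    """Compare the two output tables: total counts plus sampled values."""
    return {
        "row_count_match": len(spark_out) == len(flink_out),
        "sampled_values_match": all(
            spark_out.get(k) == flink_out.get(k) for k in sample_keys
        ),
    }

spark_out = {"user1": 42, "user2": 7, "user3": 13}
flink_out = {"user1": 42, "user2": 7, "user3": 13}
print(validate(spark_out, flink_out, ["user1", "user3"]))
# {'row_count_match': True, 'sampled_values_match': True}
```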
Cutover Steps
Once you are confident in the Flink output:
- Stop writing to the Spark output destination.
- Point downstream consumers to the Flink output.
- Keep the Spark job running (but not writing) for another week as a safety net.
- Decommission the Spark job.
Common Gotchas
After helping teams through this migration several times, I keep seeing the same issues catch people off guard.
Serialization Differences
Spark uses its own internal serialization for shuffles and checkpoints. Flink defaults to its own type serialization system and can also use Kryo or Avro. If you have custom types, you will need to register serializers in Flink:
env.getConfig().registerTypeWithKryoSerializer(
    MyCustomType.class, MyCustomSerializer.class);
Forgetting this step leads to mysterious runtime errors that are hard to debug.
Parallelism Is Not the Same as Partitions
In Spark, the number of shuffle partitions (spark.sql.shuffle.partitions) determines parallelism for aggregations. In Flink, parallelism is set per operator or globally and maps directly to the number of task slots used. A common mistake is setting Flink parallelism equal to your Spark shuffle partition count, which is almost always wrong. Start with the number of Kafka partitions and adjust from there.
Window Semantics Can Surprise You
Spark’s window() function in Structured Streaming creates tumbling or sliding windows, but the output behavior depends on the output mode (append, update, complete). Flink windows emit results when the watermark passes the window end time, and late data handling is configured explicitly:
stream
  .keyBy(Event::getUserId)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .allowedLateness(Time.minutes(1))
  .sideOutputLateData(lateOutputTag)
  .aggregate(new MyAggregateFunction());
Pay close attention to how you handle late data. The default behavior differs between the two engines.
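As a mental model for Flink’s side of this, the following simplified sketch (Python for illustration; it ignores per-record window reassignment subtleties) classifies where a record lands relative to a window that may already have fired:

```python
def classify(event_time, window_end, watermark, allowed_lateness):
    """Simplified view of where a record lands for a [.., window_end) window."""
    if event_time >= window_end:
        return "on_time"      # belongs to a later window
    if watermark < window_end:
        return "on_time"      # window has not fired yet
    if watermark < window_end + allowed_lateness:
        return "late_refire"  # window re-emits an updated result
    return "side_output"      # captured only if sideOutputLateData is set

# Window [0, 300) with 60 seconds of allowed lateness:
print(classify(250, 300, 290, 60))  # on_time
print(classify(250, 300, 330, 60))  # late_refire
print(classify(250, 300, 400, 60))  # side_output
```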
Dependency Management
Spark jobs typically run on a cluster with a pre-installed Spark distribution. Flink jobs are usually packaged as fat JARs that include all dependencies. If you are using libraries that depend on Spark internals, they will not work in Flink. Audit your dependency tree early.
Operational Differences
Beyond the code, your operations playbook needs to change.
Monitoring
Spark has the Spark UI with its familiar stages, tasks, and DAG visualization. Flink has its own web UI showing the job graph, task managers, checkpointing status, and backpressure indicators. Your team will need to learn what healthy looks like in Flink’s metrics.
Key Flink metrics to watch:
- numRecordsInPerSecond / numRecordsOutPerSecond per operator
- checkpointDuration and checkpointSize
- isBackPressured on each task
Deployment
Spark jobs are typically submitted via spark-submit. Flink jobs can be submitted via the CLI (flink run), the REST API, or through a managed platform. If you are running on Kubernetes, Flink has a native Kubernetes operator that handles job lifecycle management.
For teams that do not want to manage Flink infrastructure directly, Streamkap offers managed Flink as part of its streaming platform. This eliminates the need to operate Flink clusters, configure checkpointing storage, or handle upgrades, letting you focus on the application logic rather than the infrastructure.
Upgrading Stateful Jobs
In Spark, upgrading a stateful streaming job usually means stopping the old job and starting the new one from the same checkpoint directory. In Flink, you take a savepoint of the running job, stop it, and restart the new version from that savepoint. The catch: your new job’s operator graph must be compatible with the savepoint. Adding or removing stateful operators can break savepoint compatibility if you do not assign stable UIDs to your operators:
stream
  .keyBy(Event::getUserId)
  .process(new SessionFunction())
  .uid("session-tracker") // Always assign UIDs to stateful operators
  .name("Session Tracker");
Forgetting .uid() is one of the most common Flink mistakes and will prevent you from restoring from savepoints after code changes.
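The reason stable UIDs matter is that savepoint state is addressed by operator ID. This toy sketch (Python for illustration; the IDs and state contents are made up) shows why an auto-generated ID breaks restore after a refactor:

```python
# Savepoint state is addressed by operator ID. An explicit .uid() keeps
# the ID stable; otherwise Flink derives it from the job graph structure.
savepoint = {"session-tracker": {"user1": "state-blob"}}  # made-up contents

def restore(savepoint, operator_id):
    if operator_id not in savepoint:
        raise RuntimeError(f"no state for operator {operator_id}")
    return savepoint[operator_id]

# Stable UID: restore still works after refactoring the job.
print(restore(savepoint, "session-tracker"))

# Auto-generated ID changed by the refactor: restore fails.
try:
    restore(savepoint, "auto-id-after-refactor")
except RuntimeError as e:
    print(e)
```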
When to Stay on Spark
Not every workload benefits from this migration. If your Spark Structured Streaming jobs are processing data with trigger intervals of 30 seconds or more and your business does not need lower latency, the migration cost may not be justified. Spark also has a stronger ecosystem for machine learning (MLlib) and graph processing (GraphX) if those are part of your pipeline.
The strongest case for migrating is when you need true low-latency processing (sub-second), when you have complex event processing requirements, or when your team is already investing in a Kafka-centric architecture where Flink is a natural fit.
If you are evaluating whether to self-manage Flink or use a managed service, Streamkap provides managed Flink alongside managed Kafka and real-time CDC connectors, giving you a single platform for your entire streaming pipeline without the operational overhead of running Flink yourself.