
Engineering

February 25, 2026

8 min read

Flink Savepoints vs Checkpoints: When to Use Each

Understand the difference between Flink savepoints and checkpoints. Learn when to use each for upgrades, migrations, scaling, and disaster recovery in production.

TL;DR:

  • Checkpoints are automatic, periodic, and managed by Flink for fault tolerance - they are created and deleted without user intervention.
  • Savepoints are manual, user-triggered snapshots used for planned operations like job upgrades, scaling changes, and migrations - they persist until explicitly deleted.
  • Use checkpoints for crash recovery and savepoints for planned maintenance - they share the same underlying format but serve different operational purposes.

Apache Flink provides two snapshot mechanisms for capturing the state of a running job: checkpoints and savepoints. Both produce a consistent image of operator state, keyed state, and in-flight records at a single point in time. Despite sharing the same underlying snapshot format, they exist for fundamentally different reasons. Checkpoints keep your job alive when things go wrong. Savepoints let you change your job when things are going right. Understanding when and how to use each one is essential for operating Flink in production.

Checkpoints: Automatic Fault Tolerance

Checkpoints are Flink’s built-in mechanism for crash recovery. When you enable checkpointing on a job, the JobManager periodically triggers a snapshot of the entire distributed state. Each TaskManager writes its local state to durable storage (typically an object store like S3 or HDFS), and Flink coordinates the process with checkpoint barriers that flow through the dataflow - an asynchronous variant of the Chandy-Lamport snapshot algorithm that minimizes impact on throughput.

Key characteristics of checkpoints:

  • Automatic and periodic. Flink creates them on a configurable interval (e.g., every 60 seconds) without any manual intervention.
  • Managed lifecycle. Flink controls creation, retention, and deletion. Old checkpoints are garbage-collected once newer ones complete. You configure how many recent checkpoints to retain, but you do not manage individual files.
  • Incremental support. With the RocksDB state backend, checkpoints can be incremental - only writing state that changed since the last checkpoint. This dramatically reduces checkpoint size and I/O for large-state jobs.
  • Tied to a running job. By default, checkpoints are deleted when a job is canceled. You can configure externalized checkpoints with a retention policy (RETAIN_ON_CANCELLATION) to keep them, but this is opt-in.

Enable checkpointing in your job with a few lines of configuration:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60000); // every 60 seconds
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30000);
env.getCheckpointConfig().setCheckpointTimeout(120000);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
env.getCheckpointConfig().setExternalizedCheckpointCleanup(
    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
);

When a TaskManager crashes, the JobManager restores the job from the latest completed checkpoint. The process is fully automatic - operators reload their state, Kafka offsets rewind to the checkpointed position, and processing resumes with exactly-once guarantees. No human intervention is required.
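How aggressively Flink retries is governed by the restart strategy. As a sketch, a fixed-delay strategy in flink-conf.yaml might look like this (the attempt count and delay are illustrative):

```yaml
# Illustrative flink-conf.yaml fragment: retry a failed job three times,
# ten seconds apart, restoring from the latest checkpoint on each attempt.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```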

Savepoints: Manual Snapshots for Planned Operations

Savepoints are user-triggered snapshots. They capture the same state as a checkpoint - operator state, keyed state, and connector offsets - but they are created on demand and persist until you explicitly delete them. Flink never garbage-collects a savepoint.

Key characteristics of savepoints:

  • Manually triggered. You decide when to create one. Flink will not create savepoints on its own.
  • Persistent by default. A savepoint remains in storage indefinitely. You are responsible for cleanup.
  • Portable format. By default, savepoints use the canonical (non-incremental) state format, making them portable across state backends, cluster configurations, and even Flink versions (within compatibility bounds). Since Flink 1.15 a faster, backend-specific native format is also available, at the cost of that portability.
  • Independent of the running job. A savepoint is a standalone artifact. You can use it to start a new job, restore an old job, or fork a job into two parallel instances.

Savepoints are the mechanism you reach for whenever you need to make a deliberate change to a running pipeline - upgrading code, changing parallelism, migrating clusters, or testing a new version side-by-side with the old one.

Comparison Table

| Feature | Checkpoint | Savepoint |
| --- | --- | --- |
| Trigger | Automatic (periodic) | Manual (user-initiated) |
| Purpose | Crash recovery | Planned operations |
| Lifecycle | Managed by Flink (auto-deleted) | Managed by user (persists until deleted) |
| Format | May be incremental | Canonical (full) by default |
| Portability | Tied to state backend and directory | Portable across backends and clusters |
| Creation cost | Low (incremental) | Higher (full snapshot) |
| Retention | Configurable count; garbage-collected | Indefinite; explicit deletion required |
| Flink version upgrades | Not guaranteed across versions | Designed for cross-version compatibility |

When to Use Checkpoints

Checkpoints are the right tool for everything that happens without your involvement:

  • Crash recovery. A node dies, a container is evicted, or a network partition occurs. Flink automatically restores from the latest checkpoint.
  • Transient failures. Out-of-memory errors, temporary source unavailability, and other recoverable issues are handled through checkpoint-based restart strategies.
  • Continuous exactly-once processing. Checkpoints maintain the consistency boundary across sources, operators, and sinks throughout normal operation.

You do not need to think about checkpoints during normal operations. Configure them once, monitor checkpoint duration and size in the Flink dashboard, and let the framework handle the rest.

When to Use Savepoints

Savepoints are the right tool for everything you plan in advance:

  • Job upgrades. Deploying a new version of your application code while preserving accumulated state (window contents, counters, aggregations).
  • Flink version upgrades. Moving from one Flink minor version to another (e.g., 1.17 to 1.18) while retaining state.
  • Scaling changes. Increasing or decreasing parallelism to match changed throughput requirements. Savepoints allow state to be redistributed across a different number of parallel instances.
  • Cluster migrations. Moving a job from one Kubernetes cluster, cloud region, or cloud provider to another.
  • A/B testing. Forking a single savepoint into two job instances running different code paths against the same state baseline.
  • Pre-change backups. Creating a known-good snapshot before any risky configuration or schema change so you can roll back if something goes wrong.

Savepoint Operations

Triggering a Savepoint

Use the Flink CLI to trigger a savepoint against a running job:

# Trigger a savepoint and write it to a target directory
flink savepoint <jobId> s3://my-bucket/savepoints/

# Trigger a savepoint and stop the job (cancel-with-savepoint)
flink stop --savepointPath s3://my-bucket/savepoints/ <jobId>

The stop command is the standard way to gracefully shut down a job for a planned upgrade. It triggers a savepoint, waits for it to complete, and then cancels the job, so all in-flight records are processed before the snapshot is taken. Add the --drain flag only when the job will never be resumed: it advances the watermark to its maximum value on shutdown, firing all registered event-time timers and flushing windowed state on the way out.

Restoring from a Savepoint

Start a job from a savepoint by passing the savepoint path at submission time:

flink run -s s3://my-bucket/savepoints/savepoint-abc123 my-job.jar

Flink will map the state in the savepoint to operators in the new job graph using operator UIDs. If the mapping succeeds, the job starts processing from exactly where it left off.
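The A/B-testing fork mentioned earlier is just two submissions from the same savepoint. A minimal sketch, assuming the flink CLI is on PATH and using detached mode (-d) so both jobs run concurrently; the helper name and paths are illustrative:

```shell
# Hypothetical helper: submit several jars, each restored from the same
# savepoint, in detached mode so they run side by side.
fork_from_savepoint() {
  sp="$1"; shift
  for jar in "$@"; do
    flink run -d -s "$sp" "$jar"
  done
}

# Example (paths are placeholders):
# fork_from_savepoint s3://my-bucket/savepoints/savepoint-abc123 job-v1.jar job-v2.jar
```

Point the forked jobs at separate sinks and consumer group IDs, otherwise both instances will write the same results downstream.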

Disposing of a Savepoint

When you no longer need a savepoint, remove it explicitly:

flink savepoint --dispose s3://my-bucket/savepoints/savepoint-abc123

This deletes the savepoint metadata and state files from storage. Failing to clean up old savepoints is a common source of unbounded storage growth in long-running production environments.
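Cleanup can be automated. A minimal sketch for filesystem-backed savepoints (the root path and retention count are assumptions; for S3-backed savepoints you would list and delete via the aws CLI instead):

```shell
# Hypothetical retention helper: delete all but the $2 newest
# savepoint directories under root $1, ordered by modification time.
prune_savepoints() {
  root="$1"
  keep="$2"
  ls -1dt "$root"/savepoint-* 2>/dev/null | tail -n +"$((keep + 1))" | while read -r dir; do
    echo "removing $dir"
    rm -rf "$dir"
  done
}
```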

State Compatibility and UID Discipline

Savepoint restoration depends on Flink’s ability to match stored state to operators in the new job graph. This matching is done by operator UID, not by position in the graph. If you do not assign UIDs explicitly, Flink generates them automatically based on the topological order of operators - and those generated UIDs will change whenever you add, remove, or reorder operators.

Always assign stable UIDs to every stateful operator:

// Sketch: kafkaSource is assumed to be built elsewhere via
// KafkaSource.builder(); insertSql, statementBuilder, and
// connectionOptions are placeholders for the JdbcSink.sink(...) arguments.
env.fromSource(kafkaSource, WatermarkStrategy.noWatermarks(), "Kafka Source")
    .uid("kafka-source")
    .keyBy(Event::getUserId)
    .process(new SessionProcessor())
    .uid("session-processor")
    .name("Session Processor")
    .addSink(JdbcSink.sink(insertSql, statementBuilder, connectionOptions))
    .uid("jdbc-sink")
    .name("JDBC Sink");
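As a guardrail, Flink can be configured to reject jobs that rely on auto-generated UIDs, so a missing .uid() fails at submission time rather than at restore time:

```yaml
# flink-conf.yaml: fail job construction when any operator lacks an
# explicit uid, instead of silently falling back to generated hashes.
pipeline.auto-generated-uids.enabled: false
```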

With explicit UIDs, you can safely make the following changes between savepoint and restore:

  • Add new operators. New operators start with empty state.
  • Remove operators. Use --allowNonRestoredState (or -n) to skip state for operators that no longer exist.
  • Change parallelism. Flink redistributes keyed state and list state across the new parallel instances.
  • Modify operator logic. As long as the state serializers remain compatible (same schema or a supported schema evolution), the state loads correctly.

Changes that break savepoint compatibility include renaming a UID, changing the state type in an incompatible way (e.g., switching from ValueState<String> to ValueState<Integer> without a migration path), or removing a UID assignment entirely.

Here is a step-by-step process for upgrading a running Flink job using savepoints:

  1. Verify UID coverage. Confirm that every stateful operator in both the current and new job versions has an explicit .uid() assigned. This is the single most important prerequisite.

  2. Take a savepoint and stop the job.

    flink stop --savepointPath s3://my-bucket/savepoints/ <jobId>
    
  3. Validate the savepoint. Check the Flink logs or the REST API to confirm the savepoint completed successfully and note the savepoint path.

  4. Deploy the new job JAR. Upload the updated application artifact to your cluster or container registry.

  5. Restore from the savepoint.

    flink run -s s3://my-bucket/savepoints/savepoint-abc123 new-job.jar
    
  6. Monitor the restored job. Watch the Flink dashboard for checkpoint completion, throughput recovery, and any state restoration errors. Confirm that consumer lag on your source topics is decreasing.

  7. Clean up. Once the new job is stable and a fresh checkpoint has completed, dispose of the savepoint to reclaim storage.

With transactional or idempotent sinks, this workflow upgrades with zero data loss: the job processes every record exactly once across the version boundary.
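Steps 2 and 5 above can be sketched as a single script. This assumes the flink CLI is on PATH and that stop prints a "Savepoint completed. Path: ..." line, as recent Flink versions do; the job ID, jar, and target directory are placeholders:

```shell
# Hypothetical upgrade helper: stop with a savepoint, capture the
# reported savepoint path, then restore the new jar from it.
upgrade_with_savepoint() {
  job_id="$1"
  new_jar="$2"
  target_dir="$3"

  # Grab the first token containing "savepoint-" from the stop output.
  savepoint=$(flink stop --savepointPath "$target_dir" "$job_id" \
    | grep -o '[^ "]*savepoint-[^ "]*' | head -n 1)
  [ -n "$savepoint" ] || { echo "savepoint trigger failed" >&2; return 1; }

  flink run -s "$savepoint" "$new_jar"
}
```

In practice you would run steps 3 and 6 (validation and monitoring) between the two CLI calls rather than chaining them blindly.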

Best Practices

Assign UIDs from day one. Retrofitting UIDs onto an existing job that was deployed without them is painful. State from the old auto-generated UIDs cannot be mapped to the new explicit UIDs, meaning you lose accumulated state. Make .uid() assignment a mandatory part of your code review checklist.

Take a savepoint before every planned change. Whether you are changing one line of business logic or upgrading the Flink version, a savepoint is your rollback point. The cost of taking one is measured in seconds; the cost of not having one is measured in hours of reprocessing.

Configure checkpoint retention. Keep at least 2-3 recent externalized checkpoints with RETAIN_ON_CANCELLATION. This gives you a recovery option even if a savepoint trigger fails unexpectedly.
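In config form, that retention policy might look like this (key names from recent Flink versions; the count is illustrative):

```yaml
# flink-conf.yaml: keep the three most recent completed checkpoints and
# retain them when the job is canceled.
state.checkpoints.num-retained: 3
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
```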

Implement savepoint retention policies. Savepoints accumulate indefinitely. Establish a policy - for example, keep the last 3 savepoints per job and delete older ones automatically as part of your CI/CD pipeline.

Monitor checkpoint health continuously. Growing checkpoint durations, increasing checkpoint sizes, or frequent checkpoint failures are early indicators of state management problems that will eventually affect savepoint operations too. Treat checkpoint metrics as leading indicators.

Test savepoint restore in staging. Before upgrading production, restore the production savepoint into a staging environment running the new code. This catches state compatibility issues before they affect your live pipeline.

Platforms like Streamkap handle checkpoint and savepoint lifecycle automatically for managed Flink pipelines. When you trigger a pipeline upgrade or scaling change through the platform, Streamkap takes a savepoint, stops the job, deploys the new configuration, and restores from the savepoint - removing the need to manage these operations manually. For teams that want the power of Flink’s state management without the operational overhead, a managed approach eliminates an entire category of production risk.