Flink Job Monitoring: Key Metrics and Alerting Strategies
Learn which Flink metrics to monitor in production - throughput, latency, checkpoints, state size, and backpressure. Build dashboards and alerts that catch issues before they become outages.
Introduction: Why Monitoring Is Not Optional
A Flink job that runs without monitoring is a time bomb. It might process millions of records per second today and silently fall behind tomorrow. Without metrics, you will not know until downstream consumers complain, dashboards go stale, or data is simply missing.
Production Flink deployments demand the same observability rigor you would apply to any stateful distributed system. The challenge is that Flink exposes hundreds of metrics across the JobManager, TaskManagers, operators, and connectors. Not all of them matter equally. Knowing which metrics to watch, how to visualize them, and when to alert is what separates a resilient pipeline from one that wakes you up at 3 AM.
This guide covers the five metric categories that matter most for production Flink jobs, walks through a Prometheus and Grafana setup, and provides an alerting strategy that catches real problems without drowning you in noise.
Throughput Metrics
Throughput is the most intuitive health signal. If records stop flowing, something is wrong.
Key Metrics
- numRecordsInPerSecond - Records entering an operator per second
- numRecordsOutPerSecond - Records leaving an operator per second
- numBytesInPerSecond / numBytesOutPerSecond - Byte-level throughput
What to Watch
Monitor throughput at the operator level, not just the job level. A job-level aggregate can mask a single parallelism instance that has stalled. If one subtask of a KeyedProcess operator drops to zero records out while the others are healthy, you have a data skew or partition problem.
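A quick way to surface a stalled subtask is to chart the minimum per-subtask rate instead of the sum. A sketch, assuming the default Prometheus reporter metric and label names (adjust to match your reporter configuration):

```promql
# Lowest per-subtask output rate for each operator. A stalled subtask
# shows up as 0 here even while the job-level sum still looks healthy.
min by (task_name) (
  flink_taskmanager_job_task_numRecordsOutPerSecond
)
```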
The ratio between records in and records out for a given operator is also telling. A filter operator with a 99.9% drop rate might be correct, or it might indicate a schema change upstream that is causing every record to fail a predicate.
# Per-operator throughput across all subtasks
sum by (task_name) (
flink_taskmanager_job_task_numRecordsInPerSecond
)
Compare current throughput against a baseline. A CDC pipeline reading from a transactional database will have natural peaks and valleys that follow application traffic patterns. Use a 7-day rolling comparison rather than a static threshold to detect meaningful deviations.
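That rolling comparison can be expressed with a PromQL subquery, again assuming the reporter's default metric names and at least 7 days of Prometheus retention:

```promql
# Current throughput as a fraction of the trailing 7-day average.
# Values well below 1 indicate a meaningful deviation from baseline.
sum(flink_taskmanager_job_task_numRecordsInPerSecond)
  /
avg_over_time(sum(flink_taskmanager_job_task_numRecordsInPerSecond)[7d:5m])
```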
Checkpoint Metrics
Checkpoints are the backbone of Flink’s exactly-once guarantees. A failing checkpoint means your job cannot recover to a consistent state if it crashes.
Key Metrics
- lastCheckpointDuration - How long the most recent checkpoint took (ms)
- lastCheckpointSize - Size of the most recent checkpoint (bytes)
- lastCheckpointAlignmentDuration - Time subtasks spent waiting for barrier alignment
- numberOfCompletedCheckpoints - Total successful checkpoints
- numberOfFailedCheckpoints - Total failed checkpoints
What to Watch
Checkpoint duration is a leading indicator. A checkpoint that normally takes 5 seconds and is now taking 30 seconds is not yet a failure, but it is heading toward one. If checkpoint duration exceeds the checkpoint interval, checkpoints will queue up and eventually time out.
Alignment duration reveals whether barrier alignment is causing latency spikes. In aligned checkpoints (the default for exactly-once), slow subtasks force faster subtasks to buffer data while waiting. If alignment duration is a significant fraction of total checkpoint duration, you may have data skew or a slow disk on one TaskManager.
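One way to watch this is the alignment-to-total ratio. A sketch, assuming the alignment metric is exported with the same job-level scope as the duration metric (verify the exact name your reporter produces):

```promql
# Fraction of checkpoint time spent waiting on barrier alignment.
# A value approaching 1 points at data skew or a slow subtask.
flink_jobmanager_job_lastCheckpointAlignmentDuration
  /
flink_jobmanager_job_lastCheckpointDuration
```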
# Checkpoint duration trend over time
flink_jobmanager_job_lastCheckpointDuration / 1000
A single failed checkpoint is not necessarily an emergency. Network blips happen. But consecutive failures - or a failure rate that is climbing - indicates a systemic issue that requires intervention.
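A PromQL expression that encodes this distinction, tolerating an isolated failure but firing on repeated ones (the window and threshold are illustrative, tune them to your checkpoint interval):

```promql
# More than 2 failed checkpoints in 15 minutes suggests a systemic
# issue rather than a transient network blip.
increase(flink_jobmanager_job_numberOfFailedCheckpoints[15m]) > 2
```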
State Metrics
Flink operators that use keyed state (windows, aggregations, joins) store that state in a state backend. Unbounded state growth is one of the most common causes of Flink job degradation.
Key Metrics
- lastCheckpointSize - Total state size (with incremental checkpoints this reflects the checkpoint delta, so treat it only as a rough proxy for total state)
- RocksDB metrics (when using the RocksDB state backend):
  - rocksdb.estimate-num-keys - Number of keys in the state
  - rocksdb.estimate-live-data-size - Estimated live data size
  - rocksdb.compaction-pending - Pending compactions
  - rocksdb.block-cache-usage - Block cache memory usage
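Note that RocksDB native metrics are disabled by default because each one adds a small collection overhead. A flink-conf.yaml fragment enabling the ones above (option names as of recent Flink releases, verify against your version's documentation):

```yaml
# Enable selected RocksDB native metrics for the state backend
state.backend.rocksdb.metrics.estimate-num-keys: true
state.backend.rocksdb.metrics.estimate-live-data-size: true
state.backend.rocksdb.metrics.compaction-pending: true
state.backend.rocksdb.metrics.block-cache-usage: true
```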
What to Watch
State size should be relatively stable for a well-designed job. A monotonically increasing state size usually means one of these things: a window that never triggers, a join with no eviction policy, or a broadcast state that accumulates records.
Track state size as a time series and set alerts on the rate of change. A CDC pipeline that deduplicates by primary key should have a state size proportional to the number of unique keys in the source table. If state is growing faster than the source table, something is accumulating that should not be.
RocksDB compaction metrics matter when you are running large state. If compaction-pending stays high, RocksDB cannot keep up with writes, and read performance will degrade. This often manifests as increasing checkpoint duration before any other visible symptom.
# State size growth rate (bytes per hour)
deriv(flink_jobmanager_job_lastCheckpointSize[1h]) * 3600
Backpressure Metrics
Backpressure means a downstream operator cannot keep up with the rate at which an upstream operator is sending data. It is the most common cause of throughput drops in Flink jobs.
Key Metrics
- isBackPressured - Boolean flag per subtask (1 = backpressured)
- busyTimeMsPerSecond - Milliseconds per second the operator thread is busy (0-1000)
- backPressuredTimeMsPerSecond - Milliseconds per second the operator is backpressured
- idleTimeMsPerSecond - Milliseconds per second the operator is idle
What to Watch
The key insight with backpressure is that the bottleneck operator is typically not the one showing the backpressure flag. Backpressure propagates upstream. If operator C is slow, operator B will show backpressured, and eventually operator A will too. To find the root cause, look for the first operator in the chain that is NOT backpressured but has a high busyTimeMsPerSecond - that is your bottleneck.
# Find operators with high busy time (potential bottlenecks)
flink_taskmanager_job_task_busyTimeMsPerSecond > 900
A busyTimeMsPerSecond above 900 (90% utilization) is a warning sign. Above 950 and you have almost no headroom for traffic spikes. The fix depends on the bottleneck: increase parallelism, optimize the operator logic, or scale up the downstream system that the sink is writing to.
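The "busy but not backpressured" heuristic can be written as a single query. A sketch, assuming the default Prometheus reporter names; the 900 and 100 thresholds are illustrative:

```promql
# Candidate root-cause bottlenecks: subtasks that are nearly saturated
# but are not themselves waiting on a downstream operator.
flink_taskmanager_job_task_busyTimeMsPerSecond > 900
  unless
flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 100
```

The `unless` operator drops any busy series that also has significant backpressured time, leaving only the operators that are genuinely the slowest link.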
Consumer Lag
For Flink jobs that consume from Kafka (which includes most CDC pipelines), consumer lag is the single most important end-to-end health metric.
Key Metrics
- records-lag-max - Maximum lag in records across all partitions for a consumer group
- committed-offset vs current-offset - Gap between what has been committed back to Kafka and what has been consumed
- Kafka-side consumer group lag - Measured externally via kafka-consumer-groups.sh or Kafka monitoring tools
What to Watch
Consumer lag tells you whether the Flink job is keeping up with the rate of incoming data. A lag of zero means the job is consuming records as fast as they are produced. A steadily growing lag means the job is falling behind.
Short spikes in lag during traffic peaks are normal if the job catches up afterward. Sustained growth is a problem. The time to alert is not when lag exceeds a threshold, but when lag has been growing continuously for N minutes.
# Consumer lag growth rate (records per minute)
deriv(kafka_consumer_records_lag_max[5m]) * 60
For CDC pipelines, commit lag is equally important. Flink might consume records from Kafka quickly, but if it cannot write them to the destination (because the sink is slow or the destination is unavailable), the committed offset will fall behind the consumed offset.
Setting Up Prometheus and Grafana
Flink Configuration
Add the Prometheus metric reporter to flink-conf.yaml:
metrics.reporter.prometheus.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prometheus.port: 9249
# Expose system metrics for JVM monitoring
metrics.system-resource: true
metrics.system-resource-probing-interval: 5000
For Kubernetes deployments, add a pod annotation so Prometheus discovers Flink TaskManagers automatically:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9249"
prometheus.io/path: "/metrics"
Prometheus Scrape Configuration
scrape_configs:
- job_name: 'flink'
scrape_interval: 15s
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: (.+)
replacement: ${1}:9249
For non-Kubernetes deployments, use static targets listing each TaskManager’s hostname and port.
Grafana Dashboard Templates
The Flink community provides a well-maintained Grafana dashboard (ID 14911) that covers the basics. Import it as a starting point, then customize with the panels described in the practical dashboard section below.
Alerting Strategy
Critical Alerts (Page On-Call)
These indicate data loss risk or pipeline failure. They require immediate attention.
| Alert | Condition | For Duration |
|---|---|---|
| Job Not Running | flink_jobmanager_job_uptime == 0 | 2 minutes |
| Checkpoint Failures | increase(numberOfFailedCheckpoints[10m]) > 0 | 10 minutes |
| Consumer Lag Growing | deriv(kafka_consumer_records_lag_max[5m]) > 0 | 5 minutes sustained |
Warning Alerts (Notify Channel)
These are leading indicators that something will become a problem if not addressed.
| Alert | Condition | For Duration |
|---|---|---|
| Checkpoint Duration Increasing | deriv(lastCheckpointDuration[30m]) > 0 | 30 minutes |
| High Backpressure | busyTimeMsPerSecond > 900 | 10 minutes |
| State Size Growing | deriv(lastCheckpointSize[1h]) > threshold | 1 hour |
| Throughput Drop | numRecordsInPerSecond < 0.5 * avg_over_time(numRecordsInPerSecond[7d]) | 15 minutes |
Avoiding Alert Fatigue
The most common mistake is alerting on instantaneous values. A single checkpoint taking longer than usual is noise. A trend of increasing checkpoint duration over 30 minutes is signal. Use deriv() and avg_over_time() in PromQL to detect trends rather than spikes.
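Putting this together, here is a sketch of a Prometheus rule implementing the checkpoint-duration entry from the warning table above; the alert name, labels, and annotation text are illustrative:

```yaml
groups:
  - name: flink-trends
    rules:
      # Fire on a sustained trend, not a single slow checkpoint: the
      # derivative must stay positive for the entire 30-minute window.
      - alert: FlinkCheckpointDurationIncreasing
        expr: deriv(flink_jobmanager_job_lastCheckpointDuration[30m]) > 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Checkpoint duration has been trending upward for 30 minutes"
```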
Group related alerts. If a job is not running, do not also fire alerts for zero throughput, growing lag, and failed checkpoints. Use Alertmanager’s inhibition rules to suppress downstream symptoms when the root cause alert is already firing.
inhibit_rules:
- source_matchers:
- alertname="FlinkJobNotRunning"
target_matchers:
- alertname=~"FlinkCheckpoint.*|FlinkThroughput.*|FlinkLag.*"
equal: ['job_name']
Practical Dashboard for a CDC Pipeline
A single-pane dashboard for a CDC pipeline should have four rows.
Row 1 - Overview: Job status (running/failed), total records in/out per second, end-to-end latency (consumer lag converted to time), and uptime.
Row 2 - Throughput: Records in per second by source operator, records out per second by sink operator, byte throughput, and a 7-day overlay for comparison.
Row 3 - Health Indicators: Checkpoint duration over time, checkpoint size over time, backpressure heatmap by operator, and busyTimeMsPerSecond by subtask.
Row 4 - Kafka Metrics: Consumer lag per partition, committed offset vs latest offset, consumer group rebalance events, and produce-to-consume latency.
Each panel should have a threshold line marking the alert boundary so that on-call engineers can immediately see how close a metric is to firing an alert.
Operational Runbook
Job Not Running
- Check the Flink dashboard for the job status and exception history
- Review TaskManager logs for the root cause (OOM, connectivity failure, deserialization error)
- If OOM, increase TaskManager memory or reduce state size before restarting
- Restart from the latest checkpoint - never restart from scratch if a valid checkpoint exists
Checkpoint Failures
- Check lastCheckpointAlignmentDuration - if high, look for data skew across subtasks
- Check TaskManager disk I/O - checkpoints write to disk, and saturated disks cause timeouts
- Increase execution.checkpointing.timeout as a temporary measure while investigating
- If state is too large, consider enabling incremental checkpoints or tuning RocksDB
Consumer Lag Growing
- Verify the source (Kafka) throughput - is there a traffic spike, or has the job slowed down?
- Check backpressure metrics to find the bottleneck operator
- If the sink is the bottleneck, check destination system health (query latency, connection pool, rate limits)
- If a processing operator is saturated, increase parallelism for that operator
High Backpressure
- Identify the root bottleneck (first non-backpressured operator with high busyTimeMsPerSecond)
- If it is a sink operator, the destination system is the bottleneck - check batch size, connection pool, and destination load
- If it is a processing operator, profile the operator logic for inefficiencies
- Increase parallelism for the bottleneck operator, ensuring downstream operators can handle the increased throughput
State Size Growing Unexpectedly
- Identify which operator’s state is growing by checking per-operator state sizes
- Review the operator logic for unbounded state (missing TTL, windows that never fire)
- If a join or deduplication operator, verify the key space is bounded
- Configure state TTL for operators that should expire old entries:
StateTtlConfig.newBuilder(Time.hours(24)).build()
Platforms like Streamkap handle much of this operational burden automatically, providing pre-built monitoring dashboards and alerting for Flink-based CDC pipelines. But whether you manage your own Flink cluster or use a managed platform, understanding these metrics is essential. The goal is not to eliminate every alert - it is to ensure that every alert you receive is actionable and leads you directly to the fix.