
Engineering

February 25, 2026

10 min read

Flink Job Monitoring: Key Metrics and Alerting Strategies

Learn which Flink metrics to monitor in production - throughput, latency, checkpoints, state size, and backpressure. Build dashboards and alerts that catch issues before they become outages.

TL;DR:

  • The five essential Flink metrics to monitor are records in/out (throughput), checkpoint duration and failure rate, state size, backpressure ratio, and consumer lag.
  • Alert on trends, not just thresholds - a gradually increasing checkpoint duration is a leading indicator of problems, long before a checkpoint actually fails.
  • A well-instrumented Flink job surfaces issues in minutes; an unmonitored job fails silently for hours.

Introduction: Why Monitoring Is Not Optional

A Flink job that runs without monitoring is a time bomb. It might process millions of records per second today and silently fall behind tomorrow. Without metrics, you will not know until downstream consumers complain, dashboards go stale, or data is simply missing.

Production Flink deployments demand the same observability rigor you would apply to any stateful distributed system. The challenge is that Flink exposes hundreds of metrics across the JobManager, TaskManagers, operators, and connectors. Not all of them matter equally. Knowing which metrics to watch, how to visualize them, and when to alert is what separates a resilient pipeline from one that wakes you up at 3 AM.

This guide covers the five metric categories that matter most for production Flink jobs, walks through a Prometheus and Grafana setup, and provides an alerting strategy that catches real problems without drowning you in noise.

Throughput Metrics

Throughput is the most intuitive health signal. If records stop flowing, something is wrong.

Key Metrics

  • numRecordsInPerSecond - Records entering an operator per second
  • numRecordsOutPerSecond - Records leaving an operator per second
  • numBytesInPerSecond / numBytesOutPerSecond - Byte-level throughput

What to Watch

Monitor throughput at the operator level, not just the job level. A job-level aggregate can mask a single stalled subtask. If one subtask of a KeyedProcessFunction operator drops to zero records out while the others stay healthy, you likely have data skew or a partition problem.

The ratio between records in and records out for a given operator is also telling. A filter operator with a 99.9% drop rate might be correct, or it might indicate a schema change upstream that is causing every record to fail a predicate.

# Per-operator throughput across all subtasks
sum by (task_name) (
  flink_taskmanager_job_task_numRecordsInPerSecond
)

Compare current throughput against a baseline. A CDC pipeline reading from a transactional database will have natural peaks and valleys that follow application traffic patterns. Use a 7-day rolling comparison rather than a static threshold to detect meaningful deviations.
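A rolling comparison like this can be expressed directly in PromQL. The metric name below matches the per-operator query above; the 50% threshold is an illustrative starting point, not a recommendation:

```promql
# Fire when current throughput falls below half of its own 7-day average
sum(flink_taskmanager_job_task_numRecordsInPerSecond)
  < 0.5 * sum(avg_over_time(flink_taskmanager_job_task_numRecordsInPerSecond[7d]))
```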

Checkpoint Metrics

Checkpoints are the backbone of Flink’s exactly-once guarantees. A failing checkpoint means your job cannot recover to a consistent state if it crashes.

Key Metrics

  • lastCheckpointDuration - How long the most recent checkpoint took (ms)
  • lastCheckpointSize - Size of the most recent checkpoint (bytes)
  • lastCheckpointAlignmentDuration - Time subtasks spent waiting for barrier alignment
  • numberOfCompletedCheckpoints - Total successful checkpoints
  • numberOfFailedCheckpoints - Total failed checkpoints

What to Watch

Checkpoint duration is a leading indicator. A checkpoint that normally takes 5 seconds and is now taking 30 seconds is not yet a failure, but it is heading toward one. If checkpoint duration exceeds the checkpoint interval, checkpoints will queue up and eventually time out.

Alignment duration reveals whether barrier alignment is causing latency spikes. In aligned checkpoints (the default for exactly-once), slow subtasks force faster subtasks to buffer data while waiting. If alignment duration is a significant fraction of total checkpoint duration, you may have data skew or a slow disk on one TaskManager.
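To see whether alignment dominates, divide the two checkpoint gauges. The metric names below follow the same reporter naming convention used elsewhere in this guide; verify the exact names your reporter exposes, since checkpoint metric scoping varies across Flink versions:

```promql
# Fraction of the last checkpoint spent waiting on barrier alignment
flink_jobmanager_job_lastCheckpointAlignmentDuration
  / flink_jobmanager_job_lastCheckpointDuration
```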

# Checkpoint duration trend over time
flink_jobmanager_job_lastCheckpointDuration / 1000

A single failed checkpoint is not necessarily an emergency. Network blips happen. But consecutive failures - or a failure rate that is climbing - indicates a systemic issue that requires intervention.
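A sketch of a Prometheus alerting rule that encodes this "consecutive failures, not blips" policy - the threshold of 2 and the 10-minute window are illustrative values to tune, and the metric name assumes the default Flink Prometheus reporter naming:

```yaml
groups:
  - name: flink-checkpoints
    rules:
      - alert: FlinkCheckpointFailuresClimbing
        # More than two failures inside 10 minutes; a single blip stays silent
        expr: increase(flink_jobmanager_job_numberOfFailedCheckpoints[10m]) > 2
        for: 10m
        labels:
          severity: critical
```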

State Metrics

Flink operators that use keyed state (windows, aggregations, joins) store that state in a state backend. Unbounded state growth is one of the most common causes of Flink job degradation.

Key Metrics

  • lastCheckpointSize - Total state size (a proxy when using incremental checkpoints)
  • RocksDB metrics (when using RocksDB state backend):
    • rocksdb.estimate-num-keys - Number of keys in the state
    • rocksdb.estimate-live-data-size - Estimated live data size
    • rocksdb.compaction-pending - Pending compactions
    • rocksdb.block-cache-usage - Block cache memory usage

What to Watch

State size should be relatively stable for a well-designed job. A monotonically increasing state size usually means one of these things: a window that never triggers, a join with no eviction policy, or a broadcast state that accumulates records.

Track state size as a time series and set alerts on the rate of change. A CDC pipeline that deduplicates by primary key should have a state size proportional to the number of unique keys in the source table. If state is growing faster than the source table, something is accumulating that should not be.

RocksDB compaction metrics matter when you are running large state. If compaction-pending stays high, RocksDB cannot keep up with writes, and read performance will degrade. This often manifests as increasing checkpoint duration before any other visible symptom.
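Note that RocksDB native metrics are off by default and must be switched on individually in flink-conf.yaml before they appear in your dashboards. The options below are the standard state.backend.rocksdb.metrics.* switches for the metrics listed above:

```yaml
# flink-conf.yaml: enable the RocksDB metrics referenced above
state.backend.rocksdb.metrics.estimate-num-keys: true
state.backend.rocksdb.metrics.estimate-live-data-size: true
state.backend.rocksdb.metrics.compaction-pending: true
state.backend.rocksdb.metrics.block-cache-usage: true
```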

# State size growth rate (bytes per hour)
deriv(flink_jobmanager_job_lastCheckpointSize[1h]) * 3600

Backpressure Metrics

Backpressure means a downstream operator cannot keep up with the rate at which an upstream operator is sending data. It is the most common cause of throughput drops in Flink jobs.

Key Metrics

  • isBackPressured - Boolean flag per subtask (1 = backpressured)
  • busyTimeMsPerSecond - Milliseconds per second the operator thread is busy (0-1000)
  • backPressuredTimeMsPerSecond - Milliseconds per second the operator is backpressured
  • idleTimeMsPerSecond - Milliseconds per second the operator is idle

What to Watch

The key insight with backpressure is that the bottleneck operator is typically not the one showing the backpressure flag. Backpressure propagates upstream. If operator C is slow, operator B will show backpressured, and eventually operator A will too. To find the root cause, look for the first operator in the chain that is NOT backpressured but has a high busyTimeMsPerSecond - that is your bottleneck.

# Find operators with high busy time (potential bottlenecks)
flink_taskmanager_job_task_busyTimeMsPerSecond > 900

A busyTimeMsPerSecond above 900 (90% utilization) is a warning sign. Above 950 and you have almost no headroom for traffic spikes. The fix depends on the bottleneck: increase parallelism, optimize the operator logic, or scale up the downstream system that the sink is writing to.

Consumer Lag

For Flink jobs that consume from Kafka (which includes most CDC pipelines), consumer lag is the single most important end-to-end health metric.

Key Metrics

  • records-lag-max - Maximum lag in records across all partitions for a consumer group
  • committed-offset vs current-offset - Gap between what has been committed back to Kafka and what has been consumed
  • Kafka-side consumer group lag - Measured externally via kafka-consumer-groups.sh or Kafka monitoring tools

What to Watch

Consumer lag tells you whether the Flink job is keeping up with the rate of incoming data. A lag of zero means the job is consuming records as fast as they are produced. A steadily growing lag means the job is falling behind.

Short spikes in lag during traffic peaks are normal if the job catches up afterward. Sustained growth is a problem. The time to alert is not when lag exceeds a threshold, but when lag has been growing continuously for N minutes.
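That policy translates into alerting on the derivative rather than the absolute value. The 15-minute "for" duration here is an assumption - tune N to your own traffic pattern:

```yaml
- alert: FlinkConsumerLagGrowing
  # Lag has been increasing continuously, not merely high
  expr: deriv(kafka_consumer_records_lag_max[5m]) > 0
  for: 15m
  labels:
    severity: critical
```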

# Consumer lag growth rate (records per minute)
deriv(kafka_consumer_records_lag_max[5m]) * 60

For CDC pipelines, commit lag is equally important. Flink might consume records from Kafka quickly, but if it cannot write them to the destination (because the sink is slow or the destination is unavailable), the committed offset will fall behind the consumed offset.

Setting Up Prometheus and Grafana

Add the Prometheus metric reporter to flink-conf.yaml:

metrics.reporter.prometheus.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prometheus.port: 9249

# Expose system metrics for JVM monitoring
metrics.system-resource: true
metrics.system-resource-probing-interval: 5000

For Kubernetes deployments, add a pod annotation so Prometheus discovers Flink TaskManagers automatically:

annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "9249"
  prometheus.io/path: "/metrics"

Prometheus Scrape Configuration

scrape_configs:
  - job_name: 'flink'
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

For non-Kubernetes deployments, use static targets listing each TaskManager’s hostname and port.

Grafana Dashboard Templates

The Flink community provides a well-maintained Grafana dashboard (ID 14911) that covers the basics. Import it as a starting point, then customize with the panels described in the practical dashboard section below.

Alerting Strategy

Critical Alerts (Page On-Call)

These indicate data loss risk or pipeline failure. They require immediate attention.

Alert                | Condition                                      | For Duration
Job Not Running      | flink_jobmanager_job_uptime == 0               | 2 minutes
Checkpoint Failures  | increase(numberOfFailedCheckpoints[10m]) > 0   | 10 minutes
Consumer Lag Growing | deriv(kafka_consumer_records_lag_max[5m]) > 0  | 5 minutes sustained

Warning Alerts (Notify Channel)

These are leading indicators that something will become a problem if not addressed.

Alert                          | Condition                                                               | For Duration
Checkpoint Duration Increasing | deriv(lastCheckpointDuration[30m]) > 0                                  | 30 minutes
High Backpressure              | busyTimeMsPerSecond > 900                                               | 10 minutes
State Size Growing             | deriv(lastCheckpointSize[1h]) > threshold                               | 1 hour
Throughput Drop                | numRecordsInPerSecond < 0.5 * avg_over_time(numRecordsInPerSecond[7d]) | 15 minutes

Avoiding Alert Fatigue

The most common mistake is alerting on instantaneous values. A single checkpoint taking longer than usual is noise. A trend of increasing checkpoint duration over 30 minutes is signal. Use deriv() and avg_over_time() in PromQL to detect trends rather than spikes.
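As a sketch, here is the same checkpoint-duration signal expressed both ways; the second form is the one worth alerting on (the 30-second threshold in the first is an arbitrary illustration):

```promql
# Noisy: fires on any single slow checkpoint
flink_jobmanager_job_lastCheckpointDuration > 30000

# Signal: duration has been trending upward for the whole 30-minute window
deriv(flink_jobmanager_job_lastCheckpointDuration[30m]) > 0
```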

Group related alerts. If a job is not running, do not also fire alerts for zero throughput, growing lag, and failed checkpoints. Use Alertmanager’s inhibition rules to suppress downstream symptoms when the root cause alert is already firing.

inhibit_rules:
  - source_matchers:
      - alertname="FlinkJobNotRunning"
    target_matchers:
      - alertname=~"FlinkCheckpoint.*|FlinkThroughput.*|FlinkLag.*"
    equal: ['job_name']

Practical Dashboard for a CDC Pipeline

A single-pane dashboard for a CDC pipeline should have four rows.

Row 1 - Overview: Job status (running/failed), total records in/out per second, end-to-end latency (consumer lag converted to time), and uptime.

Row 2 - Throughput: Records in per second by source operator, records out per second by sink operator, byte throughput, and a 7-day overlay for comparison.

Row 3 - Health Indicators: Checkpoint duration over time, checkpoint size over time, backpressure heatmap by operator, and busyTimeMsPerSecond by subtask.

Row 4 - Kafka Metrics: Consumer lag per partition, committed offset vs latest offset, consumer group rebalance events, and produce-to-consume latency.

Each panel should have a threshold line marking the alert boundary so that on-call engineers can immediately see how close a metric is to firing an alert.

Operational Runbook

Job Not Running

  1. Check the Flink dashboard for the job status and exception history
  2. Review TaskManager logs for the root cause (OOM, connectivity failure, deserialization error)
  3. If OOM, increase TaskManager memory or reduce state size before restarting
  4. Restart from the latest checkpoint - never restart from scratch if a valid checkpoint exists

Checkpoint Failures

  1. Check lastCheckpointAlignmentDuration - if high, look for data skew across subtasks
  2. Check TaskManager disk I/O - checkpoints write to disk, and saturated disks cause timeouts
  3. Increase execution.checkpointing.timeout as a temporary measure while investigating
  4. If state is too large, consider enabling incremental checkpoints or tuning RocksDB
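The timeout adjustment in step 3 might look like this in flink-conf.yaml - the values are placeholders to adapt while you investigate, not recommendations:

```yaml
# flink-conf.yaml (sketch: illustrative values, not tuning advice)
execution.checkpointing.timeout: 20 min
execution.checkpointing.interval: 5 min
# Guarantee breathing room between checkpoints while the job is under pressure
execution.checkpointing.min-pause: 1 min
```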

Consumer Lag Growing

  1. Verify the source (Kafka) throughput - is there a traffic spike, or has the job slowed down?
  2. Check backpressure metrics to find the bottleneck operator
  3. If the sink is the bottleneck, check destination system health (query latency, connection pool, rate limits)
  4. If throughput is the bottleneck, increase parallelism for the saturated operator

High Backpressure

  1. Identify the root bottleneck (first non-backpressured operator with high busyTimeMsPerSecond)
  2. If it is a sink operator, the destination system is the bottleneck - check batch size, connection pool, and destination load
  3. If it is a processing operator, profile the operator logic for inefficiencies
  4. Increase parallelism for the bottleneck operator, ensuring downstream operators can handle the increased throughput

State Size Growing Unexpectedly

  1. Identify which operator’s state is growing by checking per-operator state sizes
  2. Review the operator logic for unbounded state (missing TTL, windows that never fire)
  3. If a join or deduplication operator, verify the key space is bounded
  4. Configure state TTL for operators that should expire old entries: StateTtlConfig.newBuilder(Time.hours(24)).build()

Platforms like Streamkap handle much of this operational burden automatically, providing pre-built monitoring dashboards and alerting for Flink-based CDC pipelines. But whether you manage your own Flink cluster or use a managed platform, understanding these metrics is essential. The goal is not to eliminate every alert - it is to ensure that every alert you receive is actionable and leads you directly to the fix.