Data Observability for Streaming Pipelines: Metrics That Matter
Learn which metrics to monitor for streaming data pipeline health - throughput, latency, error rates, and data quality indicators that prevent outages and data corruption.
Your Kafka cluster is green. CPU usage is normal. Memory looks fine. And yet somehow, your analytics dashboard has been serving stale data for the last three hours. Sound familiar? This scenario plays out constantly at organizations that rely solely on infrastructure monitoring for their streaming pipelines. The servers are healthy, but the data is broken - and nobody knows until a stakeholder notices numbers that do not add up.
Infrastructure monitoring tells you whether your systems are running. Data observability tells you whether your data is correct. For streaming pipelines, where data flows continuously and errors propagate in real time, that distinction is the difference between catching a problem in two minutes and catching it in two days.
The Five Pillars of Streaming Data Observability
Borrowing from the framework popularized by Barr Moses and the Monte Carlo team, data observability rests on five pillars. Each one answers a fundamental question about the data flowing through your pipeline.
Freshness
Is data arriving on time? Freshness measures the time gap between when an event occurs at the source and when it becomes available at the destination. For a CDC pipeline reading from PostgreSQL, freshness is the difference between a row being updated in Postgres and that update landing in your warehouse.
A freshness SLO might look like: “95% of change events arrive at Snowflake within 60 seconds of the source commit.” When freshness degrades, it usually signals consumer lag, network issues, or backpressure in the pipeline.
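A check against an SLO like that one can be sketched in a few lines. This is illustrative, not a prescribed implementation: the SLO constant and how you fetch the newest loaded-at timestamp will vary with your destination.

```python
from datetime import datetime, timezone

# Example SLO target; tune to your pipeline's actual commitment
FRESHNESS_SLO_SECONDS = 60

def freshness_seconds(newest_loaded_at, now=None):
    """Age of the newest record at the destination, in seconds."""
    now = now or datetime.now(timezone.utc)
    return (now - newest_loaded_at).total_seconds()

def breaches_freshness_slo(newest_loaded_at, now=None):
    return freshness_seconds(newest_loaded_at, now) > FRESHNESS_SLO_SECONDS
```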
Volume
Is the expected amount of data arriving? Volume anomalies are among the most reliable early warning signals. If your orders table normally streams 10,000 records per hour and suddenly drops to 500, something is wrong upstream - even if the pipeline itself reports no errors.
Volume monitoring requires establishing baselines. A table that processes 50 records per hour on weekends and 5,000 on weekday mornings needs time-aware thresholds, not a single static number.
Schema
Has the data structure changed? Schema drift is one of the most insidious problems in streaming pipelines. A developer adds a column to a source table, renames a field, or changes a data type - and suddenly your downstream transformations break, your warehouse loads fail, or worse, data silently lands in the wrong columns.
Effective schema monitoring detects changes the moment they appear in the stream.
Distribution
Are values within expected ranges? Distribution monitoring catches data quality issues that pass every other check. The pipeline is fresh, the volume is normal, the schema has not changed - but suddenly 40% of your email column is null, or order_total has negative values. Statistical profiling of column-level distributions catches these anomalies.
Lineage
Where did this data come from, and where does it go? Lineage maps the path data takes from source to destination and every transformation in between. When something breaks, lineage tells you what is affected downstream. When a dashboard shows wrong numbers, lineage tells you which upstream system to investigate.
Infrastructure Metrics: The Foundation
Before you can monitor data quality, you need healthy infrastructure. These are the metrics that keep the lights on.
Kafka Consumer Lag
Consumer lag - the offset difference between the latest produced message and the latest consumed message - is the single most important infrastructure metric for Kafka-based streaming pipelines. Rising lag means your consumers cannot keep up with the rate of incoming data.
# Check consumer lag with kafka-consumer-groups
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group my-pipeline-consumer-group
Monitor lag at the partition level, not just the topic level. A single slow partition can hide behind healthy aggregate numbers.
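To see why aggregates mislead, it helps to summarize lag per partition explicitly. This sketch assumes you have already parsed a mapping of partition to lag (from the command output above or from JMX metrics):

```python
def lag_summary(partition_lags):
    """partition_lags: mapping of partition id -> (log-end offset - committed offset)."""
    total = sum(partition_lags.values())
    worst = max(partition_lags, key=partition_lags.get)
    return {
        "total_lag": total,
        "avg_lag": total / len(partition_lags),
        "max_lag": partition_lags[worst],
        "worst_partition": worst,
    }
```

With lags of 10, 12, 5000, and 8, the average looks tolerable while partition 2 is badly behind - alert on max_lag, not just avg_lag.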
Flink Checkpoint Duration and Failures
If you are using Apache Flink for stream processing, checkpoint health is critical. Checkpoints are Flink’s mechanism for fault tolerance - if they start taking longer or failing entirely, your pipeline is at risk of data loss during restarts.
Key Flink metrics to track:
- lastCheckpointDuration - how long the most recent checkpoint took
- numberOfFailedCheckpoints - should be zero in a healthy pipeline
- lastCheckpointSize - sudden size changes indicate state growth issues
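A simple health gate over those metrics might look like the following sketch. The 60-second duration threshold is an assumption - tune it relative to your checkpoint interval:

```python
def checkpoint_healthy(metrics, max_duration_ms=60_000):
    """metrics: dict keyed by the Flink checkpoint metric names above."""
    return (
        metrics["numberOfFailedCheckpoints"] == 0
        and metrics["lastCheckpointDuration"] <= max_duration_ms
    )
```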
Resource Utilization
Standard infrastructure metrics still matter. Track CPU, memory, disk I/O, and network throughput for every component in the pipeline - source connectors, Kafka brokers, stream processors, and sink connectors. The goal is not just alerting on high utilization, but correlating resource spikes with data quality issues.
Data-Level Metrics: What Really Matters
Infrastructure metrics tell you the pipeline is running. Data-level metrics tell you the data is right.
Null Rates Per Column
Track the percentage of null values for every column in every table you stream. Establish baselines and alert on deviations. A column that is normally 2% null jumping to 30% null is a data quality incident, even if the pipeline is running perfectly.
-- Example: Monitoring null rates in a destination warehouse
SELECT
  COUNT(*) AS total_rows,
  COUNT(*) FILTER (WHERE email IS NULL) AS null_emails,
  ROUND(
    100.0 * COUNT(*) FILTER (WHERE email IS NULL) / COUNT(*),
    2
  ) AS null_email_pct
FROM orders
WHERE _streamkap_loaded_at > NOW() - INTERVAL '1 hour';
Volume Trends and Anomalies
Track records-per-minute at the table level and compare against historical baselines. The most effective approach combines time-of-day awareness with day-of-week patterns:
# Simple volume anomaly detection. get_historical_counts and alert are
# assumed helpers: one queries your metrics store, the other notifies.
import statistics

def check_volume_anomaly(table, current_count, lookback_days=14):
    # Baseline: the same hour on the same weekday over the lookback window
    historical = get_historical_counts(
        table,
        same_hour=True,
        same_day_of_week=True,
        days=lookback_days,
    )
    mean = statistics.mean(historical)
    stddev = statistics.stdev(historical)
    # Flag values more than three standard deviations from the baseline
    if abs(current_count - mean) > 3 * stddev:
        alert(f"Volume anomaly on {table}: "
              f"expected ~{mean:.0f}, got {current_count}")
Schema Change Detection
Log every schema change event with a timestamp, the affected table, and the specific change (column added, removed, renamed, or type changed). Maintain a schema registry or changelog that your team can review.
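Detecting those changes reduces to diffing consecutive schema snapshots. A sketch, assuming schemas are represented as column-name-to-type mappings (note that a rename shows up as one removal plus one addition unless you track column identity another way):

```python
def diff_schema(old, new):
    """old/new: mappings of column name -> type, e.g. {"email": "varchar"}."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "type_changed": sorted(
            col for col in set(old) & set(new) if old[col] != new[col]
        ),
    }
```

Log the resulting diff with a timestamp and table name, and you have the changelog entry described above.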
Value Distribution Profiling
Beyond null rates, profile the statistical distribution of numeric columns and the cardinality of categorical columns. Alert when:
- A numeric column’s mean or standard deviation shifts significantly
- A categorical column gains unexpected new values
- A column’s distinct count drops dramatically (possible data duplication)
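The first two checks can be sketched directly; the baseline statistics are assumed to come from your historical profiling runs, and the three-sigma threshold mirrors the volume check earlier:

```python
import statistics

def mean_shifted(current_values, baseline_mean, baseline_stdev, z=3.0):
    """Flag a numeric column whose mean drifted beyond z baseline stdevs."""
    if baseline_stdev == 0:
        return statistics.mean(current_values) != baseline_mean
    return abs(statistics.mean(current_values) - baseline_mean) > z * baseline_stdev

def unexpected_categories(current_values, known_values):
    """Values in a categorical column that were never seen before."""
    return sorted(set(current_values) - set(known_values))
```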
Building an Observability Stack
A practical streaming observability stack combines metrics collection, visualization, and data quality checks.
Prometheus and Grafana
Prometheus is the standard for collecting time-series metrics from Kafka, Flink, and connector frameworks. Grafana provides dashboards and alerting on top of those metrics.
A typical setup exposes JMX metrics from Kafka brokers and consumers via the Prometheus JMX Exporter, scrapes them into Prometheus, and visualizes them in Grafana dashboards organized by pipeline.
# prometheus.yml - Scrape config for Kafka metrics
scrape_configs:
  - job_name: 'kafka-brokers'
    static_configs:
      - targets: ['broker-1:7071', 'broker-2:7071', 'broker-3:7071']
  - job_name: 'kafka-connect'
    static_configs:
      - targets: ['connect-1:7072']
Custom Data Quality Metrics
Infrastructure metrics come from Prometheus. Data quality metrics often need custom collection. Build lightweight checks that run continuously against your destination and publish results as Prometheus metrics or send them to a dedicated quality tracking system.
Key custom metrics to emit:
- data_freshness_seconds{table="orders"} - age of the newest record
- data_null_rate{table="orders", column="email"} - current null percentage
- data_volume_per_minute{table="orders"} - recent ingestion rate
- data_schema_version{table="orders"} - current schema version hash
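You can publish these through a Prometheus client library, or - if you want to avoid the dependency - render the text exposition format yourself, since it is just name, labels, and value. A minimal sketch:

```python
def format_metric(name, labels, value):
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"
```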
Anomaly Detection: Beyond Static Thresholds
Static thresholds break down quickly. A table that processes 10,000 records per hour during business hours and 200 at night needs dynamic thresholds. There are several practical approaches.
Rolling standard deviation is the simplest effective method. Compare the current metric value against the mean and standard deviation of the same hour across the last two to four weeks. Alert at three standard deviations for critical issues, two for warnings.
Exponential moving averages (EMA) adapt to gradual trends. They are useful for metrics that naturally grow over time, like total volume, where a static baseline would produce constant false positives.
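An EMA update is one line of arithmetic, and a percentage-deviation check on top of it makes a workable anomaly test. The alpha and tolerance values here are illustrative starting points:

```python
def ema_update(prev_ema, observation, alpha=0.1):
    """One EMA step; higher alpha makes the average react faster."""
    return alpha * observation + (1 - alpha) * prev_ema

def ema_anomaly(observation, ema, tolerance_pct=50.0):
    """Flag observations deviating from the EMA by more than tolerance_pct."""
    if ema == 0:
        return observation != 0
    return abs(observation - ema) / ema * 100 > tolerance_pct
```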
Seasonal decomposition handles weekly and monthly patterns. Libraries like Python’s statsmodels can decompose a time series into trend, seasonal, and residual components. Alerting on the residual component filters out expected variation.
The right approach depends on your data patterns. Start simple with rolling standard deviation and add complexity only when false positive rates demand it.
Alerting Strategy: Signal Without Noise
The fastest path to alert fatigue is alerting on everything equally. A sustainable alerting strategy distinguishes between severity levels, alerts at the right granularity, and actively suppresses noise.
Critical vs. Warning
Critical alerts demand immediate action. They page your on-call engineer:
- Pipeline fully stopped (zero throughput for more than five minutes)
- Consumer lag exceeding your freshness SLO
- Error rate above 1% of total records
- Data loss detected (gap in sequence numbers or timestamps)
Warning alerts go to a Slack channel for investigation during business hours:
- Freshness degradation (above 50% of SLO but below 100%)
- Volume anomaly (two standard deviations from baseline)
- Schema change detected
- Null rate increase beyond historical range
Per-Table Granularity
Alert at the table level, not the pipeline level. A pipeline streaming 50 tables might have 49 healthy tables and one with a critical volume drop. A pipeline-level aggregate would average out the problem. Per-table alerting catches it immediately.
Alert Fatigue Prevention
Implement deduplication windows - if a volume anomaly alert fires at 2:00 PM, suppress duplicates for 30 minutes. Use escalation chains: a warning that persists for an hour upgrades to critical. And always include runbook links in alert messages so the responder knows exactly what to check.
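The deduplication window described above fits in a small class. This sketch takes the current timestamp as a parameter (rather than calling time.time() internally) so it is easy to test; the 30-minute default matches the example:

```python
class AlertDeduper:
    """Suppress duplicate alerts for the same key inside a time window."""

    def __init__(self, window_seconds=1800):  # 30-minute dedup window
        self.window = window_seconds
        self.last_fired = {}

    def should_fire(self, key, now):
        """now: current Unix timestamp, passed in for testability."""
        last = self.last_fired.get(key)
        if last is not None and now - last < self.window:
            return False
        self.last_fired[key] = now
        return True
```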
Practical Example: Observability for a CDC-to-Snowflake Pipeline
Consider a concrete setup: PostgreSQL source, CDC via Debezium, Kafka as the message bus, and Snowflake as the destination. Here is what a complete observability configuration looks like.
Source monitoring: Track the PostgreSQL replication slot lag (pg_stat_replication.replay_lag). If the slot falls behind, Debezium will fall behind.
Connector monitoring: Track Debezium connector status (RUNNING, PAUSED, FAILED), task-level error counts, and snapshot progress if running an initial load.
Kafka monitoring: Track consumer lag per partition, broker under-replicated partitions, and request latency percentiles.
Destination monitoring: Track Snowflake ingestion latency (time from Kafka to queryable in Snowflake), COPY command success and failure rates, and warehouse credit consumption.
Data quality monitoring: After data lands in Snowflake, run continuous checks on freshness (newest _loaded_at timestamp), volume (records per table per hour), null rates on critical columns, and row count reconciliation between source and destination.
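Row count reconciliation usually allows a small tolerance, because in-flight records mean the two sides are never exactly equal at any instant. A sketch of the comparison (the 0.1% tolerance is an assumption to tune):

```python
def counts_reconcile(source_count, dest_count, tolerance_pct=0.1):
    """True if source and destination row counts agree within tolerance_pct."""
    if source_count == 0:
        return dest_count == 0
    drift_pct = abs(source_count - dest_count) / source_count * 100
    return drift_pct <= tolerance_pct
```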
Platforms like Streamkap provide much of this observability out of the box - pipeline-level latency tracking, throughput monitoring, error detection, and schema change alerts - reducing the amount of custom tooling you need to build and maintain.
Data Lineage: Tracing the Full Path
Lineage answers the question: when something goes wrong, what else is affected?
In a streaming context, lineage maps the journey of data from source table through Kafka topics, transformations, and into destination tables, views, and dashboards. When a schema change breaks the orders topic, lineage tells you which Snowflake tables, dbt models, and Looker dashboards depend on that data.
Building Lineage
There are two approaches: runtime lineage, which instruments the pipeline to log which records flow through which components, and compile-time lineage, which parses configurations and SQL to build a static dependency graph. For most teams, compile-time lineage combined with topic-level runtime metadata is sufficient.
Impact Analysis
When you detect an issue - say, a freshness violation on the payments Kafka topic - lineage lets you immediately enumerate every downstream consumer: the payments table in Snowflake, the revenue_daily dbt model, the finance dashboard, and the ML model training pipeline. You can proactively notify those teams instead of waiting for them to discover the problem independently.
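Once lineage is stored as a dependency graph, impact analysis is a graph traversal. A sketch using breadth-first search, with asset names mirroring the example above:

```python
from collections import deque

def downstream(graph, node):
    """graph: mapping of asset -> list of direct downstream dependents."""
    seen, queue = set(), deque([node])
    while queue:
        for dep in graph.get(queue.popleft(), ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen
```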
Observability Tools: Open Source and Commercial
The tooling for data observability has matured significantly. Here is a practical breakdown.
Open Source Options
Great Expectations lets you define data quality expectations as code - expected column types, value ranges, null rates, uniqueness constraints - and validate them against your data. It integrates well with orchestrators like Airflow and can run checks against warehouse tables on a schedule.
Soda Core provides a YAML-based syntax for defining data quality checks and runs against most SQL databases.
OpenMetadata and DataHub provide lineage tracking, metadata management, and data quality monitoring in unified open-source platforms.
Commercial Options
Monte Carlo, Bigeye, and Anomalo offer managed data observability with ML-powered anomaly detection, automated lineage, and incident management.
For streaming-specific observability, tools that understand Kafka offsets, consumer groups, and connector health - rather than just querying warehouse tables - provide significantly faster detection. A warehouse-only check that runs every 15 minutes has a detection floor of 15 minutes. A streaming-native check that monitors consumer lag in real time can detect issues in seconds.
Data observability for streaming pipelines is not optional - it is the difference between running a pipeline and trusting a pipeline. Infrastructure monitoring keeps your systems alive. Data observability keeps your data correct.
Start with the five pillars: freshness, volume, schema, distribution, and lineage. Instrument both your infrastructure (Kafka lag, Flink checkpoints, resource utilization) and your data (null rates, volume trends, schema changes). Build anomaly detection that adapts to your data’s natural patterns, and craft an alerting strategy that surfaces real problems without drowning your team in noise.
The teams that invest in observability early build features. The teams that skip it apologize for bad data.