
Engineering

February 25, 2026


Timestamp Handling in Streaming Pipelines: Timezones, Formats, and Event Time

A practical guide to handling timestamps correctly in real-time data pipelines. Covers timezone conversion, format normalization, event time extraction, and common pitfalls.

TL;DR:

  • Always store and process timestamps in UTC internally, converting to local time only at the presentation layer.
  • Timestamp format inconsistencies across sources are one of the most common causes of pipeline failures.
  • Event time should be extracted from the record itself, not from the system clock at processing time.
  • CDC timestamps from database WALs are reliable event time sources because they reflect when the transaction actually committed.

Timestamps seem simple until they break your pipeline at 2:00 AM on the second Sunday of March. A daylight saving time transition causes a one-hour gap in your event windows. Your hourly aggregation job produces a null row. Downstream dashboards go blank. Your on-call engineer spends three hours tracing the issue back to a single TIMESTAMP WITHOUT TIME ZONE column in a source database.

This is not a hypothetical scenario. Timestamp handling is one of the most quietly destructive sources of bugs in streaming data systems. The problems are rarely dramatic enough to trigger alerts immediately, but they compound over time into data quality issues that erode trust in your entire pipeline.

UTC as the Internal Standard

The single most impactful decision you can make for timestamp handling is to adopt UTC as the canonical timezone for all internal processing. Every record flowing through your pipeline, every intermediate state store, and every message on your Kafka topics should carry timestamps in UTC.

The reasoning is straightforward. UTC has no daylight saving transitions, no political timezone changes, and no ambiguity. When you process everything in UTC:

  • Window boundaries are consistent regardless of where your servers are located
  • Joins between streams from different geographic regions align correctly
  • Aggregations over time do not produce gaps or overlaps during DST transitions
  • Log correlation across services becomes trivial

The conversion to local time should happen at exactly one place: the presentation layer. When a user in Tokyo needs to see an event time, your application converts from UTC to JST at render time. When an analyst in London queries the data warehouse, their BI tool applies the Europe/London offset. The data itself stays in UTC.
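As a minimal sketch of that boundary in Python (assuming the standard-library zoneinfo module and its tz database are available): the pipeline only ever holds UTC-aware values, and a single render function applies the viewer's timezone.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Internal representation: always a UTC-aware datetime.
event_utc = datetime(2026, 2, 25, 14, 30, tzinfo=timezone.utc)

def render_for_viewer(ts_utc: datetime, viewer_tz: str) -> str:
    """Presentation layer: the ONLY place local time appears."""
    return ts_utc.astimezone(ZoneInfo(viewer_tz)).isoformat()

tokyo = render_for_viewer(event_utc, "Asia/Tokyo")      # JST is UTC+9
london = render_for_viewer(event_utc, "Europe/London")  # GMT in February
```

The stored value never changes; only the rendered string differs per viewer.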

A common mistake is storing timestamps in the database server’s local timezone or, worse, in the application server’s local timezone. This creates silent data corruption when servers are migrated across regions, when cloud instances spin up in unexpected availability zones, or when DST rules change (which happens more often than you might think).

If you are building on Streamkap, CDC events captured from the source database’s write-ahead log already include transaction commit timestamps. These timestamps reflect the actual moment the data changed, and normalizing them to UTC early in the pipeline means your downstream consumers never have to guess about timezone semantics.

Format Normalization: Taming the Zoo

In theory, there is one timestamp format: ISO 8601. In practice, you will encounter all of these in a single pipeline:

  • 2026-02-25T14:30:00Z (ISO 8601 with Z suffix)
  • 2026-02-25T14:30:00+00:00 (ISO 8601 with explicit offset)
  • 2026-02-25 14:30:00 (no timezone indicator at all)
  • 1772029800000 (Unix epoch in milliseconds)
  • 1772029800 (Unix epoch in seconds)
  • 02/25/2026 9:30 AM EST (US locale format with abbreviated timezone)
  • 25-Feb-2026 14:30:00.000 (custom format with abbreviated month)
  • 20260225143000 (compact numeric format from legacy mainframes)

Each source system has its own conventions. PostgreSQL defaults to ISO 8601-ish output but the exact format depends on the DateStyle setting. MySQL’s DATETIME type has no timezone information at all. MongoDB stores dates as millisecond epoch internally but its drivers may serialize them differently. APIs might return RFC 2822, RFC 3339, or something entirely custom.

The fix is to normalize timestamps at the earliest possible point in your pipeline. Ideally, this happens in the source connector or the first transformation step, before any joins or aggregations. Your normalization layer should:

  1. Parse the incoming format into a language-native datetime object
  2. Apply timezone context if the source format lacks timezone information (this requires documentation about what each source actually means by a bare timestamp)
  3. Convert to UTC
  4. Serialize to your canonical format (ISO 8601 with Z suffix or epoch milliseconds)

The second step deserves extra attention. When a source sends 2026-02-25 14:30:00 without any timezone indicator, you have to know whether that means UTC, the database server’s timezone, or the application’s timezone. This is metadata that must be documented per source, and it is one of the most common causes of subtle off-by-hours bugs.
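The four steps can be sketched in Python; the SOURCE_TZ mapping here is a hypothetical stand-in for wherever you document each source's bare-timestamp semantics.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Documented per-source semantics for bare (naive) timestamps -- step 2
# depends on this metadata existing somewhere authoritative.
SOURCE_TZ = {"orders_db": "America/Chicago", "events_api": "UTC"}

def normalize(raw: str, source: str) -> str:
    """Parse, attach source timezone, convert to UTC, serialize (steps 1-4)."""
    dt = datetime.fromisoformat(raw)              # step 1: parse
    if dt.tzinfo is None:                         # step 2: bare timestamp
        dt = dt.replace(tzinfo=ZoneInfo(SOURCE_TZ[source]))
    dt = dt.astimezone(timezone.utc)              # step 3: convert to UTC
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")      # step 4: canonical form

normalize("2026-02-25 14:30:00", "orders_db")  # CST in February, so +6h to UTC
```

If a record arrives from a source with no SOURCE_TZ entry, failing loudly (here, a KeyError) is better than silently guessing.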

For internal processing in streaming engines like Apache Flink, epoch milliseconds are often the most efficient choice. They are a simple long value, trivial to compare and sort, and carry no timezone ambiguity because epoch time is inherently UTC-relative. For storage in data warehouses or for human-readable contexts, ISO 8601 with the Z suffix is the better choice because it sorts lexicographically and is universally parseable.
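A small illustration of both canonical forms. One caveat worth making explicit: the lexicographic-sort property of ISO 8601 strings only holds when every value is serialized at the same fixed precision, which is why the sketch pins milliseconds.

```python
from datetime import datetime, timezone

def to_epoch_ms(dt: datetime) -> int:
    """Epoch milliseconds: a plain long, inherently UTC-relative."""
    return int(dt.timestamp() * 1000)

def to_iso_z(epoch_ms: int) -> str:
    """Canonical ISO 8601 Z at fixed millisecond precision; fixed-width
    fields are what make lexicographic order match time order."""
    dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{epoch_ms % 1000:03d}Z"

a = to_epoch_ms(datetime(2026, 2, 25, 14, 30, tzinfo=timezone.utc))
# Comparing epoch millis is plain integer comparison...
assert a < a + 500
# ...and fixed-precision ISO Z strings sort the same way.
assert to_iso_z(a) < to_iso_z(a + 500)
```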

Event Time vs. Processing Time

Streaming systems operate with two distinct notions of time, and confusing them is a source of real data quality problems.

Event time is when something actually happened in the real world. A user clicked a button at 14:30:00. A transaction committed at 14:30:01. A sensor recorded a temperature at 14:30:02. This time is embedded in the record itself.

Processing time is when your streaming engine happens to process that record. If your pipeline has a five-second lag, the processing time for that 14:30:00 click might be 14:30:05. If there is a backlog from a temporary outage, the processing time could be minutes or hours after the event time.

For almost all analytical use cases, you want event time. Processing time is unreliable because it depends on system load, network delays, consumer lag, and reprocessing scenarios. If you window by processing time and then replay a topic from an earlier offset, your aggregations will be completely wrong because events from last Tuesday will land in today’s windows.
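The difference is easy to see in a sketch of event-time window assignment: because the bucket is derived from the record's own timestamp, replaying a topic reproduces exactly the same windows.

```python
HOUR_MS = 3_600_000

def hourly_window_start(event_time_ms: int) -> int:
    """Assign a record to its hourly window using event time carried in
    the record itself -- replay yields identical assignments, because
    nothing here depends on when the pipeline runs."""
    return event_time_ms - event_time_ms % HOUR_MS

# A click at 14:30:00Z lands in the 14:00 window no matter whether it is
# processed live, five seconds late, or replayed next week.
assert hourly_window_start(1772029800000) == 1772028000000  # 14:00:00Z
```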

Extracting Event Time from CDC Streams

Change Data Capture is one of the most reliable sources of event time in a streaming pipeline. When a CDC connector like Debezium reads from a PostgreSQL WAL or a MySQL binlog, it extracts the transaction commit timestamp directly from the database’s transaction log. This timestamp reflects when the database actually persisted the change, not when the connector read it.

In a Debezium-based CDC event, the event time lives in the source.ts_ms field of the envelope. This is the millisecond-precision timestamp of the transaction commit. Streamkap’s CDC connectors propagate this timestamp through the pipeline, giving downstream processors a reliable event time without additional configuration.

For non-CDC sources like API polling or file ingestion, event time extraction is trickier. You need to identify a field in the record that represents “when this thing happened” and configure your pipeline to use that field. Sometimes the right field is obvious (created_at, event_timestamp). Sometimes there is no good candidate, and you are stuck approximating with ingestion time.
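A hedged sketch of that extraction logic; the candidate field names are hypothetical and would need to match what each source's schema actually documents.

```python
from datetime import datetime, timezone

# Hypothetical candidate fields, in order of preference.
CANDIDATES = ("event_timestamp", "created_at")

def extract_event_time_ms(record: dict, ingestion_time_ms: int) -> int:
    """Prefer a documented event-time field from the record; fall back to
    ingestion time only when no candidate is present."""
    for field in CANDIDATES:
        if field in record:
            dt = datetime.fromisoformat(record[field])
            if dt.tzinfo is None:
                # Bare timestamp: assumed UTC here -- document this per source.
                dt = dt.replace(tzinfo=timezone.utc)
            return int(dt.timestamp() * 1000)
    return ingestion_time_ms  # last resort: approximate with ingestion time
```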

Apache Flink’s windowing and time-based operations depend on watermarks to track the progress of event time across a stream. A watermark is a declaration: “I believe all events with a timestamp less than or equal to W have arrived.” Once a watermark passes a window boundary, Flink closes that window and emits results.

The challenge is that real-world data is never perfectly ordered. Network delays, source system batching, and multi-partition consumption all cause events to arrive out of order. If your watermark strategy is too aggressive (allowing very little lateness), you will drop valid events. If it is too conservative (allowing minutes of lateness), your windows will not close promptly and your results will be delayed.

A common watermark strategy is BoundedOutOfOrdernessWatermarks with a maximum lateness parameter. Setting this to, say, 5 seconds means Flink will wait 5 seconds past the highest observed event time before closing a window. The right value depends on your source characteristics:

  • CDC from a single database: Out-of-orderness is typically minimal (sub-second) because the WAL is strictly ordered. A 1-2 second bound is usually sufficient.
  • Multi-region event streams: Events from different regions may arrive with several seconds of skew. A 5-10 second bound is more appropriate.
  • API polling with variable intervals: If your poll interval is 30 seconds, events can arrive with up to 30 seconds of disorder. Your watermark bound needs to account for this.

For Flink SQL users, you assign watermarks in the table definition:

CREATE TABLE orders (
    order_id STRING,
    amount DECIMAL(10, 2),
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'orders',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);

Events that arrive after the watermark has passed their window are considered “late.” Flink can handle late data through allowed lateness and side outputs, but these add complexity. The better approach is to understand your sources well enough to set watermark bounds that capture the vast majority of events on time.
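To make the mechanics concrete, here is a toy Python model of a bounded-out-of-orderness watermark (not the Flink API itself): the watermark trails the highest observed event time by the bound, and anything at or behind the watermark is classified as late.

```python
class BoundedOutOfOrderness:
    """Toy watermark model: watermark = max observed event time - bound."""

    def __init__(self, bound_ms: int):
        self.bound_ms = bound_ms
        self.max_ts = float("-inf")

    def observe(self, event_ts_ms: int) -> bool:
        """Returns True if the event is on time, False if late."""
        late = event_ts_ms <= self.watermark()
        self.max_ts = max(self.max_ts, event_ts_ms)
        return not late

    def watermark(self) -> float:
        return self.max_ts - self.bound_ms

wm = BoundedOutOfOrderness(bound_ms=5_000)
wm.observe(10_000)            # advances the watermark to 5_000
assert wm.observe(6_000)      # out of order but within the bound: on time
assert not wm.observe(4_000)  # behind the watermark: late
```

Raising the bound admits more disorder at the cost of slower window results, which is exactly the trade-off described above.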

Timestamp Precision: Millis, Micros, and Nanos

Not all timestamps are created equal in terms of precision, and mismatches between systems can cause subtle bugs.

Precision    | Resolution | Common Sources
------------ | ---------- | ----------------------------------
Seconds      | 1 s        | Legacy systems, some APIs
Milliseconds | 1 ms       | Kafka, most databases, Debezium
Microseconds | 1 µs       | PostgreSQL, ClickHouse, Snowflake
Nanoseconds  | 1 ns       | InfluxDB, some IoT sensors

The most common problem arises when you join or compare timestamps across sources with different precisions. A PostgreSQL TIMESTAMPTZ has microsecond precision. A Kafka record timestamp has millisecond precision. If you are comparing CDC event times against Kafka metadata timestamps, you need to truncate or round to the coarser precision. Failing to do this can cause join misses where two timestamps that represent the same moment do not match because one has extra precision digits.
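A minimal sketch of the truncation step, assuming a microsecond-precision commit timestamp from PostgreSQL and a millisecond-precision Kafka record timestamp describing the same moment.

```python
def truncate_us_to_ms(ts_us: int) -> int:
    """Truncate a microsecond timestamp to the coarser millisecond
    precision before comparing across sources."""
    return ts_us // 1000

pg_commit_us = 1772029800123456  # PostgreSQL TIMESTAMPTZ: microseconds
kafka_ts_ms = 1772029800123      # Kafka record timestamp: milliseconds

# A naive comparison misses even though both describe the same instant...
assert pg_commit_us / 1000 != kafka_ts_ms
# ...while truncating to the coarser precision makes them comparable.
assert truncate_us_to_ms(pg_commit_us) == kafka_ts_ms
```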

Another gotcha: epoch timestamps without context. Is 1740494400 seconds or milliseconds? As seconds, it represents February 25, 2025; as milliseconds, it represents January 21, 1970, barely three weeks after the epoch. Make the opposite mistake and interpret a 13-digit millisecond value as seconds, and you land tens of thousands of years in the future. The number of digits is a clue (seconds-since-epoch values are currently 10 digits, millisecond values are 13 digits), but relying on heuristics is fragile. Document the precision for every timestamp field in your schema registry.

When Streamkap processes CDC events, the timestamps maintain the precision of the source database. PostgreSQL microsecond timestamps flow through as microseconds. This matters when your destination also supports microsecond precision (like Snowflake or ClickHouse), because you preserve the full fidelity of the source data without rounding artifacts.

Cross-Source Timestamp Alignment

When your pipeline joins streams from multiple sources, timestamp alignment becomes a real engineering problem. Consider a pipeline that joins user clickstream events from a web analytics API with order records from a PostgreSQL database via CDC. The clickstream timestamps come from the browser’s JavaScript Date.now(), which is based on the user’s device clock. The order timestamps come from the database’s now() function, based on the server clock.

These two clocks are not synchronized. User devices can be minutes or even hours off from actual time. Even server clocks, if not running NTP, can drift by seconds. Your join logic needs to account for this.

Practical approaches include:

  • Use server-side timestamps whenever possible. If the web application records a server-side timestamp when the click event is received, use that instead of the client-side timestamp. It will be much closer to the database clock.
  • Apply clock skew tolerance in joins. Instead of an exact timestamp match, use a range-based join with a tolerance window (e.g., match events within a 10-second window).
  • Add ingestion timestamps as a fallback. Record when each system first received the event. This gives you a secondary time dimension for correlation.
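The tolerance-window idea can be sketched with a toy in-memory join (a streaming engine would express this as an interval join); the field names are illustrative.

```python
def tolerant_join(clicks, orders, tolerance_ms=10_000):
    """Range-based join: pair a click with an order when their event
    times fall within a tolerance window, absorbing clock skew."""
    pairs = []
    for c in clicks:
        for o in orders:
            if c["user"] == o["user"] and abs(c["ts"] - o["ts"]) <= tolerance_ms:
                pairs.append((c["ts"], o["ts"]))
    return pairs

clicks = [{"user": "u1", "ts": 1772029800000}]
orders = [{"user": "u1", "ts": 1772029807000}]  # 7 seconds of clock skew
assert tolerant_join(clicks, orders)            # matches despite the skew
```

An exact-match join on these records would produce nothing; the 10-second tolerance absorbs the skew between the browser clock and the database clock.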

The Daylight Saving Time Trap

DST transitions are the most predictable source of timestamp bugs, yet they catch teams off guard every year. The core issue: if any part of your pipeline uses local time for windowing, aggregation, or partitioning, DST will cause problems.

During the spring-forward transition (e.g., 2:00 AM becomes 3:00 AM in US Eastern), there is no 2:30 AM. An hourly window that spans 2:00-3:00 AM local time contains zero minutes of actual time. During the fall-back transition, there are two 1:30 AMs. An hourly window spanning 1:00-2:00 AM contains 120 minutes of data crammed into what should be a 60-minute bucket.

If you process entirely in UTC, DST does not exist. Your windows are always exactly 60 minutes. The only place DST matters is when converting for display, and that is the presentation layer’s problem.

For teams that must partition data by local date (e.g., daily tables partitioned by business date in a specific timezone), compute the date boundaries in UTC. For the America/New_York timezone, “business date 2026-03-08” is the spring-forward day: it runs from 2026-03-08T05:00:00Z to 2026-03-09T04:00:00Z and contains only 23 hours. “Business date 2026-03-09” runs from 2026-03-09T04:00:00Z to 2026-03-10T04:00:00Z, a full 24 hours at the daylight offset. Precompute these boundaries in UTC and use them as your partition keys.
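One way to precompute those boundaries, assuming Python 3.9+ with zoneinfo and a system tz database: let the tz database resolve local midnight, so DST transitions fall out automatically.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

def business_date_bounds_utc(date_str: str, tz_name: str):
    """UTC [start, end) instants for one local business date. Going
    through the tz database means DST days get 23- or 25-hour spans
    without any special-casing."""
    tz = ZoneInfo(tz_name)
    start_local = datetime.fromisoformat(date_str).replace(tzinfo=tz)
    end_local = (datetime.fromisoformat(date_str) + timedelta(days=1)).replace(tzinfo=tz)
    return start_local.astimezone(timezone.utc), end_local.astimezone(timezone.utc)

# The 2026 spring-forward day in America/New_York is only 23 hours long.
start, end = business_date_bounds_utc("2026-03-08", "America/New_York")
```

Storing these UTC instants as partition bounds keeps the pipeline itself DST-free; only this boundary computation ever touches a named timezone.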

Putting It Into Practice

The rules for timestamps in streaming pipelines are not complicated, but they require discipline:

  1. Adopt UTC internally. No exceptions. Convert at the edges.
  2. Normalize formats early. Parse and convert in the source connector or first transformation.
  3. Extract event time from the record. Never rely on processing time for analytics.
  4. Document precision and timezone semantics per source. Put this in your schema registry or data catalog.
  5. Set watermark bounds based on source characteristics. Do not guess; measure actual out-of-orderness.
  6. Test DST transitions explicitly. Run your pipeline against synthetic data that spans a DST boundary.

Streamkap handles several of these automatically for CDC use cases. Source database timestamps are extracted, normalized, and propagated through the pipeline with their original precision. But for multi-source pipelines with heterogeneous timestamp formats, the normalization and alignment work falls on your pipeline design. Getting it right from the start saves you from the 2:00 AM pages that timestamp bugs inevitably produce.