
Engineering

February 25, 2026

9 min read

Data Type Conversion in Real-Time Pipelines

Learn how to handle data type conversions in streaming pipelines - timestamps, numeric precision, string encodings, and cross-database type mapping for reliable data delivery.

TL;DR:

  • Data type conversion is one of the most common sources of pipeline failures - timestamps, numeric precision, and string encodings differ across databases and require explicit mapping.
  • The highest-risk conversions involve timestamps (timezone handling), decimals (precision loss), and binary data (encoding).
  • Implementing type conversion in the pipeline layer catches mismatches before they cause destination errors or silent data corruption.

Every database speaks its own dialect when it comes to data types. PostgreSQL has TIMESTAMPTZ. MySQL has DATETIME. MongoDB stores dates as BSON Date objects. Snowflake has TIMESTAMP_LTZ, TIMESTAMP_NTZ, and TIMESTAMP_TZ. When you build a streaming pipeline that moves data between any two of these systems, every single row passes through a type conversion boundary - and that boundary is where pipelines silently break.

Type conversion errors rarely announce themselves with a loud crash. Instead, they show up as a timestamp shifted by eight hours in a dashboard, a financial total off by fractions of a cent that compound into material discrepancies, or a text field truncated mid-character because the destination assumed Latin-1 while the source sent UTF-8. These are the kinds of bugs that survive in production for weeks before someone notices.

This guide walks through the most common type conversion challenges in real-time CDC pipelines, the specific failure modes each one produces, and the defensive patterns that prevent them.

Timestamp Conversions

Timestamps are the single most error-prone data type in cross-database pipelines. The root problem is deceptively simple: different databases store time differently, and the conversion between representations is not always lossless.

TIMESTAMP vs TIMESTAMPTZ

PostgreSQL distinguishes between TIMESTAMP WITHOUT TIME ZONE (stores a bare date-time with no timezone context) and TIMESTAMP WITH TIME ZONE (converts input to UTC for storage, then converts back to the session timezone on retrieval). These look similar in a query result but behave completely differently during replication.

When a CDC pipeline reads a TIMESTAMPTZ value from the PostgreSQL WAL, it gets UTC. That is clean and unambiguous. But when it reads a TIMESTAMP WITHOUT TIME ZONE, it gets a bare value like 2026-02-25 14:30:00 with no indication of which timezone that belongs to. If the application server was in America/New_York, that value is Eastern Time. If the pipeline assumes UTC, every timestamp is now five hours off (four during daylight saving time).

MySQL makes this worse. DATETIME stores a literal date-time with no timezone conversion. TIMESTAMP converts to UTC on storage and back to the session timezone on retrieval - but in the binary log used by CDC, MySQL emits TIMESTAMP columns as UTC integers and DATETIME as-is. If your pipeline does not know which column type it is reading, it cannot interpret the value correctly.

Epoch Formats and Precision

Some systems emit timestamps as Unix epochs. Debezium, the CDC engine underlying many streaming platforms including Streamkap, represents timestamps as milliseconds since epoch for TIMESTAMP types and as ISO-8601 strings or epoch days for DATE types. The precision matters: a source that stores microseconds will lose the last three digits if the pipeline truncates to milliseconds.

-- PostgreSQL: microsecond precision
SELECT now()::timestamp;
-- 2026-02-25 14:30:00.123456

-- Debezium CDC output: millisecond epoch
-- 1772029800123 (microseconds truncated)

-- Snowflake TIMESTAMP_NTZ: nanosecond precision
-- No data loss if mapped correctly, but the pipeline must preserve precision

Best Practice

Normalize every timestamp to UTC with explicit timezone metadata as early as possible in the pipeline. If the source is timezone-naive, document the assumed timezone and apply the conversion in the transformation layer before writing to the destination.
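As a sketch of this normalization step - assuming a Python transformation layer and America/New_York as the documented source timezone (both illustrative choices, not part of any specific product):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Hypothetical documented assumption: the timezone-naive source column
# was written by application servers running in America/New_York.
ASSUMED_SOURCE_TZ = ZoneInfo("America/New_York")

def normalize_to_utc(value: datetime) -> datetime:
    """Attach the documented source timezone if needed, then convert to UTC."""
    if value.tzinfo is not None:
        return value.astimezone(timezone.utc)  # already aware: just convert
    return value.replace(tzinfo=ASSUMED_SOURCE_TZ).astimezone(timezone.utc)

# 14:30 Eastern (EST, UTC-5 in February) becomes 19:30 UTC
local = datetime(2026, 2, 25, 14, 30, 0)
print(normalize_to_utc(local).isoformat())  # 2026-02-25T19:30:00+00:00
```

Because ZoneInfo is DST-aware, the same call applied to a July value would shift by four hours instead of five - exactly the ambiguity that a hardcoded offset would get wrong half the year.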

Numeric Precision

Financial data, scientific measurements, and aggregated metrics all depend on numeric precision. The gap between exact-decimal arithmetic and floating-point approximation is the source of some of the most expensive bugs in data engineering.

DECIMAL vs FLOAT vs INTEGER

DECIMAL(p, s) (also NUMERIC) stores exact values with a defined precision (total digits) and scale (digits after the decimal). FLOAT and DOUBLE store IEEE 754 floating-point approximations. INTEGER, BIGINT, and SMALLINT store whole numbers.

The dangerous conversion is DECIMAL to FLOAT. Consider:

-- PostgreSQL source
CREATE TABLE invoices (
    amount NUMERIC(38, 10)
);
INSERT INTO invoices VALUES (1234567890.1234567890);

-- If pipeline converts to DOUBLE:
-- 1234567890.1234567165 (precision lost after the 7th decimal digit)

-- If pipeline maps to Snowflake NUMBER(38,10):
-- 1234567890.1234567890 (exact)

This is not a theoretical concern. If your pipeline intermediary format uses JSON (as many Kafka-based pipelines do), and the JSON serializer treats all numbers as IEEE 754 doubles, you lose precision on any DECIMAL with more than 15 significant digits. The fix is to serialize large decimals as strings and deserialize them as DECIMAL at the destination.
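A minimal demonstration of both the failure and the fix, using Python's json and decimal modules as a stand-in for the pipeline's serializer:

```python
import json
from decimal import Decimal

amount = Decimal("1234567890.1234567890")  # NUMERIC(38,10) from the source

# Naive serialization: float() collapses the value to an IEEE 754 double,
# which cannot represent this decimal exactly
lossy = json.loads(json.dumps({"amount": float(amount)}))
print(Decimal(repr(lossy["amount"])) == amount)  # False: precision lost

# Defensive pattern: serialize the decimal as a string in transit...
safe = json.loads(json.dumps({"amount": str(amount)}))
# ...and reconstruct it as an exact Decimal at the destination
print(Decimal(safe["amount"]) == amount)  # True: exact round trip
```

Debezium offers an equivalent choice at the connector level; representing decimals as strings or exact byte arrays rather than doubles is the same trade: slightly more verbose transit format in exchange for exactness.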

Integer Overflow

MySQL UNSIGNED BIGINT has a maximum value of 18,446,744,073,709,551,615. PostgreSQL does not have unsigned types, so its BIGINT tops out at 9,223,372,036,854,775,807. If your pipeline moves an unsigned bigint value above the signed maximum into a system that only supports signed integers, you get either an overflow error or silent wraparound to a negative number.
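A bounds check in the pipeline layer catches this before the destination does. A sketch in Python (arbitrary-precision integers make the comparison safe to express):

```python
# Signed 64-bit range, as used by PostgreSQL BIGINT and most destinations
SIGNED_BIGINT_MIN = -(2**63)
SIGNED_BIGINT_MAX = 2**63 - 1   # 9,223,372,036,854,775,807

def fits_signed_bigint(value: int) -> bool:
    """Return True if a source integer fits a signed 64-bit destination column."""
    return SIGNED_BIGINT_MIN <= value <= SIGNED_BIGINT_MAX

print(fits_signed_bigint(9_223_372_036_854_775_807))   # True: exactly at the limit
print(fits_signed_bigint(18_446_744_073_709_551_615))  # False: unsigned max overflows
```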

String and Encoding

Character Encoding

Most modern databases default to UTF-8, but “UTF-8” does not always mean the same thing. MySQL’s utf8 character set is actually a 3-byte subset that cannot store characters outside the Basic Multilingual Plane - including most emoji. MySQL’s utf8mb4 is true UTF-8. If a pipeline reads from a utf8mb4 column and writes to a system that expects 3-byte UTF-8, any 4-byte character (emoji, some CJK characters, mathematical symbols) will either be truncated or cause an error.

-- Source (MySQL utf8mb4): "Hello 🌍"
-- Destination with 3-byte UTF-8: "Hello " or "Hello ???" or error
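Detecting these characters before the write is straightforward: any code point above U+FFFF requires four bytes in UTF-8. A small pre-write check, sketched in Python:

```python
def has_4byte_utf8(text: str) -> bool:
    """True if the string contains characters outside the Basic Multilingual
    Plane, which a 3-byte-UTF-8 destination (e.g. MySQL's legacy utf8,
    now called utf8mb3) cannot store."""
    return any(ord(ch) > 0xFFFF for ch in text)

print(has_4byte_utf8("Hello world"))  # False: all ASCII
print(has_4byte_utf8("Hello 🌍"))     # True: the emoji is U+1F30D
print(len("🌍".encode("utf-8")))      # 4: one character, four bytes
```

Depending on the pipeline's policy, a hit can trigger rejection, replacement with U+FFFD, or routing the row to a dead-letter queue - the important part is deciding explicitly rather than letting the destination truncate.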

String Length: CHAR vs VARCHAR vs TEXT

CHAR(10) is fixed-width and right-padded with spaces. VARCHAR(255) has a maximum length. TEXT is effectively unlimited. When moving data from TEXT to VARCHAR(255), any value longer than 255 characters is either truncated or rejected. The pipeline must either validate lengths or use a destination type with sufficient capacity.

PostgreSQL’s VARCHAR without a length specifier is effectively unlimited, but Snowflake’s VARCHAR defaults to 16,777,216 characters. BigQuery’s STRING has no length limit. Knowing the destination constraints before you configure the pipeline prevents runtime surprises.

Boolean and Enum

Boolean Representation

PostgreSQL BOOLEAN stores true/false. MySQL’s BOOLEAN is an alias for TINYINT(1), which stores 0 or 1. Some systems use 'T'/'F' strings, 'Y'/'N', or even 'yes'/'no'. A CDC pipeline reading MySQL’s boolean column gets an integer 0 or 1 from the binlog, not a native boolean.

If your destination is Snowflake, its BOOLEAN type accepts TRUE, FALSE, 0, 1, 'true', 'false', and several other variants. BigQuery is stricter - BOOL requires actual boolean values, not integers. The pipeline transformation layer must normalize the representation.
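One way to sketch that normalization (the accepted token sets here are illustrative - align them with what your sources actually emit):

```python
# Token sets covering the representations discussed above; extend as needed
_TRUE = {"true", "t", "y", "yes", "1"}
_FALSE = {"false", "f", "n", "no", "0"}

def normalize_bool(value):
    """Coerce the many boolean representations seen in CDC output into a
    real bool (or None) before writing to a strict destination like
    BigQuery's BOOL."""
    if value is None or isinstance(value, bool):
        return value
    if isinstance(value, int):          # MySQL TINYINT(1): 0 or 1
        return value != 0
    token = str(value).strip().lower()
    if token in _TRUE:
        return True
    if token in _FALSE:
        return False
    raise ValueError(f"Unrecognized boolean representation: {value!r}")

print(normalize_bool(1), normalize_bool("F"), normalize_bool("yes"))
# True False True
```

Raising on unrecognized input (rather than defaulting to false) is deliberate: a silent default is exactly the kind of corruption this guide is trying to prevent.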

Enum Types

PostgreSQL ENUM is a custom type defined at the database level. In CDC output, enum values appear as their string representation. MySQL ENUM is defined inline in the column definition and stored as an integer index internally, but Debezium emits the string value. Both map cleanly to VARCHAR or STRING at the destination, but if you are relying on the ordinal position of the enum value, that information is lost in transit.

Binary and LOB Data

BYTEA in PostgreSQL, BLOB in MySQL, and Binary in MongoDB all store raw binary data. Most streaming pipeline intermediaries (Kafka with JSON or Avro serialization) cannot represent raw binary inline. The standard approach is base64 encoding.

-- Source: 48 bytes of binary data (a SHA-384 hash)
-- Kafka JSON: "aGVsbG8gd29ybGQ..." (base64 string, ~64 characters)
-- Destination: decode base64 back to binary, or store as string

The concern here is size. Base64 encoding increases data size by approximately 33% (four output characters per three input bytes). For large binary objects (images, documents, serialized objects), this overhead compounds across millions of rows. Some destinations like Snowflake support BINARY types natively and can accept hex-encoded input - though note that hex doubles the byte count (two characters per byte), so base64 remains the more compact of the two text encodings.
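The size arithmetic is easy to verify for the 48-byte hash in the example above:

```python
import base64

raw = bytes(48)                 # e.g., a SHA-384 digest is 48 bytes
b64 = base64.b64encode(raw)     # 4 characters per 3 input bytes
hexed = raw.hex()               # 2 characters per input byte

print(len(raw), len(b64), len(hexed))  # 48 64 96
```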

For very large objects (LOBs exceeding a few megabytes), the best practice is to store the binary data in object storage (S3, GCS) and stream only a reference URI through the pipeline.

NULL Semantics

NULL handling is one of those areas where every database agrees on the concept but disagrees on the implementation details.

PostgreSQL and MySQL both support NULL as a distinct value meaning “unknown” or “absent.” But an empty string ('') is a different value from NULL in both systems. Oracle, famously, treats empty strings as NULL. If your pipeline moves data from Oracle to PostgreSQL, every empty string becomes NULL at the source, and there is no way to recover the original intent.

MongoDB uses null as a BSON type, but a missing field is semantically different from a field explicitly set to null. CDC output may or may not distinguish between these cases depending on the connector configuration.

When writing to destinations, NULL handling affects default values and NOT NULL constraints. If the destination column has a DEFAULT clause, some databases apply the default when they receive NULL (if the column is NOT NULL) while others reject the insert. The pipeline must decide: pass NULL through and let the destination handle it, or apply COALESCE logic in the transformation layer.

Cross-Database Type Mapping Table

The following table shows common type mappings across popular source-destination pairs:

Source Type       | PostgreSQL to Snowflake      | MySQL to BigQuery        | MongoDB to Snowflake
------------------|------------------------------|--------------------------|------------------------------
Integer / int     | INTEGER -> NUMBER(38,0)      | INT -> INT64             | NumberInt -> NUMBER(38,0)
Big Integer       | BIGINT -> NUMBER(38,0)       | BIGINT -> INT64          | NumberLong -> NUMBER(38,0)
Decimal           | NUMERIC(p,s) -> NUMBER(p,s)  | DECIMAL(p,s) -> NUMERIC  | NumberDecimal -> NUMBER(38,18)
Float             | REAL -> FLOAT                | FLOAT -> FLOAT64         | double -> FLOAT
Boolean           | BOOLEAN -> BOOLEAN           | TINYINT(1) -> BOOL       | bool -> BOOLEAN
Timestamp (tz)    | TIMESTAMPTZ -> TIMESTAMP_TZ  | TIMESTAMP -> TIMESTAMP   | Date -> TIMESTAMP_NTZ
Timestamp (no tz) | TIMESTAMP -> TIMESTAMP_NTZ   | DATETIME -> DATETIME     | ---
String            | VARCHAR -> VARCHAR           | VARCHAR(n) -> STRING     | string -> VARCHAR
Text              | TEXT -> VARCHAR              | TEXT -> STRING           | string -> VARCHAR
Binary            | BYTEA -> BINARY              | BLOB -> BYTES            | BinData -> BINARY
JSON              | JSONB -> VARIANT             | JSON -> JSON             | (native) -> VARIANT
UUID              | UUID -> VARCHAR(36)          | ---                      | UUID -> VARCHAR(36)
Enum              | ENUM -> VARCHAR              | ENUM -> STRING           | ---
Array             | ARRAY -> ARRAY               | ---                      | array -> ARRAY

Practical Example: Mixed-Type CDC Pipeline

Consider a PostgreSQL source table with several high-risk column types:

CREATE TABLE transactions (
    id              UUID DEFAULT gen_random_uuid(),
    created_at      TIMESTAMPTZ NOT NULL,
    processed_at    TIMESTAMP WITHOUT TIME ZONE,
    amount          NUMERIC(18, 4) NOT NULL,
    fee_percent     DOUBLE PRECISION,
    currency        VARCHAR(3),
    is_refund       BOOLEAN DEFAULT false,
    metadata        JSONB,
    receipt_pdf     BYTEA
);

When this table is captured via CDC and streamed to Snowflake, the pipeline must handle each column differently:

  • id: UUID serialized as a 36-character string. Map to VARCHAR(36).
  • created_at: TIMESTAMPTZ arrives as UTC epoch milliseconds. Map to TIMESTAMP_TZ.
  • processed_at: TIMESTAMP arrives as epoch milliseconds but has no timezone. Map to TIMESTAMP_NTZ and document the assumed timezone.
  • amount: NUMERIC(18,4) must preserve exact decimal precision. Serialize as string in the intermediary, map to NUMBER(18,4).
  • fee_percent: DOUBLE PRECISION is already floating-point. Map to FLOAT. Acceptable for non-financial percentages.
  • currency: VARCHAR(3) maps directly to VARCHAR(3).
  • is_refund: BOOLEAN maps directly to BOOLEAN.
  • metadata: JSONB maps to VARIANT in Snowflake with no transformation needed.
  • receipt_pdf: BYTEA is base64-encoded in transit. Map to BINARY and decode, or store as VARCHAR if downstream consumers expect base64.

Streamkap handles these mappings automatically for standard CDC pipelines, preserving decimal precision and timezone semantics without manual configuration. For edge cases - like the timezone-naive processed_at column - you can add a transformation step to apply explicit UTC conversion.

Defensive Patterns

Explicit CAST at the Destination

Never rely on implicit type coercion. If the destination supports it, use explicit CAST or TRY_CAST in your loading logic:

-- Snowflake: safe cast that returns NULL instead of error
SELECT TRY_CAST(raw_value AS NUMBER(18,4)) AS amount
FROM staging_table;

COALESCE for Default Values

When NULL semantics differ between source and destination, use COALESCE to apply sensible defaults:

-- Replace NULL with empty string for non-nullable destination columns
SELECT COALESCE(customer_name, '') AS customer_name
FROM staging_table;

-- Replace NULL boolean with false
SELECT COALESCE(is_active, false) AS is_active
FROM staging_table;

Validation Before Write

Add a validation step in the pipeline that checks for type mismatches before writing to the destination. Common checks include:

  • Timestamp range: Is the value between 1970-01-01 and 2100-01-01? Values outside this range often indicate epoch-format misinterpretation (seconds vs milliseconds vs microseconds).
  • Numeric bounds: Does the value fit in the destination column’s precision and scale? A NUMERIC(10,2) column cannot store 12345678901.23.
  • String length: Is the value within the destination’s VARCHAR limit?
  • Encoding validity: Is the byte sequence valid UTF-8?
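The four checks above can be sketched as a single pre-write validation function. This assumes one specific destination shape - NUMERIC(10,2) and VARCHAR(255) columns - purely for illustration:

```python
# Illustrative bounds; derive these from your actual destination schema
EPOCH_MS_MIN = 0                    # 1970-01-01 in epoch milliseconds
EPOCH_MS_MAX = 4_102_444_800_000    # 2100-01-01 in epoch milliseconds

def validate_row(ts_ms: int, amount: float, name: str, raw: bytes) -> list[str]:
    """Return a list of validation errors; empty means the row is writable."""
    errors = []
    if not (EPOCH_MS_MIN <= ts_ms <= EPOCH_MS_MAX):
        errors.append("timestamp out of range: check epoch unit (s/ms/us)")
    if abs(amount) >= 10**8:        # NUMERIC(10,2) allows at most 8 integer digits
        errors.append("numeric overflow for NUMERIC(10,2)")
    if len(name) > 255:             # destination column is VARCHAR(255)
        errors.append("string exceeds VARCHAR(255)")
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        errors.append("invalid UTF-8 byte sequence")
    return errors

# A microsecond epoch misread as milliseconds trips the range check,
# an 11-digit amount trips the precision check, and a lone 0xFF byte
# trips the encoding check
print(validate_row(1_772_029_800_123_456, 12345678901.23, "ok", b"\xff"))
```

Rows that fail validation are best routed to a dead-letter queue with the error list attached, so the mismatch surfaces immediately instead of as corrupted data downstream.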

Schema-Level Documentation

Maintain a type mapping document for every pipeline. When the source schema changes (a column type is altered, a new column is added), review the mapping before the change propagates. Automated schema evolution, as discussed in our guide on schema change management, handles structural changes - but type-level semantic decisions (is this new TIMESTAMP column timezone-aware?) still benefit from human review.

Type conversion in streaming pipelines is not glamorous work, but it is foundational. Get it right once, encode the rules in your pipeline configuration, and the data arrives at the destination exactly as the source intended. Get it wrong, and you spend weeks hunting down a rounding error in last quarter’s revenue report. The investment in explicit, well-documented type mapping pays for itself the first time it prevents a silent data corruption incident.