Data Type Conversion in Real-Time Pipelines
Learn how to handle data type conversions in streaming pipelines - timestamps, numeric precision, string encodings, and cross-database type mapping for reliable data delivery.
Every database speaks its own dialect when it comes to data types. PostgreSQL has TIMESTAMPTZ. MySQL has DATETIME. MongoDB stores dates as BSON Date objects. Snowflake has TIMESTAMP_LTZ, TIMESTAMP_NTZ, and TIMESTAMP_TZ. When you build a streaming pipeline that moves data between any two of these systems, every single row passes through a type conversion boundary - and that boundary is where pipelines silently break.
Type conversion errors rarely announce themselves with a loud crash. Instead, they show up as a timestamp shifted by eight hours in a dashboard, a financial total off by fractions of a cent that compound into material discrepancies, or a text field truncated mid-character because the destination assumed Latin-1 while the source sent UTF-8. These are the kinds of bugs that survive in production for weeks before someone notices.
This guide walks through the most common type conversion challenges in real-time CDC pipelines, the specific failure modes each one produces, and the defensive patterns that prevent them.
Timestamp Conversions
Timestamps are the single most error-prone data type in cross-database pipelines. The root problem is deceptively simple: different databases store time differently, and the conversion between representations is not always lossless.
TIMESTAMP vs TIMESTAMPTZ
PostgreSQL distinguishes between TIMESTAMP WITHOUT TIME ZONE (stores a bare date-time with no timezone context) and TIMESTAMP WITH TIME ZONE (converts input to UTC for storage, then converts back to the session timezone on retrieval). These look similar in a query result but behave completely differently during replication.
When a CDC pipeline reads a TIMESTAMPTZ value from the PostgreSQL WAL, it gets UTC. That is clean and unambiguous. But when it reads a TIMESTAMP WITHOUT TIME ZONE, it gets a bare value like 2026-02-25 14:30:00 with absolutely no indication of what timezone that belongs to. If the application server was in America/New_York, that value is Eastern Time. If the pipeline assumes UTC, every timestamp is now five hours off.
MySQL makes this worse. DATETIME stores a literal date-time with no timezone conversion. TIMESTAMP converts to UTC on storage and back to the session timezone on retrieval - but in the binary log used by CDC, MySQL emits TIMESTAMP columns as UTC integers and DATETIME as-is. If your pipeline does not know which column type it is reading, it cannot interpret the value correctly.
Epoch Formats and Precision
Some systems emit timestamps as Unix epochs. Debezium, the CDC engine underlying many streaming platforms including Streamkap, represents timestamps as milliseconds since epoch for TIMESTAMP types and as ISO-8601 strings or epoch days for DATE types. The precision matters: a source that stores microseconds will lose the last three digits if the pipeline truncates to milliseconds.
-- PostgreSQL: microsecond precision
SELECT now()::timestamp;
-- 2026-02-25 14:30:00.123456
-- Debezium CDC output: millisecond epoch
-- 1772029800123 (microseconds truncated)
-- Snowflake TIMESTAMP_NTZ: nanosecond precision
-- No data loss if mapped correctly, but the pipeline must preserve precision
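The truncation above can be reproduced directly. This is a minimal Python sketch (using exact integer timedelta arithmetic rather than floats, which would themselves lose precision) showing what a millisecond-granularity pipeline discards from a microsecond source value:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Microsecond-precision source value, matching the example above.
src = datetime(2026, 2, 25, 14, 30, 0, 123456, tzinfo=timezone.utc)

micros = (src - EPOCH) // timedelta(microseconds=1)  # exact integer microseconds
millis = micros // 1000                              # what a millisecond pipeline keeps

restored = EPOCH + timedelta(milliseconds=millis)
print(millis)                # 1772029800123
print(restored.microsecond)  # 123000 - the trailing 456 microseconds are gone
```

The round trip through milliseconds silently drops the last three digits; nothing errors, which is exactly why this class of bug goes unnoticed.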
Best Practice
Normalize every timestamp to UTC with explicit timezone metadata as early as possible in the pipeline. If the source is timezone-naive, document the assumed timezone and apply the conversion in the transformation layer before writing to the destination.
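As a sketch of that normalization step, the following assumes the source application ran in America/New_York (a hypothetical setting - your pipeline would read this from documented configuration) and converts timezone-naive values to UTC:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # requires tzdata on systems without a zone database

# Documented assumption about the source's timezone (hypothetical setting).
ASSUMED_SOURCE_TZ = ZoneInfo("America/New_York")

def normalize_to_utc(ts: datetime) -> datetime:
    """Attach the assumed zone to naive timestamps, then convert to UTC."""
    if ts.tzinfo is None:  # timezone-naive: apply the documented assumption
        ts = ts.replace(tzinfo=ASSUMED_SOURCE_TZ)
    return ts.astimezone(timezone.utc)

naive = datetime(2026, 2, 25, 14, 30)       # bare value, as read from the WAL
print(normalize_to_utc(naive).isoformat())  # 2026-02-25T19:30:00+00:00
```

Using a named zone rather than a fixed offset matters: America/New_York is UTC-5 in February but UTC-4 during daylight saving time, and a hardcoded offset would corrupt half the year's data.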
Numeric Precision
Financial data, scientific measurements, and aggregated metrics all depend on numeric precision. The gap between exact-decimal arithmetic and floating-point approximation is the source of some of the most expensive bugs in data engineering.
DECIMAL vs FLOAT vs INTEGER
DECIMAL(p, s) (also NUMERIC) stores exact values with a defined precision (total digits) and scale (digits after the decimal). FLOAT and DOUBLE store IEEE 754 floating-point approximations. INTEGER, BIGINT, and SMALLINT store whole numbers.
The dangerous conversion is DECIMAL to FLOAT. Consider:
-- PostgreSQL source
CREATE TABLE invoices (
amount NUMERIC(38, 10)
);
INSERT INTO invoices VALUES (1234567890.1234567890);
-- If pipeline converts to DOUBLE:
-- 1234567890.1234567165 (precision lost after the 7th decimal digit)
-- If pipeline maps to Snowflake NUMBER(38,10):
-- 1234567890.1234567890 (exact)
This is not a theoretical concern. If your pipeline intermediary format uses JSON (as many Kafka-based pipelines do), and the JSON serializer treats all numbers as IEEE 754 doubles, you lose precision on any DECIMAL with more than 15 significant digits. The fix is to serialize large decimals as strings and deserialize them as DECIMAL at the destination.
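The JSON round-trip problem can be demonstrated in a few lines. This sketch contrasts the lossy float path with the string-serialization fix described above:

```python
import json
from decimal import Decimal

amount = Decimal("1234567890.1234567890")  # a NUMERIC(38,10) source value

# Naive path: JSON numbers become IEEE 754 doubles, and precision is lost.
lossy = json.loads(json.dumps(float(amount)))
print(Decimal(repr(lossy)) == amount)  # False

# Safe path: serialize the decimal as a string, reparse as Decimal downstream.
safe = Decimal(json.loads(json.dumps(str(amount))))
print(safe == amount)                  # True
```

The string path costs a few bytes per value but guarantees the destination sees exactly the digits the source stored.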
Integer Overflow
MySQL UNSIGNED BIGINT has a maximum value of 18,446,744,073,709,551,615. PostgreSQL does not have unsigned types, so its BIGINT tops out at 9,223,372,036,854,775,807. If your pipeline moves an unsigned bigint value above the signed maximum into a system that only supports signed integers, you get either an overflow error or silent wraparound to a negative number.
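A bounds check in the transformation layer turns the silent-wraparound case into a loud failure. This is a hypothetical guard, not any particular connector's API:

```python
# Signed and unsigned 64-bit limits, matching the figures above.
INT64_MAX = 2**63 - 1    # 9,223,372,036,854,775,807
UINT64_MAX = 2**64 - 1   # 18,446,744,073,709,551,615

def check_bigint(value: int) -> int:
    """Reject unsigned values that cannot fit a signed BIGINT destination."""
    if value > INT64_MAX:
        raise OverflowError(f"{value} exceeds signed BIGINT range")
    return value

check_bigint(INT64_MAX)        # fits: passes through unchanged
# check_bigint(UINT64_MAX)     # would raise OverflowError
```

An explicit error at the pipeline boundary is far cheaper to debug than a negative ID appearing in a downstream join weeks later.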
String and Encoding
Character Encoding
Most modern databases default to UTF-8, but “UTF-8” does not always mean the same thing. MySQL’s utf8 character set is actually a 3-byte subset that cannot store characters outside the Basic Multilingual Plane - including most emoji. MySQL’s utf8mb4 is true UTF-8. If a pipeline reads from a utf8mb4 column and writes to a system that expects 3-byte UTF-8, any 4-byte character (emoji, some CJK characters, mathematical symbols) will either be truncated or cause an error.
-- Source (MySQL utf8mb4): "Hello 🌍"
-- Destination with 3-byte UTF-8: "Hello " or "Hello ???" or error
String Length: CHAR vs VARCHAR vs TEXT
CHAR(10) is fixed-width and right-padded with spaces. VARCHAR(255) has a maximum length. TEXT is effectively unlimited. When moving data from TEXT to VARCHAR(255), any value longer than 255 characters is either truncated or rejected. The pipeline must either validate lengths or use a destination type with sufficient capacity.
PostgreSQL’s VARCHAR without a length specifier is effectively unlimited, but Snowflake’s VARCHAR defaults to 16,777,216 characters. BigQuery’s STRING has no length limit. Knowing the destination constraints before you configure the pipeline prevents runtime surprises.
Boolean and Enum
Boolean Representation
PostgreSQL BOOLEAN stores true/false. MySQL’s BOOLEAN is an alias for TINYINT(1), which stores 0 or 1. Some systems use 'T'/'F' strings, 'Y'/'N', or even 'yes'/'no'. A CDC pipeline reading MySQL’s boolean column gets an integer 0 or 1 from the binlog, not a native boolean.
If your destination is Snowflake, its BOOLEAN type accepts TRUE, FALSE, 0, 1, 'true', 'false', and several other variants. BigQuery is stricter: BOOL requires actual boolean values, not integers. The pipeline transformation layer must normalize the representation.
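A normalization step for the variants listed above might look like this (a hypothetical helper, sketching the idea rather than any specific connector's behavior):

```python
# Source-specific boolean encodings: integers, single letters, and words.
TRUE_VALUES = {True, 1, "1", "t", "true", "y", "yes"}
FALSE_VALUES = {False, 0, "0", "f", "false", "n", "no"}

def normalize_bool(value):
    """Map source-specific boolean encodings to a real Python bool."""
    key = value.strip().lower() if isinstance(value, str) else value
    if key in TRUE_VALUES:
        return True
    if key in FALSE_VALUES:
        return False
    raise ValueError(f"unrecognized boolean value: {value!r}")

print(normalize_bool(1), normalize_bool("F"), normalize_bool("yes"))
# True False True
```

Raising on unrecognized inputs is deliberate: silently coercing an unexpected value (say, the string "2") would reintroduce exactly the ambiguity the normalizer exists to remove.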
Enum Types
PostgreSQL ENUM is a custom type defined at the database level. In CDC output, enum values appear as their string representation. MySQL ENUM is defined inline in the column definition and stored as an integer index internally, but Debezium emits the string value. Both map cleanly to VARCHAR or STRING at the destination, but if you are relying on the ordinal position of the enum value, that information is lost in transit.
Binary and LOB Data
BYTEA in PostgreSQL, BLOB in MySQL, and Binary in MongoDB all store raw binary data. Most streaming pipeline intermediaries (Kafka with JSON or Avro serialization) cannot represent raw binary inline. The standard approach is base64 encoding.
-- Source: 48 bytes of binary data (a SHA-384 hash)
-- Kafka JSON: "aGVsbG8gd29ybGQ..." (base64 string, ~64 characters)
-- Destination: decode base64 back to binary, or store as string
The concern here is size. Base64 encoding increases data size by approximately 33%. For large binary objects (images, documents, serialized objects), this overhead compounds across millions of rows. Some destinations, like Snowflake, support BINARY types natively and can ingest hex- or base64-encoded input, so the encoding overhead is confined to transit rather than persisting in the stored data.
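The overhead figures are easy to verify. For the 48-byte payload from the example above, base64 expands to 64 characters while hex expands to 96:

```python
import base64

# 48 bytes of binary data, e.g. a SHA-384 digest (hypothetical payload).
raw = bytes(range(48))

b64 = base64.b64encode(raw)
print(len(raw), len(b64))    # 48 bytes -> 64 characters (~33% larger)

hex_form = raw.hex()
print(len(hex_form))         # 96 characters: hex doubles the size

# The encoding is lossless either way; the round trip recovers every byte.
assert base64.b64decode(b64) == raw
assert bytes.fromhex(hex_form) == raw
```

Base64 is the more compact of the two text encodings; hex trades size for human readability and simpler debugging.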
For very large objects (LOBs exceeding a few megabytes), the best practice is to store the binary data in object storage (S3, GCS) and stream only a reference URI through the pipeline.
NULL Semantics
NULL handling is one of those areas where every database agrees on the concept but disagrees on the implementation details.
PostgreSQL and MySQL both support NULL as a distinct value meaning “unknown” or “absent.” But an empty string ('') is a different value from NULL in both systems. Oracle, famously, treats empty strings as NULL. If your pipeline moves data from Oracle to PostgreSQL, every empty string becomes NULL at the source, and there is no way to recover the original intent.
MongoDB uses null as a BSON type, but a missing field is semantically different from a field explicitly set to null. CDC output may or may not distinguish between these cases depending on the connector configuration.
When writing to destinations, NULL handling affects default values and NOT NULL constraints. If the destination column has a DEFAULT clause, some databases apply the default when they receive NULL (if the column is NOT NULL) while others reject the insert. The pipeline must decide: pass NULL through and let the destination handle it, or apply COALESCE logic in the transformation layer.
Cross-Database Type Mapping Table
The following table shows common type mappings across popular source-destination pairs:
| Source Type | PostgreSQL to Snowflake | MySQL to BigQuery | MongoDB to Snowflake |
|---|---|---|---|
| Integer / int | INTEGER -> NUMBER(38,0) | INT -> INT64 | NumberInt -> NUMBER(38,0) |
| Big Integer | BIGINT -> NUMBER(38,0) | BIGINT -> INT64 | NumberLong -> NUMBER(38,0) |
| Decimal | NUMERIC(p,s) -> NUMBER(p,s) | DECIMAL(p,s) -> NUMERIC | NumberDecimal -> NUMBER(38,18) |
| Float | REAL -> FLOAT | FLOAT -> FLOAT64 | double -> FLOAT |
| Boolean | BOOLEAN -> BOOLEAN | TINYINT(1) -> BOOL | bool -> BOOLEAN |
| Timestamp (tz) | TIMESTAMPTZ -> TIMESTAMP_TZ | TIMESTAMP -> TIMESTAMP | Date -> TIMESTAMP_NTZ |
| Timestamp (no tz) | TIMESTAMP -> TIMESTAMP_NTZ | DATETIME -> DATETIME | --- |
| String | VARCHAR -> VARCHAR | VARCHAR(n) -> STRING | string -> VARCHAR |
| Text | TEXT -> VARCHAR | TEXT -> STRING | string -> VARCHAR |
| Binary | BYTEA -> BINARY | BLOB -> BYTES | BinData -> BINARY |
| JSON | JSONB -> VARIANT | JSON -> JSON | (native) -> VARIANT |
| UUID | UUID -> VARCHAR(36) | --- | --- -> VARCHAR(36) |
| Enum | ENUM -> VARCHAR | ENUM -> STRING | --- |
| Array | ARRAY -> ARRAY | --- | array -> ARRAY |
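A mapping like the PostgreSQL-to-Snowflake column above can be encoded as a lookup that the pipeline validates against, failing loudly on anything unmapped. This is an illustrative sketch, not a complete or authoritative mapping:

```python
# PostgreSQL -> Snowflake type lookup, following the table above.
PG_TO_SNOWFLAKE = {
    "INTEGER": "NUMBER(38,0)",
    "BIGINT": "NUMBER(38,0)",
    "NUMERIC": "NUMBER",        # precision/scale carried over separately
    "REAL": "FLOAT",
    "BOOLEAN": "BOOLEAN",
    "TIMESTAMPTZ": "TIMESTAMP_TZ",
    "TIMESTAMP": "TIMESTAMP_NTZ",
    "VARCHAR": "VARCHAR",
    "TEXT": "VARCHAR",
    "BYTEA": "BINARY",
    "JSONB": "VARIANT",
    "UUID": "VARCHAR(36)",
    "ENUM": "VARCHAR",
    "ARRAY": "ARRAY",
}

def map_type(pg_type: str) -> str:
    """Resolve a destination type, refusing to guess for unknown sources."""
    try:
        return PG_TO_SNOWFLAKE[pg_type.upper()]
    except KeyError:
        raise ValueError(f"no Snowflake mapping for PostgreSQL type {pg_type}")

print(map_type("timestamptz"))  # TIMESTAMP_TZ
```

Keeping the mapping as explicit data rather than scattered conditionals makes schema reviews trivial: the whole type policy is one diff away.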
Practical Example: Mixed-Type CDC Pipeline
Consider a PostgreSQL source table with several high-risk column types:
CREATE TABLE transactions (
id UUID DEFAULT gen_random_uuid(),
created_at TIMESTAMPTZ NOT NULL,
processed_at TIMESTAMP WITHOUT TIME ZONE,
amount NUMERIC(18, 4) NOT NULL,
fee_percent DOUBLE PRECISION,
currency VARCHAR(3),
is_refund BOOLEAN DEFAULT false,
metadata JSONB,
receipt_pdf BYTEA
);
When this table is captured via CDC and streamed to Snowflake, the pipeline must handle each column differently:
- id: UUID serialized as a 36-character string. Map to VARCHAR(36).
- created_at: TIMESTAMPTZ arrives as UTC epoch milliseconds. Map to TIMESTAMP_TZ.
- processed_at: TIMESTAMP arrives as epoch milliseconds but has no timezone. Map to TIMESTAMP_NTZ and document the assumed timezone.
- amount: NUMERIC(18,4) must preserve exact decimal precision. Serialize as a string in the intermediary, map to NUMBER(18,4).
- fee_percent: DOUBLE PRECISION is already floating-point. Map to FLOAT. Acceptable for non-financial percentages.
- currency: VARCHAR(3) maps directly to VARCHAR(3).
- is_refund: BOOLEAN maps directly to BOOLEAN.
- metadata: JSONB maps to VARIANT in Snowflake with no transformation needed.
- receipt_pdf: BYTEA is base64-encoded in transit. Map to BINARY and decode, or store as VARCHAR if downstream consumers expect base64.
Streamkap handles these mappings automatically for standard CDC pipelines, preserving decimal precision and timezone semantics without manual configuration. For edge cases - like the timezone-naive processed_at column - you can add a transformation step to apply explicit UTC conversion.
Defensive Patterns
Explicit CAST at the Destination
Never rely on implicit type coercion. If the destination supports it, use explicit CAST or TRY_CAST in your loading logic:
-- Snowflake: safe cast that returns NULL instead of error
SELECT TRY_CAST(raw_value AS NUMBER(18,4)) AS amount
FROM staging_table;
COALESCE for Default Values
When NULL semantics differ between source and destination, use COALESCE to apply sensible defaults:
-- Replace NULL with empty string for non-nullable destination columns
SELECT COALESCE(customer_name, '') AS customer_name
FROM staging_table;
-- Replace NULL boolean with false
SELECT COALESCE(is_active, false) AS is_active
FROM staging_table;
Validation Before Write
Add a validation step in the pipeline that checks for type mismatches before writing to the destination. Common checks include:
- Timestamp range: Is the value between 1970-01-01 and 2100-01-01? Values outside this range often indicate epoch-format misinterpretation (seconds vs milliseconds vs microseconds).
- Numeric bounds: Does the value fit in the destination column's precision and scale? A NUMERIC(10,2) column cannot store 12345678901.23.
- String length: Is the value within the destination's VARCHAR limit?
- Encoding validity: Is the byte sequence valid UTF-8?
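The checks above can be sketched as small pure functions that run before the write. These are hypothetical validators illustrating the idea, not part of any specific pipeline framework:

```python
from datetime import datetime
from decimal import Decimal

# Plausible-range window: values outside it usually mean sec/ms/us confusion.
TS_MIN, TS_MAX = datetime(1970, 1, 1), datetime(2100, 1, 1)

def valid_timestamp_range(ts: datetime) -> bool:
    return TS_MIN <= ts < TS_MAX

def fits_numeric(value: Decimal, precision: int, scale: int) -> bool:
    # Integer part must fit (precision - scale) digits; fraction must fit scale.
    t = value.as_tuple()
    return abs(value) < Decimal(10) ** (precision - scale) and -t.exponent <= scale

def fits_varchar(s: str, limit: int) -> bool:
    return len(s) <= limit

def valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(valid_timestamp_range(datetime(2026, 2, 25)))    # True
print(fits_numeric(Decimal("12345678901.23"), 10, 2))  # False: NUMERIC(10,2) overflow
print(valid_utf8(b"Hello \xf0\x9f\x8c\x8d"))           # True: valid 4-byte sequence
```

Rows that fail a check can be routed to a dead-letter queue with the failing column named, which turns a silent corruption into an actionable alert.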
Schema-Level Documentation
Maintain a type mapping document for every pipeline. When the source schema changes (a column type is altered, a new column is added), review the mapping before the change propagates. Automated schema evolution, as discussed in our guide on schema change management, handles structural changes - but type-level semantic decisions (is this new TIMESTAMP column timezone-aware?) still benefit from human review.
Type conversion in streaming pipelines is not glamorous work, but it is foundational. Get it right once, encode the rules in your pipeline configuration, and the data arrives at the destination exactly as the source intended. Get it wrong, and you spend weeks hunting down a rounding error in last quarter’s revenue report. The investment in explicit, well-documented type mapping pays for itself the first time it prevents a silent data corruption incident.