Flattening Nested JSON in Streaming Pipelines
Learn how to flatten deeply nested JSON structures in real-time streaming pipelines. Handle arrays, nested objects, and mixed schemas for analytics-ready output.
If you have ever pulled data from MongoDB, a REST API, or an event bus, you have almost certainly encountered deeply nested JSON. A single document might contain objects within objects, arrays of sub-documents, and fields that appear in some records but not others. This structure is natural for applications, but it is a serious obstacle for analytics.
Data warehouses and BI tools expect flat, tabular rows. When you load a nested JSON document into Snowflake or BigQuery without flattening it first, analysts are forced to write verbose extraction logic in every single query. Aggregations become painful, joins become unpredictable, and dashboards break when the schema shifts. Flattening nested JSON before it reaches the warehouse solves these problems at the source, turning complex documents into clean, queryable tables.
This guide walks through the types of nesting you will encounter, the strategies for flattening them, and the practical SQL you need in both Flink and Snowflake to get the job done.
Types of Nesting in JSON Documents
Before choosing a flattening strategy, you need to understand the four common nesting patterns.
Nested Objects
The simplest form. A field contains another JSON object with its own key-value pairs.
{
  "order_id": "ORD-1001",
  "customer": {
    "name": "Alice Martin",
    "email": "alice@example.com"
  }
}
Here, customer is a nested object. Flattening promotes customer.name and customer.email to top-level columns like customer_name and customer_email.
Arrays of Objects
A field contains a list of sub-documents, each with its own structure.
{
  "order_id": "ORD-1001",
  "line_items": [
    { "sku": "WIDGET-A", "qty": 2, "price": 15.00 },
    { "sku": "WIDGET-B", "qty": 1, "price": 30.00 }
  ]
}
This is the most impactful nesting pattern because flattening it multiplies rows. One order with three line items becomes three rows in the output.
Arrays of Primitives
A field contains a simple list of values rather than objects.
{
  "order_id": "ORD-1001",
  "tags": ["priority", "express", "fragile"]
}
These are simpler to handle but still require a decision: explode into rows, or concatenate into a single string column.
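Both options are easy to see in a small Python sketch (the `order_id` value is taken from the example above; plain dicts stand in for output rows):

```python
tags = ["priority", "express", "fragile"]

# Option 1: concatenate into a single string column
tags_joined = ",".join(tags)

# Option 2: explode into one row per tag, duplicating the parent key
tag_rows = [{"order_id": "ORD-1001", "tag": t} for t in tags]
```

Option 1 keeps one row per order; option 2 triples the row count for this document.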
Mixed and Polymorphic Schemas
The most difficult case. A field’s structure varies between records. One document has shipping as an object with address and carrier fields; another has shipping as a plain string like "standard". This is common in event-driven systems where the schema evolves over time or differs by event type.
Flattening Strategies
Dot-Notation Promotion
The most common approach. Walk the nested structure and promote each leaf field to a top-level column, using underscores or dots to preserve the path.
customer.address.city becomes customer_address_city.
This works well for nested objects with a fixed schema. It is deterministic, produces a predictable number of columns, and does not change the row count.
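The promotion walk is a short recursive function. A minimal Python sketch (the underscore separator and naming scheme are conventions, not a fixed standard):

```python
def flatten(obj, prefix="", sep="_"):
    """Promote every leaf field of a nested dict to a top-level key,
    joining path segments with `sep` to preserve the hierarchy."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name, sep))  # recurse into nested objects
        else:
            out[name] = value  # leaf value: promote to top level
    return out
```

For example, `flatten({"customer": {"address": {"city": "Portland"}}})` yields `{"customer_address_city": "Portland"}`, and the row count is unchanged.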
Array Explosion (UNNEST / LATERAL FLATTEN)
For array fields, you create one output row per array element. The parent fields are duplicated across each row. This is sometimes called “unnesting” or “exploding” the array.
A single order with 3 line items produces 3 output rows, each carrying the same order_id, customer_name, and order_date, but with different line_item_sku, line_item_qty, and line_item_price values.
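The same idea as a small Python sketch (the `line_item_` prefix on exploded fields follows the naming used above and is an assumption, not a requirement):

```python
def explode(doc, array_field, prefix="line_item_"):
    """One output row per array element; parent fields are copied onto each row."""
    parent = {k: v for k, v in doc.items() if k != array_field}
    return [
        {**parent, **{prefix + k: v for k, v in elem.items()}}
        for elem in doc.get(array_field, [])
    ]
```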
Hybrid Approach
In practice, most documents need both strategies. You flatten nested objects with dot-notation promotion and explode arrays into rows. For documents with multiple independent arrays, you typically explode one array per output table to avoid a Cartesian product.
VARIANT / JSON Column Storage
Sometimes, flattening is not the right answer. If the nested structure is queried infrequently or its schema is highly variable, storing the raw JSON in a VARIANT column (Snowflake) or JSON column (BigQuery, PostgreSQL) and using schema-on-read queries is a pragmatic alternative.
Flink SQL Examples
Apache Flink is a natural fit for flattening JSON in streaming pipelines because it processes records one at a time as they arrive.
Extracting Nested Fields with JSON_VALUE
SELECT
  JSON_VALUE(raw_json, '$.order_id') AS order_id,
  JSON_VALUE(raw_json, '$.customer.name') AS customer_name,
  JSON_VALUE(raw_json, '$.customer.email') AS customer_email,
  JSON_VALUE(raw_json, '$.shipping.address.city') AS shipping_city
FROM raw_orders;
JSON_VALUE extracts a scalar value from a JSON string. For nested objects, you use standard JSON path notation.
Exploding Arrays with UNNEST
SELECT
  o.order_id,
  o.customer_name,
  li.sku,
  li.qty,
  li.price
FROM orders AS o
CROSS JOIN UNNEST(o.line_items) AS li(sku, qty, price);
UNNEST expands an array column into rows. Combined with CROSS JOIN, each array element joins back to its parent row. This is how you explode line items, tags, or any repeated structure.
Handling Nulls in Flink
SELECT
  JSON_VALUE(raw_json, '$.order_id') AS order_id,
  IFNULL(
    JSON_VALUE(raw_json, '$.customer.name'),
    'Unknown'
  ) AS customer_name
FROM raw_orders;
IFNULL provides a default when a nested path resolves to null, preventing downstream failures.
Snowflake Examples
Snowflake’s VARIANT type and LATERAL FLATTEN function make it one of the most capable warehouses for working with nested JSON.
Parsing and Querying with Dot Notation
SELECT
  raw:order_id::STRING AS order_id,
  raw:customer.name::STRING AS customer_name,
  raw:customer.email::STRING AS customer_email,
  raw:shipping.address.city::STRING AS shipping_city
FROM raw_orders;
Once JSON is loaded into a VARIANT column (raw), Snowflake supports dot notation to navigate the hierarchy. The ::STRING cast converts the extracted value to a typed column.
Exploding Arrays with LATERAL FLATTEN
SELECT
  raw:order_id::STRING AS order_id,
  raw:customer.name::STRING AS customer_name,
  f.value:sku::STRING AS line_item_sku,
  f.value:qty::INT AS line_item_qty,
  f.value:price::FLOAT AS line_item_price
FROM raw_orders,
  LATERAL FLATTEN(input => raw:line_items) f;
LATERAL FLATTEN is Snowflake’s equivalent of UNNEST. Each element in the line_items array becomes a row, accessible through f.value.
Preserving Rows for Empty or Null Arrays
SELECT
  raw:order_id::STRING AS order_id,
  f.value:sku::STRING AS line_item_sku
FROM raw_orders,
  LATERAL FLATTEN(input => raw:line_items, OUTER => TRUE) f;
Adding OUTER => TRUE ensures that orders with no line items still appear in the output with NULL values for the array fields, similar to a LEFT JOIN.
Handling Edge Cases
Null Nested Objects
A record where the entire nested object is null (e.g., "shipping": null) will cause dot-notation extraction to return null for all child fields. This is usually safe, but if your downstream logic does not expect nulls in those columns, wrap extractions with COALESCE or IFNULL.
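Outside SQL, the same guard is a small path-lookup helper. A Python sketch (the default value passed by the caller is an assumption):

```python
def get_path(doc, path, default=None):
    """Walk a dotted path through nested dicts, returning `default`
    whenever any intermediate object is null or missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or current.get(key) is None:
            return default
        current = current[key]
    return current
```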
Empty Arrays
An empty array ("line_items": []) produces zero rows after explosion. If you need the parent row to survive, use OUTER => TRUE in Snowflake or a LEFT JOIN UNNEST in Flink.
Deeply Nested Structures (3+ Levels)
For documents with three or more levels of nesting, chain your extractions. In Snowflake: raw:order.shipping.address.zip::STRING. In Flink: JSON_VALUE(raw_json, '$.order.shipping.address.zip'). Avoid flattening every level into column names like order_shipping_address_zip_code unless the structure is truly stable. For volatile deep nesting, consider keeping the sub-document as a VARIANT/JSON column.
Polymorphic Types
When a field can be either a string or an object depending on the record, use conditional logic. In Snowflake:
SELECT
  CASE
    WHEN TYPEOF(raw:shipping) = 'VARCHAR'
      THEN raw:shipping::STRING
    ELSE raw:shipping.carrier::STRING
  END AS shipping_carrier
FROM raw_orders;
This prevents type-mismatch errors on records where the schema differs.
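The Python equivalent of the TYPEOF branch is an isinstance check:

```python
def shipping_carrier(doc):
    """Handle a `shipping` field that may be a plain string or an object."""
    shipping = doc.get("shipping")
    if isinstance(shipping, str):    # e.g. "shipping": "standard"
        return shipping
    if isinstance(shipping, dict):   # e.g. "shipping": {"carrier": "UPS", ...}
        return shipping.get("carrier")
    return None                      # missing or null
```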
Schema Inference and Management
Schema-on-Write vs Schema-on-Read
Schema-on-write means you define the flat target schema before loading, and the pipeline enforces it. Missing fields get defaults; extra fields are dropped or routed to an overflow column. This is the safer approach for production analytics tables.
Schema-on-read means you load the raw JSON into a VARIANT column and extract fields at query time. This is faster to set up but pushes complexity to every analyst who writes a query.
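A schema-on-write enforcement step can be sketched in a few lines of Python (the target columns and defaults here are hypothetical):

```python
# Hypothetical target schema: column name -> default for missing fields
TARGET_SCHEMA = {"order_id": None, "customer_name": "Unknown", "qty": 0}

def enforce_schema(record):
    """Coerce a flattened record to the target schema: missing fields get
    defaults, extra fields are routed to an overflow dict."""
    row = {col: record.get(col, default) for col, default in TARGET_SCHEMA.items()}
    overflow = {k: v for k, v in record.items() if k not in TARGET_SCHEMA}
    return row, overflow
```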
Auto-Detecting Schema from JSON Samples
Tools like Flink’s JSON format and Snowflake’s INFER_SCHEMA function can automatically detect the structure of JSON documents. However, auto-detection based on a small sample can miss fields that only appear in rare records. Always validate inferred schemas against a representative dataset before deploying to production.
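The sampling pitfall is easy to demonstrate: naive inference over the first few documents misses fields that only rare records carry. A toy sketch (the field names are invented for illustration):

```python
def infer_columns(docs):
    """Naive inference: the union of top-level keys seen in `docs`."""
    columns = set()
    for doc in docs:
        columns |= set(doc.keys())
    return sorted(columns)

stream = [{"order_id": "A"}, {"order_id": "B"},
          {"order_id": "C", "gift_note": "Happy birthday"}]  # rare field

sample_schema = infer_columns(stream[:2])  # small sample misses gift_note
full_schema = infer_columns(stream)
```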
Performance Considerations
Flattening adds processing overhead, but the magnitude depends on the strategy.
Dot-notation promotion is cheap. Extracting scalar fields from a JSON document is a fast operation in both Flink and Snowflake. The row count stays the same, and the output is typically smaller than the input because you discard unneeded fields.
Array explosion can be expensive. If an average document has 10 array elements, your output table has 10x the row count of the input. This multiplies storage, increases query scan times, and can cause significant backpressure in streaming pipelines. Before exploding, consider whether downstream queries actually need per-element rows or whether an aggregated representation (e.g., total_items, total_price) would suffice.
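When per-element rows are not needed, aggregating keeps the row count flat. A sketch using the line-item fields from the examples above:

```python
def summarize_line_items(order):
    """Collapse the line_items array into per-order totals instead of
    exploding it into one row per element."""
    items = order.get("line_items") or []
    return {
        "order_id": order["order_id"],
        "total_items": sum(i["qty"] for i in items),
        "total_price": sum(i["qty"] * i["price"] for i in items),
    }
```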
Streaming overhead: In a streaming pipeline, flattening happens on every record as it arrives. For Flink jobs, ensure that the JSON parsing and field extraction do not become a bottleneck by monitoring operator backpressure. Streamkap handles this automatically for CDC pipelines from sources like MongoDB and DynamoDB, applying flattening transformations as part of the pipeline configuration without requiring you to write and maintain Flink jobs.
Practical Example: MongoDB Orders to Flat Snowflake Table
Consider a MongoDB orders collection where each document looks like this:
{
  "_id": "65a1b2c3d4e5f6a7b8c9d0e1",
  "order_id": "ORD-1001",
  "created_at": "2026-02-20T14:30:00Z",
  "customer": {
    "name": "Alice Martin",
    "email": "alice@example.com",
    "tier": "premium"
  },
  "line_items": [
    { "sku": "WIDGET-A", "qty": 2, "unit_price": 15.00 },
    { "sku": "WIDGET-B", "qty": 1, "unit_price": 30.00 }
  ],
  "shipping": {
    "address": { "city": "Portland", "state": "OR", "zip": "97201" },
    "carrier": "UPS",
    "tracking": "1Z999AA10123456784"
  }
}
The target Snowflake table fact_order_lines should look like:
| order_id | created_at | customer_name | customer_tier | sku | qty | unit_price | shipping_city | shipping_carrier |
|---|---|---|---|---|---|---|---|---|
| ORD-1001 | 2026-02-20 14:30:00 | Alice Martin | premium | WIDGET-A | 2 | 15.00 | Portland | UPS |
| ORD-1001 | 2026-02-20 14:30:00 | Alice Martin | premium | WIDGET-B | 1 | 30.00 | Portland | UPS |
The Snowflake SQL to produce this:
SELECT
  raw:order_id::STRING AS order_id,
  raw:created_at::TIMESTAMP AS created_at,
  raw:customer.name::STRING AS customer_name,
  raw:customer.tier::STRING AS customer_tier,
  f.value:sku::STRING AS sku,
  f.value:qty::INT AS qty,
  f.value:unit_price::FLOAT AS unit_price,
  raw:shipping.address.city::STRING AS shipping_city,
  raw:shipping.carrier::STRING AS shipping_carrier
FROM raw_orders,
  LATERAL FLATTEN(input => raw:line_items, OUTER => TRUE) f;
This combines dot-notation promotion for the customer and shipping objects with array explosion for line_items, producing one row per line item with all parent context preserved.
When NOT to Flatten
Flattening is not always the right choice. Consider keeping JSON in a VARIANT or JSON column when:
- The schema is highly variable and new fields appear frequently. Flattening would require constant schema migrations.
- The nested data is only queried occasionally. If only 5% of queries touch the nested structure, the cost of maintaining a flattened table may not be justified.
- The array cardinality is very high. An event with 1,000 nested elements would produce 1,000 rows after explosion. Storing the raw array and querying it with ARRAY_AGG or ARRAY_SIZE is often more practical.
- You need the original document for replay or debugging. Always keep a raw copy alongside your flattened tables so you can re-derive the flat representation if your flattening logic changes.
The best architectures often combine both approaches: a raw VARIANT table for flexibility and auditing, plus materialized flattened views for the queries that need performance and simplicity. The key is to flatten deliberately, understanding the trade-offs in row multiplication, schema rigidity, and query complexity.