Flattening Nested JSON in Streaming Pipelines
Learn how to flatten deeply nested JSON structures in real-time streaming pipelines. Handle arrays, nested objects, and mixed schemas for analytics-ready output.
If you have ever pulled data from MongoDB, a REST API, or an event bus, you have almost certainly encountered deeply nested JSON. A single document might contain objects within objects, arrays of sub-documents, and fields that appear in some records but not others. This structure is natural for applications, but it is a serious obstacle for analytics.
Data warehouses and BI tools expect flat, tabular rows. When you load a nested JSON document into Snowflake or BigQuery without flattening it first, analysts are forced to write verbose extraction logic in every single query. Aggregations become painful, joins become unpredictable, and dashboards break when the schema shifts. Flattening nested JSON before it reaches the warehouse solves these problems at the source, turning complex documents into clean, queryable tables.
This guide walks through the types of nesting you will encounter, the strategies for flattening them, and the practical SQL you need in both Flink and Snowflake to get the job done.
Types of Nesting in JSON Documents
Before choosing a flattening strategy, you need to understand the four common nesting patterns.
Nested Objects
The simplest form. A field contains another JSON object with its own key-value pairs.
{
  "order_id": "ORD-1001",
  "customer": {
    "name": "Alice Martin",
    "email": "alice@example.com"
  }
}
Here, customer is a nested object. Flattening promotes customer.name and customer.email to top-level columns like customer_name and customer_email.
Arrays of Objects
A field contains a list of sub-documents, each with its own structure.
{
  "order_id": "ORD-1001",
  "line_items": [
    { "sku": "WIDGET-A", "qty": 2, "price": 15.00 },
    { "sku": "WIDGET-B", "qty": 1, "price": 30.00 }
  ]
}
This is the most impactful nesting pattern because flattening it multiplies rows. One order with three line items becomes three rows in the output.
Arrays of Primitives
A field contains a simple list of values rather than objects.
{
  "order_id": "ORD-1001",
  "tags": ["priority", "express", "fragile"]
}
These are simpler to handle but still require a decision: explode into rows, or concatenate into a single string column.
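Both options are easy to see in a small Python sketch (the `order_id` value is taken from the example above; plain dicts stand in for output rows):

```python
tags = ["priority", "express", "fragile"]

# Option 1: concatenate into a single string column
tags_joined = ",".join(tags)

# Option 2: explode into one row per tag, duplicating the parent key
tag_rows = [{"order_id": "ORD-1001", "tag": t} for t in tags]
```

Option 1 keeps one row per order; option 2 triples the row count for this document.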
Mixed and Polymorphic Schemas
The most difficult case. A field’s structure varies between records. One document has shipping as an object with address and carrier fields; another has shipping as a plain string like "standard". This is common in event-driven systems where the schema evolves over time or differs by event type.
Flattening Strategies
Dot-Notation Promotion
The most common approach. Walk the nested structure and promote each leaf field to a top-level column, using underscores or dots to preserve the path.
customer.address.city becomes customer_address_city.
This works well for nested objects with a fixed schema. It is deterministic, produces a predictable number of columns, and does not change the row count.
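The promotion walk is a short recursive function. A minimal Python sketch (the underscore separator and naming scheme are conventions, not a fixed standard):

```python
def flatten(obj, prefix="", sep="_"):
    """Promote every leaf field of a nested dict to a top-level key,
    joining path segments with `sep` to preserve the hierarchy."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name, sep))  # recurse into nested objects
        else:
            out[name] = value  # leaf value: promote to top level
    return out
```

For example, `flatten({"customer": {"address": {"city": "Portland"}}})` yields `{"customer_address_city": "Portland"}`, and the row count is unchanged.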
Array Explosion (UNNEST / LATERAL FLATTEN)
For array fields, you create one output row per array element. The parent fields are duplicated across each row. This is sometimes called “unnesting” or “exploding” the array.
A single order with 3 line items produces 3 output rows, each carrying the same order_id, customer_name, and order_date, but with different line_item_sku, line_item_qty, and line_item_price values.
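The same idea as a small Python sketch (the `line_item_` prefix on exploded fields follows the naming used above and is an assumption, not a requirement):

```python
def explode(doc, array_field, prefix="line_item_"):
    """One output row per array element; parent fields are copied onto each row."""
    parent = {k: v for k, v in doc.items() if k != array_field}
    return [
        {**parent, **{prefix + k: v for k, v in elem.items()}}
        for elem in doc.get(array_field, [])
    ]
```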
Hybrid Approach
In practice, most documents need both strategies. You flatten nested objects with dot-notation promotion and explode arrays into rows. For documents with multiple independent arrays, you typically explode one array per output table to avoid a Cartesian product.
VARIANT / JSON Column Storage
Sometimes, flattening is not the right answer. If the nested structure is queried infrequently or its schema is highly variable, storing the raw JSON in a VARIANT column (Snowflake) or JSON column (BigQuery, PostgreSQL) and using schema-on-read queries is a pragmatic alternative.
Flink SQL Examples
Apache Flink is a natural fit for flattening JSON in streaming pipelines because it processes records one at a time as they arrive.
Extracting Nested Fields with JSON_VALUE
SELECT
  JSON_VALUE(raw_json, '$.order_id') AS order_id,
  JSON_VALUE(raw_json, '$.customer.name') AS customer_name,
  JSON_VALUE(raw_json, '$.customer.email') AS customer_email,
  JSON_VALUE(raw_json, '$.shipping.address.city') AS shipping_city
FROM raw_orders;
JSON_VALUE extracts a scalar value from a JSON string. For nested objects, you use standard JSON path notation.
Exploding Arrays with UNNEST
SELECT
  o.order_id,
  o.customer_name,
  li.sku,
  li.qty,
  li.price
FROM orders AS o
CROSS JOIN UNNEST(o.line_items) AS li(sku, qty, price);
UNNEST expands an array column into rows. Combined with CROSS JOIN, each array element joins back to its parent row. This is how you explode line items, tags, or any repeated structure.
Handling Nulls in Flink
SELECT
  JSON_VALUE(raw_json, '$.order_id') AS order_id,
  IFNULL(
    JSON_VALUE(raw_json, '$.customer.name'),
    'Unknown'
  ) AS customer_name
FROM raw_orders;
IFNULL provides a default when a nested path resolves to null, preventing downstream failures.
Snowflake Examples
Snowflake’s VARIANT type and LATERAL FLATTEN function make it one of the most capable warehouses for working with nested JSON.
Parsing and Querying with Dot Notation
SELECT
  raw:order_id::STRING AS order_id,
  raw:customer.name::STRING AS customer_name,
  raw:customer.email::STRING AS customer_email,
  raw:shipping.address.city::STRING AS shipping_city
FROM raw_orders;
Once JSON is loaded into a VARIANT column (raw), Snowflake supports dot notation to navigate the hierarchy. The ::STRING cast converts the extracted value to a typed column.
Exploding Arrays with LATERAL FLATTEN
SELECT
  raw:order_id::STRING AS order_id,
  raw:customer.name::STRING AS customer_name,
  f.value:sku::STRING AS line_item_sku,
  f.value:qty::INT AS line_item_qty,
  f.value:price::FLOAT AS line_item_price
FROM raw_orders,
  LATERAL FLATTEN(input => raw:line_items) f;
LATERAL FLATTEN is Snowflake’s equivalent of UNNEST. Each element in the line_items array becomes a row, accessible through f.value.
Preserving Rows for Empty or Null Arrays
SELECT
  raw:order_id::STRING AS order_id,
  f.value:sku::STRING AS line_item_sku
FROM raw_orders,
  LATERAL FLATTEN(input => raw:line_items, OUTER => TRUE) f;
Adding OUTER => TRUE ensures that orders with no line items still appear in the output with NULL values for the array fields, similar to a LEFT JOIN.
Handling Edge Cases
Null Nested Objects
A record where the entire nested object is null (e.g., "shipping": null) will cause dot-notation extraction to return null for all child fields. This is usually safe, but if your downstream logic does not expect nulls in those columns, wrap extractions with COALESCE or IFNULL.
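Outside SQL, the same guard is a small path-lookup helper. A Python sketch (the default value passed by the caller is an assumption):

```python
def get_path(doc, path, default=None):
    """Walk a dotted path through nested dicts, returning `default`
    whenever any intermediate object is null or missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or current.get(key) is None:
            return default
        current = current[key]
    return current
```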
Empty Arrays
An empty array ("line_items": []) produces zero rows after explosion. If you need the parent row to survive, use OUTER => TRUE in Snowflake or a LEFT JOIN UNNEST in Flink.
Deeply Nested Structures (3+ Levels)
For documents with three or more levels of nesting, chain your extractions. In Snowflake: raw:order.shipping.address.zip::STRING. In Flink: JSON_VALUE(raw_json, '$.order.shipping.address.zip'). Avoid flattening every level into column names like order_shipping_address_zip_code unless the structure is truly stable. For volatile deep nesting, consider keeping the sub-document as a VARIANT/JSON column.
Polymorphic Types
When a field can be either a string or an object depending on the record, use conditional logic. In Snowflake:
SELECT
  CASE
    WHEN TYPEOF(raw:shipping) = 'VARCHAR'
      THEN raw:shipping::STRING
    ELSE raw:shipping.carrier::STRING
  END AS shipping_carrier
FROM raw_orders;
This prevents type-mismatch errors on records where the schema differs.
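The Python equivalent of the TYPEOF branch is an isinstance check:

```python
def shipping_carrier(doc):
    """Handle a `shipping` field that may be a plain string or an object."""
    shipping = doc.get("shipping")
    if isinstance(shipping, str):    # e.g. "shipping": "standard"
        return shipping
    if isinstance(shipping, dict):   # e.g. "shipping": {"carrier": "UPS", ...}
        return shipping.get("carrier")
    return None                      # missing or null
```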
Schema Inference and Management
Schema-on-Write vs Schema-on-Read
Schema-on-write means you define the flat target schema before loading, and the pipeline enforces it. Missing fields get defaults; extra fields are dropped or routed to an overflow column. This is the safer approach for production analytics tables.
Schema-on-read means you load the raw JSON into a VARIANT column and extract fields at query time. This is faster to set up but pushes complexity to every analyst who writes a query.
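A schema-on-write enforcement step can be sketched in a few lines of Python (the target columns and defaults here are hypothetical):

```python
# Hypothetical target schema: column name -> default for missing fields
TARGET_SCHEMA = {"order_id": None, "customer_name": "Unknown", "qty": 0}

def enforce_schema(record):
    """Coerce a flattened record to the target schema: missing fields get
    defaults, extra fields are routed to an overflow dict."""
    row = {col: record.get(col, default) for col, default in TARGET_SCHEMA.items()}
    overflow = {k: v for k, v in record.items() if k not in TARGET_SCHEMA}
    return row, overflow
```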
Auto-Detecting Schema from JSON Samples
Tools like Flink’s JSON format and Snowflake’s INFER_SCHEMA function can automatically detect the structure of JSON documents. However, auto-detection based on a small sample can miss fields that only appear in rare records. Always validate inferred schemas against a representative dataset before deploying to production.
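The sampling pitfall is easy to demonstrate: naive inference over the first few documents misses fields that only rare records carry. A toy sketch (the field names are invented for illustration):

```python
def infer_columns(docs):
    """Naive inference: the union of top-level keys seen in `docs`."""
    columns = set()
    for doc in docs:
        columns |= set(doc.keys())
    return sorted(columns)

stream = [{"order_id": "A"}, {"order_id": "B"},
          {"order_id": "C", "gift_note": "Happy birthday"}]  # rare field

sample_schema = infer_columns(stream[:2])  # small sample misses gift_note
full_schema = infer_columns(stream)
```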
Performance Considerations
Flattening adds processing overhead, but the magnitude depends on the strategy.
Dot-notation promotion is cheap. Extracting scalar fields from a JSON document is a fast operation in both Flink and Snowflake. The row count stays the same, and the output is typically smaller than the input because you discard unneeded fields.
Array explosion can be expensive. If an average document has 10 array elements, your output table has 10x the row count of the input. This multiplies storage, increases query scan times, and can cause significant backpressure in streaming pipelines. Before exploding, consider whether downstream queries actually need per-element rows or whether an aggregated representation (e.g., total_items, total_price) would suffice.
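When per-element rows are not needed, aggregating keeps the row count flat. A sketch using the line-item fields from the examples above:

```python
def summarize_line_items(order):
    """Collapse the line_items array into per-order totals instead of
    exploding it into one row per element."""
    items = order.get("line_items") or []
    return {
        "order_id": order["order_id"],
        "total_items": sum(i["qty"] for i in items),
        "total_price": sum(i["qty"] * i["price"] for i in items),
    }
```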
Streaming overhead: In a streaming pipeline, flattening happens on every record as it arrives. For Flink jobs, ensure that the JSON parsing and field extraction do not become a bottleneck by monitoring operator backpressure. Streamkap handles this automatically for CDC pipelines from sources like MongoDB and DynamoDB, applying flattening transformations as part of the pipeline configuration without requiring you to write and maintain Flink jobs.
Practical Example: MongoDB Orders to Flat Snowflake Table
Consider a MongoDB orders collection where each document looks like this:
{
  "_id": "65a1b2c3d4e5f6a7b8c9d0e1",
  "order_id": "ORD-1001",
  "created_at": "2026-02-20T14:30:00Z",
  "customer": {
    "name": "Alice Martin",
    "email": "alice@example.com",
    "tier": "premium"
  },
  "line_items": [
    { "sku": "WIDGET-A", "qty": 2, "unit_price": 15.00 },
    { "sku": "WIDGET-B", "qty": 1, "unit_price": 30.00 }
  ],
  "shipping": {
    "address": { "city": "Portland", "state": "OR", "zip": "97201" },
    "carrier": "UPS",
    "tracking": "1Z999AA10123456784"
  }
}
The target Snowflake table fact_order_lines should look like:
| order_id | created_at | customer_name | customer_tier | sku | qty | unit_price | shipping_city | shipping_carrier |
|---|---|---|---|---|---|---|---|---|
| ORD-1001 | 2026-02-20 14:30:00 | Alice Martin | premium | WIDGET-A | 2 | 15.00 | Portland | UPS |
| ORD-1001 | 2026-02-20 14:30:00 | Alice Martin | premium | WIDGET-B | 1 | 30.00 | Portland | UPS |
The Snowflake SQL to produce this:
SELECT
  raw:order_id::STRING AS order_id,
  raw:created_at::TIMESTAMP AS created_at,
  raw:customer.name::STRING AS customer_name,
  raw:customer.tier::STRING AS customer_tier,
  f.value:sku::STRING AS sku,
  f.value:qty::INT AS qty,
  f.value:unit_price::FLOAT AS unit_price,
  raw:shipping.address.city::STRING AS shipping_city,
  raw:shipping.carrier::STRING AS shipping_carrier
FROM raw_orders,
  LATERAL FLATTEN(input => raw:line_items, OUTER => TRUE) f;
This combines dot-notation promotion for the customer and shipping objects with array explosion for line_items, producing one row per line item with all parent context preserved.
When NOT to Flatten
Flattening is not always the right choice. Consider keeping JSON in a VARIANT or JSON column when:
- The schema is highly variable and new fields appear frequently. Flattening would require constant schema migrations.
- The nested data is only queried occasionally. If only 5% of queries touch the nested structure, the cost of maintaining a flattened table may not be justified.
- The array cardinality is very high. An event with 1,000 nested elements would produce 1,000 rows after explosion. Storing the raw array and querying it with ARRAY_AGG or ARRAY_SIZE is often more practical.
- You need the original document for replay or debugging. Always keep a raw copy alongside your flattened tables so you can re-derive the flat representation if your flattening logic changes.
The best architectures often combine both approaches: a raw VARIANT table for flexibility and auditing, plus materialized flattened views for the queries that need performance and simplicity. The key is to flatten deliberately, understanding the trade-offs in row multiplication, schema rigidity, and query complexity.