
Engineering

February 25, 2026

10 min read

DynamoDB to Snowflake: Syncing NoSQL to Your Data Warehouse

Stream DynamoDB changes to Snowflake in real time using DynamoDB Streams and CDC. Learn how to handle document flattening, schema mapping, and keep analytics fresh.

TL;DR:

  • DynamoDB Streams captures every item-level change (INSERT, MODIFY, REMOVE) and can stream them to Snowflake for real-time analytics on NoSQL data.
  • The key challenge is mapping DynamoDB's schema-less, nested items to Snowflake's relational tables, which requires flattening, type conversion, and handling of the DynamoDB JSON format.
  • Managed CDC platforms abstract away DynamoDB Streams consumption, shard management, and Snowflake loading into a simple configuration.

DynamoDB excels as a low-latency key-value store for transactional workloads, but its query model is deliberately limited. You can look up items by partition key and sort key, run scans with filter expressions, and that is about it. The moment your team needs to join orders with customer profiles, calculate rolling revenue windows, or feed a machine learning feature store, you need that data somewhere else. Snowflake is a natural destination: it handles semi-structured data natively, scales compute independently from storage, and speaks the SQL your analysts already know.

The challenge is getting data from a schema-less NoSQL store into a columnar warehouse without losing fidelity, missing updates, or building a fragile pipeline you dread maintaining. This guide walks through the architecture, the type-system quirks, and the operational decisions you will face when streaming DynamoDB changes to Snowflake.

How DynamoDB Streams Works

DynamoDB Streams is a change data capture mechanism built into DynamoDB. When enabled on a table, every item-level write (put, update, delete) produces a stream record that is durably stored in a time-ordered sequence.

Stream View Types

You configure streams with one of four view types:

View Type           Record Contains
KEYS_ONLY           Partition key and sort key of the modified item
NEW_IMAGE           The entire item as it appears after the modification
OLD_IMAGE           The entire item as it appeared before the modification
NEW_AND_OLD_IMAGES  Both the before and after images of the item

For CDC to Snowflake, always choose NEW_AND_OLD_IMAGES. This gives you the full before-and-after state needed to reconstruct updates and compute diffs downstream.

Shard Model and Retention

Stream records are organized into shards, which are similar to Kafka partitions. Each shard has a fixed capacity and lifespan. As your table’s throughput changes, DynamoDB automatically splits and merges shards. Critically, stream records are retained for only 24 hours. If your consumer falls behind by more than a day, those records are gone. This constraint makes reliable, always-on consumption essential.

Table Write → Stream Record → Shard → Consumer (must read within 24h)

Each shard provides an iterator that your consumer advances through. When a shard closes (due to a split or merge), you must discover the child shards and continue reading from them. Managing this shard lineage is one of the more tedious parts of building a DynamoDB Streams consumer from scratch.
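For illustration, here is the kind of bookkeeping a from-scratch consumer takes on once shards split or merge: ordering shards so that a parent is fully read before its children. This is a minimal Python sketch assuming shard metadata shaped like the Shards list returned by DescribeStream (a ShardId plus an optional ParentShardId); a real consumer would also track per-shard checkpoints and periodically re-discover new shards.

```python
from collections import defaultdict

def order_shards_for_reading(shards):
    """Order shards so every parent is read before its children.

    `shards` is a list of dicts shaped like DescribeStream's
    StreamDescription.Shards entries: each has a "ShardId" and,
    after a split or merge, a "ParentShardId".
    """
    children = defaultdict(list)
    roots = []
    known_ids = {s["ShardId"] for s in shards}
    for s in shards:
        parent = s.get("ParentShardId")
        if parent in known_ids:
            children[parent].append(s)
        else:
            # Parent already trimmed (past the 24h retention) or absent.
            roots.append(s)
    # Depth-first walk: emit each shard before any of its children.
    ordered, stack = [], list(reversed(roots))
    while stack:
        shard = stack.pop()
        ordered.append(shard["ShardId"])
        stack.extend(reversed(children[shard["ShardId"]]))
    return ordered
```

In practice this ordering is what guarantees per-key event ordering survives a shard split: records for a given partition key land in the parent shard before its children.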

Architecture Options

There are three common patterns for moving DynamoDB data into Snowflake.

Option 1: Streams to Lambda to S3 to Snowpipe

This is the most widely documented approach. A Lambda function is triggered by DynamoDB Streams, transforms each record into a JSON or Parquet file, writes it to S3, and Snowpipe picks it up.

DynamoDB Streams → Lambda → S3 (staging) → Snowpipe → Snowflake

Pros: Serverless, no infrastructure to manage, fine-grained control over transformation logic.

Cons: Lambda concurrency limits, cold starts, error handling complexity, S3 file fragmentation (many small files hurt Snowflake query performance), and you own every line of the transformation code.

Option 2: Streams to Kinesis Data Streams to Firehose to S3

DynamoDB can replicate its stream into a Kinesis Data Stream, which then feeds Kinesis Data Firehose for buffered delivery to S3.

DynamoDB Streams → Kinesis Data Streams → Firehose → S3 → Snowpipe → Snowflake

Pros: Firehose handles buffering and batching (reducing small-file problems), Kinesis provides longer retention than DynamoDB Streams alone.

Cons: Additional AWS services to configure and pay for, Firehose adds latency through its buffering window (typically at least 60 seconds), and you still need transformation logic for the DynamoDB JSON format.

Option 3: Managed CDC Platform

A managed CDC platform like Streamkap connects directly to DynamoDB Streams as a source and writes to Snowflake as a destination. The platform handles shard management, checkpointing, type conversion, and Snowflake loading internally.

DynamoDB Streams → Streamkap → Snowflake

Pros: No Lambda functions, no S3 staging buckets, no Snowpipe configuration, automatic schema mapping and type conversion. Cons: Adds a third-party dependency.

The DynamoDB Type System

DynamoDB does not store plain JSON. It uses a typed attribute format where every value is wrapped in a type descriptor. This is the single biggest source of friction when loading DynamoDB data into a relational system.

Here are the DynamoDB type descriptors and their Snowflake equivalents:

DynamoDB Type  Descriptor  Example                          Snowflake Type
String         S           {"S": "hello"}                   VARCHAR
Number         N           {"N": "42.5"}                    NUMBER or FLOAT
Binary         B           {"B": "dGhpcw=="}                BINARY
Boolean        BOOL        {"BOOL": true}                   BOOLEAN
Null           NULL        {"NULL": true}                   NULL
List           L           {"L": [{"S": "a"}, {"N": "1"}]}  VARIANT (array)
Map            M           {"M": {"key": {"S": "val"}}}     VARIANT (object)
String Set     SS          {"SS": ["a", "b"]}               VARIANT (array)
Number Set     NS          {"NS": ["1", "2"]}               VARIANT (array)
Binary Set     BS          {"BS": ["dGhpcw=="]}             VARIANT (array)

A raw DynamoDB stream record for a simple order item looks like this:

{
  "orderId": {"S": "ORD-2026-001"},
  "customerId": {"S": "CUST-4821"},
  "totalAmount": {"N": "149.99"},
  "items": {"L": [
    {"M": {
      "sku": {"S": "SKU-100"},
      "qty": {"N": "2"},
      "price": {"N": "49.99"}
    }},
    {"M": {
      "sku": {"S": "SKU-205"},
      "qty": {"N": "1"},
      "price": {"N": "50.01"}
    }}
  ]},
  "isPrime": {"BOOL": true},
  "createdAt": {"S": "2026-02-25T10:30:00Z"}
}

Before this can land in Snowflake as a usable row, every type descriptor must be unwrapped. The {"S": "ORD-2026-001"} wrapper becomes just "ORD-2026-001", and {"N": "149.99"} becomes the number 149.99. Numbers in DynamoDB are always strings inside the descriptor, so you also need to cast them to numeric types.
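A recursive unwrapping pass is straightforward to sketch in Python (boto3 ships a similar deserializer; this hand-rolled version is illustrative, and choices like returning sets as plain lists and leaving binary values base64-encoded are assumptions, not a library API):

```python
def unwrap(value):
    """Convert one DynamoDB-typed attribute value into a plain Python value."""
    (type_tag, inner), = value.items()  # each value carries exactly one descriptor
    if type_tag == "S":
        return inner
    if type_tag == "N":
        # Numbers arrive as strings; the float round-trip is a sketch.
        # Use decimal.Decimal if you need full DynamoDB number precision.
        n = float(inner)
        return int(n) if n.is_integer() else n
    if type_tag == "BOOL":
        return inner
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [unwrap(v) for v in inner]
    if type_tag == "M":
        return {k: unwrap(v) for k, v in inner.items()}
    if type_tag in ("SS", "NS", "BS"):
        return list(inner)  # sets become plain lists; values stay as strings
    if type_tag == "B":
        return inner        # base64-encoded string; decode if you need raw bytes
    raise ValueError(f"Unknown DynamoDB type descriptor: {type_tag}")

def unwrap_item(item):
    """Unwrap every top-level attribute of a stream image."""
    return {k: unwrap(v) for k, v in item.items()}
```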

Document Flattening Strategies

Once type descriptors are stripped, you still need to decide how a nested, schema-less document maps to Snowflake’s columnar model.

Strategy 1: Full VARIANT Storage

Store each DynamoDB item as a single VARIANT column in Snowflake and use dot-notation queries at read time.

CREATE TABLE orders (
  pk VARCHAR,
  sk VARCHAR,
  item VARIANT,
  _streamkap_ts TIMESTAMP_NTZ
);

-- Query nested data
SELECT
  item:orderId::VARCHAR AS order_id,
  item:totalAmount::NUMBER(10,2) AS total,
  item:items[0]:sku::VARCHAR AS first_sku
FROM orders;

Pros: Zero schema management, handles any item shape. Cons: No columnar pruning benefits, every query must parse the VARIANT, harder for analysts unfamiliar with semi-structured syntax.

Strategy 2: Top-Level Flattening

Extract top-level attributes into dedicated columns during ingestion.

CREATE TABLE orders (
  order_id VARCHAR,
  customer_id VARCHAR,
  total_amount NUMBER(10,2),
  is_prime BOOLEAN,
  created_at TIMESTAMP_NTZ,
  items VARIANT,   -- nested list stays as VARIANT
  _cdc_operation VARCHAR,
  _cdc_timestamp TIMESTAMP_NTZ
);

Pros: Standard SQL queries, columnar storage benefits, analysts can use the table immediately. Cons: Requires defining the schema upfront, new top-level attributes need schema evolution.
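As a sketch of the ingestion-side work for this strategy, the following Python maps an already-unwrapped item onto the flattened columns above, filling NULL for missing attributes and keeping the nested list as JSON text destined for the VARIANT column. The COLUMN_MAP contents and function name are illustrative assumptions:

```python
import json

# Attributes promoted to dedicated Snowflake columns. The camelCase to
# snake_case mapping mirrors the DDL above and is illustrative.
COLUMN_MAP = {
    "orderId": "order_id",
    "customerId": "customer_id",
    "totalAmount": "total_amount",
    "isPrime": "is_prime",
    "createdAt": "created_at",
}

def flatten_top_level(item):
    """Build a row dict for the flattened orders table from an unwrapped item."""
    # Missing attributes become None, which loads as NULL.
    row = {col: item.get(attr) for attr, col in COLUMN_MAP.items()}
    # Nested structures stay as JSON text, loaded into VARIANT via PARSE_JSON.
    row["items"] = json.dumps(item["items"]) if "items" in item else None
    return row
```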

Strategy 3: Hybrid

Keep a flattened set of known columns plus a raw_item VARIANT column containing the full document. This gives you the best of both approaches: fast queries on known columns, and access to the full document when you need it.

Handling Updates and Deletes

With NEW_AND_OLD_IMAGES enabled, every stream record includes an eventName field:

  • INSERT: A new item was added. Only NewImage is present.
  • MODIFY: An existing item was updated. Both OldImage and NewImage are present.
  • REMOVE: An item was deleted. Only OldImage is present.
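A consumer can route on eventName to decide which image becomes the staged row. Here is a minimal sketch, assuming the images have already been converted from DynamoDB's typed format into plain dicts (the _cdc_* field names mirror the staging tables in this guide and are a naming convention, not a standard):

```python
def to_cdc_row(record):
    """Map one stream record to the row staged for Snowflake.

    Assumes NewImage/OldImage have already been unwrapped from
    DynamoDB's typed attribute format into plain dicts.
    """
    op = record["eventName"]        # INSERT | MODIFY | REMOVE
    images = record["dynamodb"]
    if op == "REMOVE":
        image = images["OldImage"]  # deletes carry only the before-image
    else:
        image = images["NewImage"]  # inserts and updates carry the after-image
    return {
        **image,
        "_cdc_operation": op,
        "_cdc_timestamp": images["ApproximateCreationDateTime"],
    }
```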

For Snowflake, you typically implement one of two patterns:

Append-only (immutable log): Write every event as a new row. Use QUALIFY ROW_NUMBER() OVER (PARTITION BY pk ORDER BY _cdc_timestamp DESC) = 1 to get the latest state. This preserves full history but increases storage.

Upsert (merge): Use Snowflake MERGE statements to apply inserts, updates, and deletes to a current-state table. This keeps the table compact and queryable with simple SELECT statements, but loses history unless you also maintain a separate audit table.

MERGE INTO orders_current AS target
USING orders_staging AS source
ON target.order_id = source.order_id
WHEN MATCHED AND source._cdc_operation = 'REMOVE' THEN DELETE
WHEN MATCHED THEN UPDATE SET
  target.customer_id = source.customer_id,
  target.total_amount = source.total_amount,
  target.is_prime = source.is_prime,
  target.updated_at = source._cdc_timestamp
WHEN NOT MATCHED AND source._cdc_operation != 'REMOVE' THEN INSERT
  (order_id, customer_id, total_amount, is_prime, updated_at)
  VALUES (source.order_id, source.customer_id, source.total_amount, source.is_prime, source._cdc_timestamp);

Schema-on-Read Challenges

DynamoDB tables are schema-less. Two items in the same table can have completely different attributes. This creates three problems when syncing to Snowflake:

Inconsistent attributes. Item A might have a shippingAddress attribute while Item B does not. Your pipeline needs to handle missing fields gracefully, either by inserting NULL or by using VARIANT columns for optional attributes.

Type polymorphism. The same attribute name might be a string in one item and a number in another. DynamoDB allows this; Snowflake does not (at least not within a single typed column). Your transformation layer must detect and handle these conflicts, typically by coercing to the broader type or falling back to VARIANT.

Attribute name evolution. Over time, application code changes may rename attributes (e.g., addr becomes address). Without a schema registry or mapping layer, your Snowflake table accumulates both old and new column names.
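The type-polymorphism case can be handled with a small coercion pass before loading. This sketch keeps a column's values unchanged when all non-null values share one Python type, and otherwise falls back to JSON text bound for a VARIANT column; the VARIANT fallback is one possible convention, not the only one:

```python
import json

def coerce_column(values):
    """Pick a single representation for a column's observed values.

    If every non-null value shares one Python type, keep the values
    as-is; otherwise encode each as JSON text for a VARIANT column.
    """
    types = {type(v) for v in values if v is not None}
    if len(types) <= 1:
        return values
    # Mixed types (e.g. str in one item, int in another): VARIANT fallback.
    return [json.dumps(v) if v is not None else None for v in values]
```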

Practical Pipeline Example

Consider an e-commerce orders table with approximately 50,000 writes per hour. Here is what the end-to-end flow looks like with a managed CDC platform:

  1. Source: DynamoDB table prod-orders with Streams enabled (NEW_AND_OLD_IMAGES).
  2. CDC connector reads from DynamoDB Streams, manages shard iteration, and checkpoints progress.
  3. Type unwrapping strips {"S": ...}, {"N": ...} wrappers and casts values to native types.
  4. Flattening extracts order_id, customer_id, total_amount, status, and created_at into columns. The items list stays as a JSON array.
  5. Snowflake sink writes micro-batches using Snowpipe Streaming for low latency, targeting an orders table in the analytics schema.
  6. Post-load transforms (dbt or Snowflake tasks) build a dim_orders model that joins with dim_customers.

With Streamkap, steps 2 through 5 are handled by configuring a DynamoDB source connector and a Snowflake destination connector. The platform manages shard discovery, type conversion, flattening, and Snowflake ingestion without requiring custom code.

Self-Managed vs Managed Pipelines

Building a DynamoDB-to-Snowflake pipeline with Lambda and S3 is straightforward in concept but demanding in practice. Here is what you take on:

Concern             Self-Managed (Lambda + S3)                  Managed CDC Platform
Shard management    You handle splits, merges, iterator expiry  Automated
Checkpointing       DynamoDB or custom checkpoint table         Built-in
Type unwrapping     Custom code in Lambda                       Automatic
Schema evolution    Manual ALTER TABLE statements               Automatic column addition
Error handling      Dead letter queues, retry logic             Built-in with alerting
Small file problem  Compaction jobs or careful buffering        Optimized batching
Monitoring          CloudWatch dashboards you build             Pre-built metrics
Snowflake auth      Manage key pairs, rotate credentials        Configured once

For teams running one or two tables with stable schemas, the Lambda approach is manageable. For larger deployments with dozens of tables and evolving schemas, the operational overhead compounds quickly.

Monitoring Your Pipeline

Regardless of architecture, there are key metrics to watch:

DynamoDB Streams Side

  • IteratorAge (CloudWatch): The age of the last record read from a shard. If this number climbs toward the 24-hour retention limit, your consumer is falling behind and you risk data loss. Set an alarm at 4 hours.
  • ReadThrottleEvents: Indicates your consumer is reading too aggressively. Back off or request a limit increase.
  • Number of active shards: A sudden spike means your table scaled up, and your consumer must handle more parallel reads.

Snowflake Side

  • Data freshness: Query the CDC timestamp column to measure how far behind Snowflake is from the source. A simple check:
SELECT
  DATEDIFF('second', MAX(_cdc_timestamp), CURRENT_TIMESTAMP()) AS lag_seconds
FROM analytics.orders;
  • Row counts: Compare the DynamoDB item count (from the table description; note DynamoDB refreshes this value only periodically, roughly every six hours) with the Snowflake row count. For append-only tables, Snowflake will have more rows (one per event). For upsert tables, counts should approximately match.
  • Failed loads: Monitor Snowpipe’s COPY_HISTORY table function or Snowpipe Streaming’s offset lag to catch ingestion failures early.

Alerting Baseline

At minimum, set alerts for: iterator age exceeding the 4-hour threshold noted above, data freshness exceeding your SLA (commonly 5 minutes for real-time use cases), and any ingestion errors in Snowflake. Most managed platforms, including Streamkap, provide these alerts out of the box along with pipeline health dashboards.

What This Means for Your Pipeline

Streaming DynamoDB to Snowflake bridges the gap between operational NoSQL workloads and analytical SQL queries. The core technical challenges are specific to DynamoDB: its typed attribute format requires unwrapping, its schema-less nature demands a flattening strategy, and its stream retention window leaves no room for downtime in your consumer. Whether you build on Lambda and S3 or use a managed CDC platform, understanding these fundamentals ensures your pipeline delivers accurate, timely data to Snowflake without surprises.