
Engineering

February 25, 2026

10 min read

MySQL to Databricks: Streaming CDC to Your Lakehouse

Stream real-time MySQL changes to Databricks using CDC. Learn how to build a lakehouse pipeline with Delta Lake, handle schema evolution, and enable real-time analytics.

TL;DR:

  • MySQL CDC reads the binlog to capture every INSERT, UPDATE, and DELETE, streaming changes to Databricks Delta Lake tables with ACID guarantees.
  • Delta Lake's MERGE INTO operation is the key to applying CDC changes. It handles upserts and deletes in a single atomic transaction.
  • Managed CDC platforms eliminate the complexity of running MySQL binlog readers, Kafka clusters, and Spark jobs yourself.

Streaming MySQL data into Databricks with Change Data Capture gives your lakehouse a live connection to your transactional systems. Instead of waiting hours for batch exports to land, every INSERT, UPDATE, and DELETE in MySQL flows into Delta Lake tables within seconds. This enables real-time analytics, fresh dashboards, and ML feature stores that actually reflect what is happening in your business right now.

MySQL powers a massive share of the world’s OLTP workloads. Databricks, with its lakehouse architecture built on Delta Lake, has become the platform of choice for unifying analytics and AI. Bridging the two with CDC means you can retire fragile nightly ETL scripts and replace them with a continuous, low-latency pipeline that keeps your lakehouse in lockstep with production.

Architecture Options for MySQL CDC to Databricks

There is no single way to move CDC data from MySQL to Databricks. The right architecture depends on your latency requirements, existing infrastructure, and operational appetite.

CDC to Kafka to Databricks

This is the most common pattern in production environments. A CDC connector such as Debezium reads the MySQL binlog and publishes change events to Apache Kafka topics. On the Databricks side, Spark Structured Streaming reads from those Kafka topics and writes to Delta Lake tables.

The advantage is decoupling. Kafka acts as a durable buffer between the source and the destination. If Databricks is temporarily unavailable for maintenance, Kafka retains the events until consumption resumes. The tradeoff is operational complexity: you now have a Kafka cluster to provision, monitor, and scale.

CDC to Cloud Storage to Auto Loader

In this approach, CDC events are written as files (JSON, Avro, or Parquet) to Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage. Databricks Auto Loader then incrementally ingests those files into Delta Lake tables.

This pattern avoids the need for a persistent streaming cluster on the Databricks side. Auto Loader uses file notification or directory listing to discover new files efficiently, making it cost-effective for workloads where sub-second latency is not required. Typical end-to-end latency ranges from one to five minutes depending on file flush intervals.

Direct Managed Streaming

Managed CDC platforms like Streamkap abstract the entire pipeline. You configure a MySQL source and a Databricks Delta Lake destination, and the platform handles binlog reading, event serialization, delivery, and exactly-once semantics. There is no Kafka cluster or S3 staging bucket for you to manage.

This option delivers the lowest operational burden while maintaining low latency, making it a strong fit for teams that want to focus on analytics rather than pipeline plumbing.

MySQL CDC Setup

Before any CDC tool can read change events, MySQL must be configured to produce them. The binary log is the foundation of everything that follows.

Binary Log Configuration

Set the following in your MySQL configuration file (my.cnf or my.ini):

[mysqld]
server-id         = 1
log_bin           = mysql-bin
binlog_format     = ROW
binlog_row_image  = FULL
expire_logs_days  = 7

  • binlog_format=ROW ensures that actual row-level data changes are logged, not the SQL statements that caused them. Statement-based logging is ambiguous for CDC because non-deterministic functions produce different results on replay.
  • binlog_row_image=FULL writes both the before and after images of every modified row. This is essential for producing accurate UPDATE and DELETE events downstream.
  • expire_logs_days (or binlog_expire_logs_seconds on MySQL 8.0+) controls how long binlog files are retained. Set this high enough that your CDC tool can complete an initial snapshot before old logs are purged.
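After restarting MySQL, it is worth confirming that the settings took effect before pointing a connector at the server. These standard MySQL statements read the live values:

```sql
-- Verify binlog configuration from any MySQL session
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';
SHOW VARIABLES LIKE 'binlog_row_image';
```

`log_bin` should report `ON`, `binlog_format` should be `ROW`, and `binlog_row_image` should be `FULL`; any other value means the configuration file was not picked up.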

GTID Mode

Global Transaction Identifiers make binlog position tracking deterministic across failovers. If your MySQL deployment uses replication, enabling GTIDs is strongly recommended:

gtid_mode                = ON
enforce_gtid_consistency = ON

With GTIDs enabled, the CDC connector can resume from the exact transaction after a restart or failover without ambiguity.

User Permissions

Create a dedicated MySQL user for the CDC connector with the minimum required privileges:

CREATE USER 'cdc_user'@'%' IDENTIFIED BY 'strong_password';
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'cdc_user'@'%';
GRANT SELECT ON your_database.* TO 'cdc_user'@'%';
FLUSH PRIVILEGES;

The SELECT grant is needed for the initial consistent snapshot. REPLICATION SLAVE and REPLICATION CLIENT are required to read the binlog stream.

Databricks Ingestion Methods

Once CDC events leave MySQL, they need to land in Delta Lake. Databricks offers several ingestion mechanisms, each suited to different pipeline architectures.

Auto Loader

Auto Loader incrementally processes new files as they arrive in cloud storage. It is ideal for the CDC-to-storage pattern:

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/mysql_cdc_schema")
      .load("/mnt/data/mysql-cdc-events/"))

df.writeStream \
  .format("delta") \
  .option("checkpointLocation", "/mnt/checkpoints/mysql_cdc") \
  .option("mergeSchema", "true") \
  .toTable("bronze.mysql_cdc_raw")

Auto Loader handles schema inference and evolution automatically. When new columns appear in the CDC events, it merges them into the target schema without manual intervention.

Structured Streaming from Kafka

When CDC events flow through Kafka, use Spark Structured Streaming to consume them directly:

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "mysql.inventory.orders")
      .option("startingOffsets", "earliest")
      .load())

This approach gives you the lowest latency path into Databricks, with events typically arriving within seconds of the MySQL commit.
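The Kafka `value` column arrives as raw bytes, so the Debezium envelope must be parsed before merging. In Spark you would typically do this with `from_json()` and an explicit schema; the flattening logic itself can be sketched in plain Python (the sample event below is an illustrative, simplified payload, not a complete Debezium message):

```python
import json

# A simplified Debezium-style change event (illustrative payload shape).
raw_event = b'''{
  "payload": {
    "op": "u",
    "ts_ms": 1718000000000,
    "before": {"id": 42, "status": "pending"},
    "after":  {"id": 42, "status": "shipped"}
  }
}'''

def parse_change_event(value: bytes) -> dict:
    """Flatten a Debezium envelope into a row ready for MERGE staging."""
    payload = json.loads(value)["payload"]
    op = payload["op"]
    # For deletes the final row state lives in "before"; otherwise in "after".
    row = payload["before"] if op == "d" else payload["after"]
    return {**row, "op": op, "ts_ms": payload["ts_ms"]}

print(parse_change_event(raw_event))
# e.g. {'id': 42, 'status': 'shipped', 'op': 'u', 'ts_ms': 1718000000000}
```

Keeping `op` and `ts_ms` alongside the row columns is what makes the later deduplication and MERGE steps possible.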

Delta Live Tables

Delta Live Tables (DLT) provide a declarative framework for building reliable pipelines. You define expectations and transformations, and DLT handles orchestration, monitoring, and error recovery:

import dlt
from pyspark.sql.functions import expr

@dlt.table(comment="Raw CDC events from MySQL orders table")
def orders_raw():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/data/mysql-cdc/orders/"))

dlt.create_streaming_table("orders", comment="Materialized orders with CDC applied")

dlt.apply_changes(
    target="orders",
    source="orders_raw",
    keys=["id"],
    sequence_by="ts_ms",
    apply_as_deletes=expr("op = 'd'"),
    except_column_list=["_rescued_data"]
)

DLT’s apply_changes API is purpose-built for CDC workloads and handles the MERGE logic internally.

Applying CDC Changes with MERGE INTO

The core of any CDC-to-lakehouse pipeline is the MERGE INTO statement. It takes a batch of change events and atomically applies inserts, updates, and deletes to the target Delta table.

CDC events from tools like Debezium carry an operation type field (op): c for create, u for update, d for delete, and r for read (snapshot). The MERGE maps these operations to SQL actions:

MERGE INTO lakehouse.orders AS target
USING staging.orders_cdc AS source
ON target.id = source.id

WHEN MATCHED AND source.op = 'd' THEN
  DELETE

WHEN MATCHED AND source.op IN ('u', 'c') THEN
  UPDATE SET
    target.customer_id  = source.customer_id,
    target.total_amount = source.total_amount,
    target.status       = source.status,
    target.updated_at   = source.updated_at

WHEN NOT MATCHED AND source.op != 'd' THEN
  INSERT (id, customer_id, total_amount, status, updated_at)
  VALUES (source.id, source.customer_id, source.total_amount, source.status, source.updated_at);

A few things to watch for:

  • Ordering matters. If multiple changes to the same row arrive in a single micro-batch, you need to deduplicate and keep only the latest event (by timestamp or binlog position) before running the MERGE.
  • Soft deletes vs. hard deletes. Some teams prefer to mark rows as deleted with a flag rather than physically removing them. Replace the DELETE clause with an UPDATE SET is_deleted = true if that matches your requirements.
  • Idempotency. A well-constructed MERGE is naturally idempotent. If the same batch is replayed after a failure, the result is the same.
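The deduplication step from the first bullet can be sketched in plain Python, with a hypothetical `events` list standing in for one micro-batch (in Spark this is usually a `row_number()` window partitioned by the key and ordered by `ts_ms` descending):

```python
# Keep only the latest event per primary key before running MERGE.
# Each event carries the key column, a sequencing field (ts_ms or
# binlog position), and the Debezium op code.
events = [
    {"id": 1, "op": "c", "ts_ms": 100, "status": "pending"},
    {"id": 1, "op": "u", "ts_ms": 200, "status": "shipped"},
    {"id": 2, "op": "c", "ts_ms": 150, "status": "pending"},
    {"id": 2, "op": "d", "ts_ms": 300, "status": "pending"},
]

def deduplicate(batch, key="id", seq="ts_ms"):
    latest = {}
    for event in batch:
        k = event[key]
        # Later sequence values win; ties keep the last event seen.
        if k not in latest or event[seq] >= latest[k][seq]:
            latest[k] = event
    return list(latest.values())

deduped = deduplicate(events)
# Row 1 keeps the update (ts_ms=200); row 2 keeps the delete (ts_ms=300).
```

Running the MERGE on `deduped` rather than the raw batch guarantees at most one action per key per transaction.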

Schema Evolution

Production MySQL databases change constantly. Columns get added, types get widened, and tables get renamed. Your CDC pipeline needs to handle this gracefully or it will break at the worst possible time.

Column Additions and Removals

When a new column is added to a MySQL table, the CDC connector includes it in subsequent events. On the Databricks side, Delta Lake’s schema evolution support handles this automatically if you enable the mergeSchema option:

df.writeStream \
  .option("mergeSchema", "true") \
  .format("delta") \
  .toTable("bronze.orders")

Column removals are trickier. The CDC events simply stop including the dropped column. Existing rows in the Delta table retain the old column with null values going forward. You should periodically clean up stale columns with ALTER TABLE DROP COLUMN in Databricks if they are no longer needed.
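A simple way to make both cases visible is to diff the incoming event's fields against the known table schema. This sketch uses hypothetical column sets; in practice you would read the Delta schema from the table and the event fields from the CDC payload:

```python
# Compare an incoming CDC event's fields against the known Delta schema
# to flag additions (mergeSchema handles these) and removals
# (candidates for a later ALTER TABLE DROP COLUMN cleanup).
known_columns = {"id", "customer_id", "total_amount", "status"}
event_columns = {"id", "customer_id", "total_amount", "status", "discount"}

added = event_columns - known_columns    # new in MySQL, safe to auto-merge
removed = known_columns - event_columns  # dropped upstream, will go null

print(sorted(added), sorted(removed))
```

Logging these diffs on every batch gives you an audit trail of upstream schema drift even when the pipeline absorbs it silently.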

Type Changes

MySQL type changes such as widening a VARCHAR(50) to VARCHAR(255) or changing an INT to a BIGINT are generally safe because they map to compatible Spark types. However, narrowing conversions or incompatible changes (like INT to VARCHAR) require manual intervention. Monitor your CDC connector’s schema change logs and apply corresponding ALTER TABLE statements in Databricks when necessary.

ALTER TABLE Propagation

Some CDC connectors emit DDL events alongside DML events. If your connector supports it, you can build automation that intercepts ALTER TABLE events and applies the equivalent DDL to your Delta tables. This closes the loop on fully automated schema evolution.

Delta Lake Optimization

Delta tables accumulate small files over time, especially with streaming writes that produce frequent micro-batches. Regular maintenance keeps query performance high.

OPTIMIZE and Z-ORDER

Run OPTIMIZE to compact small files into larger ones, and Z-ORDER on frequently filtered columns to co-locate related data:

OPTIMIZE lakehouse.orders
ZORDER BY (customer_id, status);

Schedule this as a nightly or weekly job depending on write volume. For high-throughput tables, Databricks also supports auto-compaction and optimized writes that reduce the need for manual OPTIMIZE runs.

VACUUM

Delta Lake retains old file versions for time travel. The VACUUM command removes files older than the retention threshold:

VACUUM lakehouse.orders RETAIN 168 HOURS;

The default retention is seven days. Do not set it below this unless you have disabled the safety check, as active readers may still reference older file versions.

Table Maintenance Checklist

  • Enable auto-compaction (delta.autoOptimize.autoCompact = true) on streaming target tables.
  • Enable optimized writes (delta.autoOptimize.optimizeWrite = true) to reduce file fragmentation at write time.
  • Schedule OPTIMIZE with ZORDER for analytical query patterns.
  • Run VACUUM on a regular cadence to reclaim storage.
  • Monitor table size and file counts with DESCRIBE DETAIL lakehouse.orders.
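The first two checklist items are table properties; this snippet assumes the `lakehouse.orders` table used in the earlier examples:

```sql
ALTER TABLE lakehouse.orders SET TBLPROPERTIES (
  'delta.autoOptimize.autoCompact'   = 'true',
  'delta.autoOptimize.optimizeWrite' = 'true'
);
```

Setting these once on the table means every streaming writer benefits, without changing any job code.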

Data Type Mapping

MySQL and Spark/Delta use different type systems. Getting the mapping right prevents silent data corruption.

| MySQL Type | Spark / Delta Type | Notes |
|---|---|---|
| TINYINT | ByteType (or BooleanType for TINYINT(1)) | TINYINT(1) is often used as boolean |
| SMALLINT | ShortType | |
| INT | IntegerType | |
| BIGINT | LongType | |
| FLOAT | FloatType | |
| DOUBLE | DoubleType | |
| DECIMAL(p,s) | DecimalType(p,s) | Precision and scale preserved |
| VARCHAR / TEXT | StringType | |
| DATE | DateType | |
| DATETIME / TIMESTAMP | TimestampType | Timezone handling varies by connector |
| JSON | StringType | Parse with from_json() in Spark |
| ENUM | StringType | Serialized as the string value |
| SET | StringType | Comma-separated string of values |
| BLOB / BINARY | BinaryType | |
| GEOMETRY / POINT | StringType (WKT or GeoJSON) | Connector-dependent serialization |

Watch out for DATETIME vs. TIMESTAMP semantics. MySQL DATETIME has no timezone, while TIMESTAMP is stored in UTC. Ensure your CDC connector normalizes these consistently, and document the convention for downstream consumers.

ENUM and SET types deserve attention. Most CDC connectors serialize ENUM as the string label rather than the internal integer index. Verify this behavior in your connector’s documentation to avoid subtle bugs where an enum value of 2 becomes the string "2" instead of its label.
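The TINYINT(1) and DECIMAL special cases can be captured in a small lookup helper. This is a sketch mirroring the table above; real connectors expose this mapping through configuration rather than application code:

```python
import re

# Simplified MySQL -> Spark type lookup mirroring the mapping table above.
BASE_TYPES = {
    "TINYINT": "ByteType", "SMALLINT": "ShortType", "INT": "IntegerType",
    "BIGINT": "LongType", "FLOAT": "FloatType", "DOUBLE": "DoubleType",
    "VARCHAR": "StringType", "TEXT": "StringType", "DATE": "DateType",
    "DATETIME": "TimestampType", "TIMESTAMP": "TimestampType",
    "JSON": "StringType", "ENUM": "StringType", "SET": "StringType",
    "BLOB": "BinaryType", "BINARY": "BinaryType",
}

def spark_type(mysql_type: str) -> str:
    """Map a MySQL column type declaration to a Spark type name."""
    # TINYINT(1) is conventionally a boolean flag in MySQL schemas.
    if re.fullmatch(r"TINYINT\(1\)", mysql_type, re.IGNORECASE):
        return "BooleanType"
    base = re.match(r"[A-Za-z]+", mysql_type).group(0).upper()
    if base == "DECIMAL":
        # Preserve precision and scale: DECIMAL(10,2) -> DecimalType(10,2)
        return mysql_type.upper().replace("DECIMAL", "DecimalType")
    return BASE_TYPES.get(base, "StringType")

print(spark_type("TINYINT(1)"))     # BooleanType
print(spark_type("DECIMAL(10,2)"))  # DecimalType(10,2)
print(spark_type("VARCHAR(255)"))   # StringType
```

Falling back to StringType for unknown types is a deliberately lossless default: the value survives, and downstream consumers can cast it later.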

Self-Managed vs. Managed Pipelines

Building a MySQL-to-Databricks CDC pipeline from open-source components is entirely possible. You would run Debezium on Kafka Connect, operate a Kafka cluster for event buffering, and write Spark Structured Streaming jobs to consume and merge events into Delta Lake. Each of these components needs its own monitoring, scaling, and failure recovery.

Here is what the self-managed stack looks like in practice:

  • Debezium + Kafka Connect: Deploy, configure connectors, manage offsets, handle rebalances.
  • Kafka cluster: Provision brokers, manage topics and retention, monitor consumer lag.
  • Spark Streaming jobs: Write MERGE logic, handle schema evolution, tune micro-batch intervals.
  • Monitoring: Stitch together metrics from Debezium, Kafka, and Spark into a single view.
  • Failure recovery: Build runbooks for connector failures, Kafka broker outages, and checkpoint corruption.

A managed platform like Streamkap collapses this entire stack into a configured source and destination pair. The platform handles binlog reading, event delivery, schema evolution, and exactly-once writes to Delta Lake. You get a single dashboard for monitoring pipeline health, latency, and throughput instead of juggling three separate systems.

The decision often comes down to team size and operational maturity. If you already run Kafka and have a dedicated platform team, self-managed gives you maximum control. If your priority is shipping analytics faster with a small data team, managed CDC removes the infrastructure overhead entirely.

Monitoring and Observability

A CDC pipeline is only as reliable as your ability to detect problems before they affect downstream consumers.

Binlog Position and Lag

Track the CDC connector’s current binlog position (or GTID) relative to the MySQL primary’s latest position. The gap between the two is your replication lag. A healthy pipeline keeps this under a few seconds. If the gap grows, investigate connector throughput, network latency, or resource constraints on the connector host.

-- On MySQL: check current binlog position
-- (SHOW MASTER STATUS is replaced by SHOW BINARY LOG STATUS in MySQL 8.2+)
SHOW MASTER STATUS;

-- Check replication lag if reading from a replica
-- (SHOW SLAVE STATUS is replaced by SHOW REPLICA STATUS in MySQL 8.0.22+)
SHOW SLAVE STATUS\G

Delta Table Metrics

Databricks provides table-level metrics that reveal ingestion health:

DESCRIBE HISTORY lakehouse.orders LIMIT 10;

This shows recent operations, the number of rows affected, and timing. Look for anomalies: a MERGE that suddenly affects zero rows may indicate that upstream events have stopped flowing.

Data Freshness

Define a freshness SLA for each table and alert when it is breached. A simple approach is to compare the maximum updated_at timestamp in the Delta table against the current wall clock:

SELECT TIMESTAMPDIFF(
  SECOND,
  MAX(updated_at),
  CURRENT_TIMESTAMP()
) AS staleness_seconds
FROM lakehouse.orders;

If staleness_seconds exceeds your threshold, trigger an alert. Managed platforms like Streamkap provide built-in freshness monitoring and alerting, so you do not need to build this instrumentation yourself.
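The alerting side of that query can be sketched in plain Python; the function name and the five-minute SLA below are illustrative, and in production the `max_updated_at` value would come from the SQL query above:

```python
from datetime import datetime, timezone

FRESHNESS_SLA_SECONDS = 300  # illustrative 5-minute SLA

def check_freshness(max_updated_at, now=None):
    """Return (staleness in seconds, whether the SLA is breached)."""
    now = now or datetime.now(timezone.utc)
    staleness = int((now - max_updated_at).total_seconds())
    return staleness, staleness > FRESHNESS_SLA_SECONDS

# Example: table last updated 10 minutes before "now"
now = datetime(2026, 2, 25, 12, 10, tzinfo=timezone.utc)
last = datetime(2026, 2, 25, 12, 0, tzinfo=timezone.utc)
staleness, breached = check_freshness(last, now)
print(staleness, breached)  # 600 True
```

Wiring the boolean into your alerting system (PagerDuty, Slack, etc.) turns the freshness SLA from a convention into an enforced contract.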

Key Metrics to Track

  • Connector lag: Difference between MySQL binlog head and connector read position.
  • Kafka consumer lag: Number of unconsumed messages per partition (if using Kafka).
  • Micro-batch duration: Time each Spark Structured Streaming batch takes to process.
  • MERGE throughput: Rows merged per second into the Delta table.
  • File count: Number of files in the Delta table directory; rising counts signal a need for OPTIMIZE.
  • Schema change events: Log and alert on any DDL changes detected by the connector.

By combining source-side, transport-layer, and destination-side metrics, you get end-to-end visibility into your MySQL-to-Databricks pipeline and can diagnose issues at any point in the chain.