<--- Back to all resources

Database Technology

May 22, 2025

13 min read

MySQL Binary Log CDC Deep Dive: Formats, GTIDs, and Replication

Technical deep dive into MySQL binary log internals for CDC — covering row-based logging, GTID replication, event structure, and purge policies.

The MySQL binary log is the engine behind every CDC pipeline that reads from MySQL. It records every data change as a structured event, and CDC tools act as replication clients that read these events and forward them downstream. If you operate CDC pipelines against MySQL, understanding how the binlog works — formats, GTIDs, event structure, and purge behavior — directly affects your reliability and debugging speed.

Binary Log Basics

The binary log (binlog) is a sequence of files that records all changes to data and schema. MySQL appends events to the current binlog file and rotates to a new file when the current one reaches max_binlog_size (default 1 GB) or when you run FLUSH BINARY LOGS.

-- List all binlog files and their sizes
SHOW BINARY LOGS;
+------------------+-----------+
| Log_name         | File_size |
+------------------+-----------+
| mysql-bin.000042 | 1073741824|
| mysql-bin.000043 | 1073741824|
| mysql-bin.000044 |  536870912|
+------------------+-----------+

Each file has a numeric suffix that increments monotonically. The combination of filename + byte position within that file gives you a binlog coordinate — the traditional way to track replication position.

Binlog Formats: Statement vs Row vs Mixed

MySQL supports three binlog formats, and the choice matters enormously for CDC.

Statement-Based Replication (SBR)

SET GLOBAL binlog_format = 'STATEMENT';

Records the SQL statement text. A single UPDATE users SET last_login = NOW() WHERE active = 1 produces one binlog event regardless of how many rows it affects.

Why it breaks CDC: The statement NOW() returns a different value on the replica than it did on the source. UUID(), RAND(), user-defined variables, and session-specific settings all create the same problem. A CDC tool reading statement-based events cannot reliably reconstruct the actual row changes.

Row-Based Replication (RBR)

SET GLOBAL binlog_format = 'ROW';

Records the actual row data that changed — the before image and after image for each affected row. That same UPDATE might produce 10,000 individual row-change events if 10,000 rows matched.

Why CDC requires it: Row events are deterministic. The CDC consumer sees exactly which columns changed and what the new values are. No ambiguity, no non-deterministic functions, no session state dependency.

Mixed

SET GLOBAL binlog_format = 'MIXED';

Uses statement-based by default, switching to row-based for statements that MySQL detects as unsafe for statement replication. The problem is that “unsafe” detection is not perfect — some edge cases slip through. Mixed mode is not reliable for CDC.

The rule: always use binlog_format = ROW for CDC. Every major CDC tool requires it.

Row Image Settings

With row-based format, binlog_row_image controls which columns appear in the event:

SHOW VARIABLES LIKE 'binlog_row_image';
SettingBefore ImageAfter ImageCDC Impact
FULLAll columnsAll columnsBest for CDC — full context
MINIMALPK onlyChanged columns onlySmaller events, but downstream may miss unchanged columns
NOBLOBAll except unchanged BLOBsAll except unchanged BLOBsCompromise for tables with large BLOB columns

For CDC pipelines, FULL is the safest default. It means your downstream systems always get the complete row, making it easy to build materialized views, update search indexes, or sync to a warehouse. If binlog volume is a concern on tables with wide rows or large BLOB columns, NOBLOB is a reasonable middle ground.

Binlog Event Structure

Each binlog event has a fixed header and a type-specific body. The events that matter for CDC:

Event TypeDescription
QUERY_EVENTDDL statements (CREATE TABLE, ALTER TABLE) and transaction BEGIN
TABLE_MAP_EVENTMaps a table ID to database.table name + column types
WRITE_ROWS_EVENTINSERT — contains the new row data
UPDATE_ROWS_EVENTUPDATE — contains before and after row data
DELETE_ROWS_EVENTDELETE — contains the deleted row data
XID_EVENTTransaction COMMIT
GTID_EVENTGTID for the following transaction

A typical transaction in the binlog looks like this:

GTID_EVENT         → server_uuid:42
QUERY_EVENT        → BEGIN
TABLE_MAP_EVENT    → table_id=85, db=myapp, table=orders
WRITE_ROWS_EVENT   → {id: 1001, customer_id: 55, total: 299.99, ...}
XID_EVENT          → COMMIT

You can inspect binlog events directly with mysqlbinlog:

# Decode a binlog file with row event details
mysqlbinlog --verbose --base64-output=DECODE-ROWS mysql-bin.000044

# Decode only events for a specific database
mysqlbinlog --verbose --base64-output=DECODE-ROWS \
  --database=myapp mysql-bin.000044

# Decode events from a specific position
mysqlbinlog --verbose --base64-output=DECODE-ROWS \
  --start-position=4567890 mysql-bin.000044

The --verbose flag translates row events into pseudo-SQL so you can see the actual column values:

### UPDATE `myapp`.`orders`
### WHERE
###   @1=1001
###   @2=55
###   @3=299.99
###   @4='pending'
### SET
###   @1=1001
###   @2=55
###   @3=299.99
###   @4='shipped'

Column names are not in the binlog event itself — they are referenced by ordinal position (@1, @2, etc.). CDC tools resolve these positions to column names using the table schema, which is why schema changes require careful handling. See schema change management for strategies.

GTID-Based Replication

Global Transaction Identifiers solve a fundamental problem with traditional binlog-position tracking: what happens during a failover.

The Failover Problem

With file+position tracking, a CDC tool records something like “I’ve read up to mysql-bin.000044 position 892341.” If the primary fails and you promote a replica, the new primary has different binlog files with different positions. The CDC tool’s bookmark is meaningless.

How GTIDs Work

Each transaction gets a globally unique identifier:

source_uuid:transaction_id

For example: 3E11FA47-71CA-11E1-9E33-C80AA9429562:42

The source_uuid is the server’s server_uuid (generated on first startup), and the transaction_id is a monotonically increasing sequence number per server.

A CDC tool tracking GTIDs says: “I’ve processed all transactions in this GTID set: 3E11FA47-...:1-4582.” After a failover to a replica, the new primary has the same GTID history (because it replicated those transactions), so the CDC tool can resume from exactly the right point.

Enabling GTIDs

-- Check current GTID status
SHOW VARIABLES LIKE '%gtid%';

-- Required settings in my.cnf
-- gtid_mode = ON
-- enforce_gtid_consistency = ON
# my.cnf / my.ini
[mysqld]
gtid_mode = ON
enforce_gtid_consistency = ON
log_bin = mysql-bin
binlog_format = ROW
binlog_row_image = FULL

In MySQL 8.0+, you can enable GTIDs online without restarting, though it requires stepping through intermediate states:

SET GLOBAL gtid_mode = OFF_PERMISSIVE;
SET GLOBAL gtid_mode = ON_PERMISSIVE;
-- Wait for all anonymous transactions to complete
SET GLOBAL enforce_gtid_consistency = ON;
SET GLOBAL gtid_mode = ON;

Checking GTID Position

-- Executed GTID set on this server
SELECT @@global.gtid_executed;

-- GTID set that has been purged (no longer in binlogs)
SELECT @@global.gtid_purged;

If a CDC consumer needs a GTID that falls within gtid_purged, it means those binlog files have been deleted. The consumer must re-snapshot. This is the GTID equivalent of the binlog-position-purge problem.

server_id Conflicts

Every MySQL replication client (replicas and CDC tools) must have a unique server_id. When two clients connect with the same server_id, MySQL disconnects one of them — usually the one that connected first.

This manifests as intermittent disconnections in your CDC pipeline. The logs show the consumer connecting, reading for a while, then getting killed when something else connects with the same ID.

-- Check current server_id
SELECT @@server_id;

-- See all connected replication threads
SHOW SLAVE HOSTS;  -- MySQL 5.7
SHOW REPLICAS;     -- MySQL 8.0.22+

Best practice: Assign server_id values from different ranges for different purposes. For example: 1-100 for read replicas, 1001-2000 for CDC consumers. Document the assignments.

Binlog Purge Policies

MySQL does not keep binlog files forever. The retention policy determines how long you have to recover if a CDC consumer goes down.

MySQL 8.0+

-- Check current retention (seconds, default = 2592000 = 30 days)
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';

-- Set to 7 days
SET GLOBAL binlog_expire_logs_seconds = 604800;

MySQL 5.7

-- Check current retention (days, default = 0 = never expire)
SHOW VARIABLES LIKE 'expire_logs_days';

-- Set to 7 days
SET GLOBAL expire_logs_days = 7;

Manual Purge

-- Purge binlogs older than a specific file
PURGE BINARY LOGS TO 'mysql-bin.000040';

-- Purge binlogs older than a date
PURGE BINARY LOGS BEFORE '2025-05-01 00:00:00';

Sizing retention for CDC: Set retention to at least 2x your longest expected CDC downtime. If your CDC pipeline might be down for maintenance for 24 hours, keep at least 48 hours of binlogs. Monitor binlog disk usage with:

-- Total binlog disk usage
SHOW BINARY LOGS;

-- Current binlog file
SHOW MASTER STATUS;

If you are running streaming data pipelines with strict freshness requirements, you want retention long enough to survive incidents but not so long that you waste disk on data nobody will read.

Monitoring CDC Health from MySQL’s Side

Active Replication Connections

-- See all replication consumers
SHOW PROCESSLIST;
-- Look for Command = 'Binlog Dump' or 'Binlog Dump GTID'

Binlog Throughput

-- In MySQL 8.0 with performance_schema
SELECT * FROM performance_schema.global_status
WHERE VARIABLE_NAME IN (
    'Binlog_cache_disk_use',
    'Binlog_cache_use',
    'Binlog_stmt_cache_disk_use',
    'Com_binlog'
);

Consumer Position vs Current Position

-- Current binlog position
SHOW MASTER STATUS;

-- Compare against your CDC tool's reported position
-- If the gap is growing, the consumer is falling behind

If you are seeing growing lag between where MySQL is writing and where your CDC consumer is reading, the troubleshooting is similar to Kafka consumer lag debugging — identify whether the bottleneck is the source (too many events), the network, or the sink (slow writes to the destination).

Configuration Checklist for CDC

Here is the minimum my.cnf configuration for a MySQL instance serving CDC consumers:

[mysqld]
# Binary logging
log_bin = mysql-bin
binlog_format = ROW
binlog_row_image = FULL
binlog_expire_logs_seconds = 604800  # 7 days

# GTIDs (strongly recommended)
gtid_mode = ON
enforce_gtid_consistency = ON

# Replication slots/consumers
server_id = 1

# Allow CDC tool connections
# Ensure the CDC user has REPLICATION SLAVE and REPLICATION CLIENT privileges

Required User Privileges

CREATE USER 'cdc_user'@'%' IDENTIFIED BY 'strong_password';
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'cdc_user'@'%';
GRANT SELECT ON myapp.* TO 'cdc_user'@'%';  -- For initial snapshot
FLUSH PRIVILEGES;

Common Pitfalls

1. Using statement-based or mixed binlog format CDC tools will either refuse to connect or produce incorrect data. Always verify binlog_format = ROW before setting up a pipeline.

2. Forgetting binlog_row_image = FULL With MINIMAL, your downstream system only sees changed columns. If it needs full rows (most warehouses and search indexes do), you get incomplete records with no clear error.

3. Aggressive binlog purging If you purge binlogs before your CDC consumer has read them, the consumer cannot resume. It must re-snapshot the entire database, which is slow and puts load on the source. For handling schema changes gracefully during this process, see CDC schema evolution at zero downtime.

4. server_id collisions Two CDC tools (or a CDC tool and a replica) with the same server_id will fight for the connection. One of them will keep disconnecting.

5. Not monitoring binlog disk usage Binlog files consume real disk. If you set retention to 30 days on a write-heavy database, you might need hundreds of GB just for binlogs. Monitor and alert on binlog directory size.

Managed CDC and the Binlog

When you run a managed CDC platform like Streamkap against MySQL, the platform handles binlog-position tracking (or GTID tracking), server_id assignment, automatic re-snapshotting when binlogs are purged, and consumer lifecycle management. You still need to configure the MySQL side correctly — binlog_format = ROW, appropriate retention, and user privileges — but the operational burden of managing the consumer shifts to the platform.

Understanding the internals in this article helps you configure MySQL correctly, debug issues when they arise, and have informed conversations about replication topology and retention policies with your DBA.


Ready to stream MySQL changes without managing binlog consumers? Streamkap connects to MySQL’s binary log with automatic GTID tracking, handles snapshots and recovery, and delivers row-level changes to your destination in real time. Start a free trial or explore available connectors.