Schema Registry in Stream Processing: Why Your Streams Need a Contract
Learn why schema registry is essential for production streaming pipelines. Understand schema evolution, compatibility modes, and how to prevent breaking changes in real-time data.
Every streaming pipeline makes an implicit promise: the data a producer sends is the data a consumer expects to receive. When that promise holds, everything works. When it breaks - when a producer starts sending a field that used to be an integer as a string, or drops a column a downstream model depends on - things fail. Sometimes loudly. More often, silently.
Schema drift is one of the most common causes of production streaming pipeline failures; ask most data engineering teams and they will tell you that a large share of their pipeline incidents trace back to unexpected schema changes. The fix is not more vigilant engineers or better runbooks. The fix is a contract - an enforceable, versioned agreement about what data looks like. That contract is called a schema, and the service that manages it is a schema registry.
This guide walks through what a schema registry does, why it matters, how it integrates with Kafka and CDC pipelines, and how to use it effectively in production.
What a Schema Registry Does
A schema registry is a centralized service that stores, versions, and enforces schemas for streaming data. It sits alongside your message broker (typically Kafka) and acts as the single source of truth for the structure of every topic.
The core responsibilities are straightforward:
- Store schemas. Every schema is registered under a subject (usually tied to a Kafka topic) and assigned a unique numeric ID.
- Version schemas. When a schema changes, the registry stores the new version alongside all previous versions, creating a complete history.
- Enforce compatibility. Before a new schema version is accepted, the registry checks it against the existing versions using configurable compatibility rules. If the change would break consumers, the registration is rejected.
This means that a producer cannot push data into a topic unless the data’s schema passes the registry’s compatibility check. Consumers retrieve the schema by ID when deserializing, so they always know exactly what structure to expect.
Without a registry, the only way to discover a breaking schema change is when something downstream fails. With a registry, the breaking change is caught at the moment of registration - before a single bad record enters the pipeline.
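The registry's core bookkeeping can be sketched in a few lines. This is an illustrative in-memory model, not the API of any real registry client: subjects map to version lists, and every distinct schema gets a globally unique numeric ID. A production registry would also run the compatibility check inside register before accepting a new version.

```python
# Illustrative in-memory model of a schema registry's bookkeeping.
# Class and method names are ours, not any real client library's.
class InMemorySchemaRegistry:
    def __init__(self):
        self._next_id = 1
        self._subjects = {}   # subject -> list of (version, schema_id, schema)
        self._by_id = {}      # schema_id -> schema

    def register(self, subject, schema):
        versions = self._subjects.setdefault(subject, [])
        # Re-registering an identical schema returns the existing ID.
        for _, schema_id, existing in versions:
            if existing == schema:
                return schema_id
        schema_id = self._next_id
        self._next_id += 1
        versions.append((len(versions) + 1, schema_id, schema))
        self._by_id[schema_id] = schema
        return schema_id

    def get_by_id(self, schema_id):
        return self._by_id[schema_id]

    def latest(self, subject):
        return self._subjects[subject][-1]   # (version, schema_id, schema)

registry = InMemorySchemaRegistry()
v1_id = registry.register("orders-value", '{"type": "record", "name": "Order"}')
v2_id = registry.register("orders-value", '{"type": "record", "name": "OrderV2"}')
```

Consumers only ever need get_by_id, which is why the lookup is cheap to cache: a schema ID never changes meaning once assigned.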
Schema Formats
Three serialization formats dominate the streaming ecosystem. Each handles schema differently.
| Feature | Avro | Protobuf | JSON Schema |
|---|---|---|---|
| Schema embedded in data | No (schema stored separately) | No (schema stored separately) | No (schema stored separately) |
| Schema language | Avro IDL / JSON | .proto files | JSON Schema spec |
| Binary format | Yes, compact | Yes, very compact | No, text-based |
| Default values | Supported | Implicit zero values (proto3) | Supported (annotation only) |
| Schema evolution | Excellent | Excellent | Limited |
| Human readability | Low (binary) | Low (binary) | High (JSON) |
| Kafka ecosystem support | Native, first-class | Strong (Confluent 5.5+) | Supported |
| Typical use case | Kafka-native pipelines | gRPC + Kafka hybrid | REST APIs feeding into Kafka |
Avro remains the most common choice in Kafka-centric architectures because it was designed alongside Kafka and has the deepest integration with the Confluent ecosystem. Protobuf is increasingly popular in organizations that already use gRPC for service-to-service communication. JSON Schema appeals to teams that prioritize human readability or have existing REST-based producers.
Regardless of the format, the schema registry workflow is the same: register the schema, get an ID, embed the ID in messages, retrieve the schema on the consumer side.
Compatibility Modes
Compatibility modes are the rules that govern what schema changes are allowed. They are the teeth of the contract. Choosing the right mode depends on how you deploy producer and consumer updates.
Backward Compatibility
New schemas can read data written with old schemas. This is the default mode in most registries and the safest choice for most teams.
Allowed changes:
- Add a new field with a default value
- Remove a field
Blocked changes:
- Add a required field without a default
- Change a field’s type
Deployment order: Upgrade consumers first, then producers.
Example: adding an optional discount_code field with a default of null to an orders schema is backward compatible. Consumers running the new schema can still read old messages that lack the field - they simply use the default.
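The mechanics of that example can be sketched as a toy version of Avro's reader-side resolution. The resolve helper and the flat field-dict layout are illustrative simplifications; real Avro resolution also handles aliases, unions, and type promotions.

```python
# Toy sketch of Avro-style reader-side resolution: fields missing from an
# old record are filled from the reader schema's defaults.
def resolve(record, reader_fields):
    out = {}
    for field in reader_fields:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            # A required field with no default cannot be resolved - this is
            # exactly the change backward compatibility blocks.
            raise ValueError(f"no value or default for field {field['name']}")
    return out

# Reader is on Version 2 (has discount_code); the record was written with Version 1.
reader_v2 = [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string"},
    {"name": "discount_code", "type": ["null", "string"], "default": None},
]
old_record = {"order_id": "o-1", "amount": 19.99, "currency": "USD"}

new_view = resolve(old_record, reader_v2)
# new_view["discount_code"] is None: the default stands in for the missing field
```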
Forward Compatibility
Old schemas can read data written with new schemas. Useful when you need to upgrade producers before consumers.
Allowed changes:
- Add a field (old consumers ignore it)
- Remove a field that has a default
Blocked changes:
- Remove a required field without a default
- Change a field’s type
Deployment order: Upgrade producers first, then consumers.
Full Compatibility
Both backward and forward compatible. Changes must be readable by both old and new schemas simultaneously.
Allowed changes:
- Add a field with a default value
- Remove a field that has a default value
This is the most restrictive but safest mode. It guarantees that producers and consumers can be upgraded in any order.
Transitive Variants
Each mode has a transitive variant (e.g., BACKWARD_TRANSITIVE, FULL_TRANSITIVE) that checks compatibility not just against the latest schema version, but against all previous versions. Use transitive modes when consumers may be reading historical data or when you cannot guarantee all consumers are on a recent schema version.
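The rules above reduce to a simple invariant over field sets, which this toy checker illustrates. Function names and the flat field-dict layout are ours; a real registry walks the full Avro type tree rather than comparing name lists. A transitive mode is just the same check run against every stored version instead of only the latest.

```python
# Toy compatibility checks over flat field lists, mirroring the registry's
# BACKWARD / FORWARD / FULL rules for added and removed fields.
def is_backward_compatible(old_fields, new_fields):
    # New readers must handle old data: every field added by the new
    # schema needs a default; removed fields are fine.
    old_names = {f["name"] for f in old_fields}
    return all(f["name"] in old_names or "default" in f for f in new_fields)

def is_forward_compatible(old_fields, new_fields):
    # Old readers must handle new data: every field removed by the new
    # schema must have had a default; added fields are simply ignored.
    new_names = {f["name"] for f in new_fields}
    return all(f["name"] in new_names or "default" in f for f in old_fields)

def is_full_compatible(old_fields, new_fields):
    return (is_backward_compatible(old_fields, new_fields)
            and is_forward_compatible(old_fields, new_fields))

v1 = [{"name": "order_id"}, {"name": "amount"}]
v2 = v1 + [{"name": "discount_code", "default": None}]  # add with default
v3 = v1 + [{"name": "tax_region"}]                      # add, no default

assert is_full_compatible(v1, v2)           # safe in any mode
assert not is_backward_compatible(v1, v3)   # new required field: rejected
assert is_forward_compatible(v1, v3)        # old readers would just ignore it
```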
Schema Evolution in Practice
Not all schema changes carry the same risk. Here is a practical breakdown.
Adding a Column (Generally Safe)
Adding a new field with a default value is the most common and safest schema change. Under backward compatibility, old consumers ignore the new field. Under forward compatibility, old consumers use the default value.
Version 1:

```json
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string"}
  ]
}
```

Version 2 - adding discount_code with default:

```json
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string"},
    {"name": "discount_code", "type": ["null", "string"], "default": null}
  ]
}
```
Removing a Column (Depends on Mode)
Removing a field is backward compatible (new consumers do not need it) but not forward compatible unless the removed field had a default value. In full compatibility mode, you can only remove fields that have defaults.
Renaming a Column (Unsafe)
Renaming is effectively a remove-and-add operation. Most registries treat it as a breaking change under any compatibility mode. The safe approach is to add the new field, run both in parallel until all consumers have migrated, then remove the old field.
Changing a Field’s Type (Usually Unsafe)
Widening a type (e.g., int to long) is allowed in Avro because the specification defines it as a type promotion. Narrowing (e.g., long to int) or changing categories (e.g., string to int) is always a breaking change.
Avro type promotions (safe):
int → long → float → double
bytes → string
Any other type change requires a new field or a new topic.
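The promotion chain can be captured in a small lookup table. This is a sketch of the rule as stated above, keyed by writer type; function and table names are illustrative.

```python
# Sketch of Avro's reader-side type promotion rules: a writer type may be
# promoted to any type to its right in the chain, never narrowed.
PROMOTIONS = {
    "int":   {"long", "float", "double"},
    "long":  {"float", "double"},
    "float": {"double"},
    "bytes": {"string"},
}

def can_read(writer_type, reader_type):
    return (writer_type == reader_type
            or reader_type in PROMOTIONS.get(writer_type, set()))

assert can_read("int", "long")        # widening: safe
assert not can_read("long", "int")    # narrowing: breaking change
assert not can_read("string", "int")  # category change: breaking change
```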
How Schema Registry Works with Kafka
The integration between a schema registry and Kafka is surprisingly simple. It relies on a convention: the first few bytes of every Kafka message value contain a schema ID rather than embedding the full schema.
Producer Flow
- The producer serializes a record using Avro (or Protobuf/JSON Schema).
- The serializer checks the registry: “Does this schema already exist for subject orders-value?”
- If yes, it gets back the schema ID. If no, it registers the schema (the compatibility check happens here) and gets a new ID.
- The serializer prepends a magic byte (0x00) and the 4-byte schema ID to the serialized payload.
- The message is sent to Kafka.
Consumer Flow
- The consumer reads a message from Kafka.
- The deserializer reads the magic byte and extracts the 4-byte schema ID.
- It fetches the corresponding schema from the registry (cached locally after the first fetch).
- It deserializes the payload using that schema and the consumer’s reader schema.
- Schema resolution handles any differences (missing fields get defaults, extra fields are ignored).
Message wire format:
```
┌──────────┬───────────┬──────────────────────┐
│   0x00   │ Schema ID │  Serialized payload  │
│ (1 byte) │ (4 bytes) │  (variable length)   │
└──────────┴───────────┴──────────────────────┘
```
This design means the schema is stored once in the registry rather than in every message, dramatically reducing message size compared to embedding the schema inline.
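The framing is small enough to sketch with Python's struct module. The frame and unframe names are illustrative; the layout (magic byte 0x00, big-endian 4-byte schema ID, then the payload) matches the diagram above.

```python
import struct

# Sketch of the Confluent wire format: 1-byte magic (0x00), 4-byte
# big-endian schema ID, then the serialized payload.
def frame(schema_id: int, payload: bytes) -> bytes:
    return struct.pack(">bI", 0, schema_id) + payload

def unframe(message: bytes):
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != 0:
        raise ValueError(f"unknown magic byte: {magic}")
    return schema_id, message[5:]

# The payload bytes here are arbitrary stand-ins for an Avro-encoded body.
msg = frame(42, b"\x02\x0aorder")
schema_id, payload = unframe(msg)
# schema_id == 42; payload is the untouched serialized body
```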
CDC and Schema Evolution
Change data capture introduces a unique schema evolution challenge. In a typical CDC pipeline, the schema is not defined by application developers writing Avro files - it is derived from the source database. When someone runs an ALTER TABLE statement, the pipeline schema must change to match.
Consider this sequence:
```sql
-- Original table
CREATE TABLE orders (
  order_id   VARCHAR(36) PRIMARY KEY,
  amount     DECIMAL(10,2),
  currency   CHAR(3),
  created_at TIMESTAMP
);

-- Six months later, a developer adds a column
ALTER TABLE orders ADD COLUMN discount_code VARCHAR(50) DEFAULT NULL;

-- Three months after that, a type gets widened
ALTER TABLE orders MODIFY COLUMN amount DECIMAL(12,4);
```
Each of these DDL statements changes the schema of the CDC events flowing into your Kafka topics. A well-designed CDC pipeline detects these changes automatically by monitoring the database’s transaction log or metadata, registers the new schema version with the registry, and continues streaming without interruption.
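One illustrative way to surface such drift is to diff the column metadata captured when the pipeline was configured against what the database reports now. This is a toy sketch with names of our choosing; real CDC connectors (Debezium, for example) typically read DDL events out of the transaction log rather than polling metadata.

```python
# Toy drift detector: compare a known column map against the current one.
def diff_columns(known: dict, current: dict):
    added = {c: t for c, t in current.items() if c not in known}
    changed = {c: (known[c], t) for c, t in current.items()
               if c in known and known[c] != t}
    return added, changed

known = {"order_id": "VARCHAR(36)", "amount": "DECIMAL(10,2)",
         "currency": "CHAR(3)", "created_at": "TIMESTAMP"}
# After the two ALTER TABLE statements above:
current = dict(known, amount="DECIMAL(12,4)", discount_code="VARCHAR(50)")

added, changed = diff_columns(known, current)
# added == {"discount_code": "VARCHAR(50)"}
# changed == {"amount": ("DECIMAL(10,2)", "DECIMAL(12,4)")}
```

Once a change is detected, the pipeline registers the corresponding new schema version with the registry and propagates the change downstream.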
Streamkap handles this entire flow automatically. When a source database schema changes - a new column appears, a type widens - Streamkap detects the change, updates the pipeline schema, and propagates the modification to destination tables. New columns are added, types are adjusted, and the pipeline continues without manual intervention or restarts. This eliminates the most common operational burden in CDC: the on-call engineer who gets paged at 2 a.m. because someone added a column and the pipeline does not know about it.
Practical Example: E-Commerce Orders Schema Over Time
Let us trace a realistic example of an orders schema evolving over the life of a product.
Version 1 (Launch day):
```json
{
  "type": "record",
  "name": "Order",
  "namespace": "com.example.ecommerce",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "amount", "type": "int", "doc": "Amount in cents"},
    {"name": "currency", "type": "string"},
    {"name": "created_at", "type": "long", "doc": "Unix timestamp millis"}
  ]
}
```
Version 2 (Month 3 - adding discount support):
The team adds a nullable discount_code field. This is backward compatible because it has a default value. Consumers on Version 1 simply see null for the new field.
```json
{"name": "discount_code", "type": ["null", "string"], "default": null}
```
Version 3 (Month 8 - widening amount type):
Order values start exceeding the int range. The team widens amount from int to long. In Avro, int to long is a safe promotion. The schema registry accepts this change under backward compatibility because Avro readers can automatically promote int values to long.
```json
{"name": "amount", "type": "long", "doc": "Amount in cents"}
```
Version 4 (Month 14 - adding shipping address):
The team adds a nested shipping_address record with a default of null. All existing consumers continue to work. New consumers that care about shipping addresses can start reading the field immediately.
```json
{
  "name": "shipping_address",
  "type": ["null", {
    "type": "record",
    "name": "Address",
    "fields": [
      {"name": "street", "type": "string"},
      {"name": "city", "type": "string"},
      {"name": "country", "type": "string"},
      {"name": "postal_code", "type": "string"}
    ]
  }],
  "default": null
}
```
Each of these changes was registered with the schema registry, checked for compatibility, and deployed without any consumer downtime. That is the value of the contract.
Schema Registry Options
Several schema registry implementations are available, each with different trade-offs.
Confluent Schema Registry is the original and most widely deployed. It supports Avro, Protobuf, and JSON Schema, integrates natively with Confluent Platform and Confluent Cloud, and provides a REST API for schema management. It uses Kafka itself as its storage backend, which simplifies operations.
AWS Glue Schema Registry is a managed option for teams running on AWS. It integrates with Amazon MSK (Managed Streaming for Apache Kafka), Kinesis Data Streams, and the broader AWS ecosystem, and supports Avro, JSON Schema, and Protobuf. The managed nature eliminates operational overhead, but it ties you to the AWS ecosystem.
Apicurio Registry is an open-source alternative from Red Hat. It supports Avro, Protobuf, JSON Schema, GraphQL, and more. It can use Kafka, SQL databases, or in-memory storage as its backend. It is a strong choice for teams that want registry functionality without the Confluent licensing.
Karapace is another open-source option, originally developed by Aiven. It provides API compatibility with the Confluent Schema Registry, meaning you can swap it in as a drop-in replacement. It stores schemas in Kafka topics, just like the Confluent registry.
For most teams starting out, the Confluent Schema Registry - either self-hosted or via Confluent Cloud - is the safest default. It has the largest community, the most documentation, and the deepest integration with Kafka tooling.
Best Practices
Default to backward compatibility. It is the right choice for the vast majority of use cases. It lets you upgrade consumers first, which is the natural deployment order for most teams. Only switch to forward or full compatibility if you have a specific operational reason.
Test schema changes in staging. Before registering a new schema version in production, test it against your staging registry. Verify that consumers on the previous version can still deserialize data written with the new version. Automate this in CI.
Use a subject naming strategy. The default TopicNameStrategy (subject = topic name + -value or -key) works well for most cases. Use RecordNameStrategy if multiple record types share a topic, or TopicRecordNameStrategy for multi-type topics where you still want per-topic governance.
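The three strategies differ only in how the subject string is derived, which a quick sketch makes concrete. The function is illustrative; the subject formats follow the Confluent conventions for value schemas.

```python
# Sketch of the three subject naming strategies for value schemas.
def subject_name(strategy: str, topic: str, record_fqn: str) -> str:
    if strategy == "TopicNameStrategy":
        return f"{topic}-value"          # one schema lineage per topic
    if strategy == "RecordNameStrategy":
        return record_fqn                # one lineage per record type, any topic
    if strategy == "TopicRecordNameStrategy":
        return f"{topic}-{record_fqn}"   # per record type, scoped to a topic
    raise ValueError(f"unknown strategy: {strategy}")

fqn = "com.example.ecommerce.Order"
assert subject_name("TopicNameStrategy", "orders", fqn) == "orders-value"
assert subject_name("RecordNameStrategy", "orders", fqn) == fqn
assert subject_name("TopicRecordNameStrategy", "orders", fqn) == \
       "orders-com.example.ecommerce.Order"
```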
Never remove the compatibility check. It is tempting to set compatibility to NONE when you want to make a breaking change quickly. Resist this. Instead, create a new topic with the new schema and migrate consumers. The short-term convenience of disabling compatibility creates long-term operational debt.
Document your schemas. Avro supports a doc field on both records and individual fields. Use it. Six months from now, when someone is debugging a pipeline issue at midnight, that documentation is the difference between a 10-minute fix and a 2-hour investigation.
Version your schema files in source control. Treat .avsc, .proto, and .json schema files the same way you treat database migration files - checked into version control, reviewed in pull requests, and deployed through a pipeline. The schema registry is the runtime enforcement; source control is the audit trail.
Set up schema change alerts. Configure your registry to notify your team (via Slack, PagerDuty, or your alerting tool of choice) whenever a new schema version is registered. This creates visibility into schema evolution across the organization and prevents surprises.
These practices, combined with a schema registry, transform schema management from a reactive firefighting exercise into a proactive, governed process. The result is streaming pipelines that handle change gracefully instead of breaking under it.