Schema Registry in Stream Processing: Why Your Streams Need a Contract
Learn why schema registry is essential for production streaming pipelines. Understand schema evolution, compatibility modes, and how to prevent breaking changes in real-time data.
Every streaming pipeline makes an implicit promise: the data a producer sends is the data a consumer expects to receive. When that promise holds, everything works. When it breaks - when a producer starts sending a field that used to be an integer as a string, or drops a column a downstream model depends on - things fail. Sometimes loudly. More often, silently.
Schema drift is one of the most common causes of production streaming pipeline failures; ask most data engineering teams and they will tell you that a large share of their pipeline incidents trace back to unexpected schema changes. The fix is not more vigilant engineers or better runbooks. The fix is a contract - an enforceable, versioned agreement about what data looks like. That contract is called a schema, and the service that manages it is a schema registry.
This guide walks through what a schema registry does, why it matters, how it integrates with Kafka and CDC pipelines, and how to use it effectively in production.
What a Schema Registry Does
A schema registry is a centralized service that stores, versions, and enforces schemas for streaming data. It sits alongside your message broker (typically Kafka) and acts as the single source of truth for the structure of every topic.
The core responsibilities are straightforward:
- Store schemas. Every schema is registered under a subject (usually tied to a Kafka topic) and assigned a unique numeric ID.
- Version schemas. When a schema changes, the registry stores the new version alongside all previous versions, creating a complete history.
- Enforce compatibility. Before a new schema version is accepted, the registry checks it against the existing versions using configurable compatibility rules. If the change would break consumers, the registration is rejected.
This means that a producer cannot push data into a topic unless the data’s schema passes the registry’s compatibility check. Consumers retrieve the schema by ID when deserializing, so they always know exactly what structure to expect.
Without a registry, the only way to discover a breaking schema change is when something downstream fails. With a registry, the breaking change is caught at the moment of registration - before a single bad record enters the pipeline.
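The registry's core bookkeeping can be sketched in a few lines. This is an illustrative in-memory model, not the API of any real registry client: subjects map to version lists, and every distinct schema gets a globally unique numeric ID. A production registry would also run the compatibility check inside register before accepting a new version.

```python
# Illustrative in-memory model of a schema registry's bookkeeping.
# Class and method names are ours, not any real client library's.
class InMemorySchemaRegistry:
    def __init__(self):
        self._next_id = 1
        self._subjects = {}   # subject -> list of (version, schema_id, schema)
        self._by_id = {}      # schema_id -> schema

    def register(self, subject, schema):
        versions = self._subjects.setdefault(subject, [])
        # Re-registering an identical schema returns the existing ID.
        for _, schema_id, existing in versions:
            if existing == schema:
                return schema_id
        schema_id = self._next_id
        self._next_id += 1
        versions.append((len(versions) + 1, schema_id, schema))
        self._by_id[schema_id] = schema
        return schema_id

    def get_by_id(self, schema_id):
        return self._by_id[schema_id]

    def latest(self, subject):
        return self._subjects[subject][-1]   # (version, schema_id, schema)

registry = InMemorySchemaRegistry()
v1_id = registry.register("orders-value", '{"type": "record", "name": "Order"}')
v2_id = registry.register("orders-value", '{"type": "record", "name": "OrderV2"}')
```

Consumers only ever need get_by_id, which is why the lookup is cheap to cache: a schema ID never changes meaning once assigned.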
Schema Formats
Three serialization formats dominate the streaming ecosystem. Each handles schema differently.
| Feature | Avro | Protobuf | JSON Schema |
|---|---|---|---|
| Schema embedded in data | No (schema stored separately) | No (schema stored separately) | No (schema stored separately) |
| Schema language | Avro IDL / JSON | .proto files | JSON Schema spec |
| Binary format | Yes, compact | Yes, very compact | No, text-based |
| Default values | Supported | Implicit zero values (proto3) | Supported (annotation only) |
| Schema evolution | Excellent | Excellent | Limited |
| Human readability | Low (binary) | Low (binary) | High (JSON) |
| Kafka ecosystem support | Native, first-class | Strong (Confluent 5.5+) | Supported |
| Typical use case | Kafka-native pipelines | gRPC + Kafka hybrid | REST APIs feeding into Kafka |
Avro remains the most common choice in Kafka-centric architectures because it was designed alongside Kafka and has the deepest integration with the Confluent ecosystem. Protobuf is increasingly popular in organizations that already use gRPC for service-to-service communication. JSON Schema appeals to teams that prioritize human readability or have existing REST-based producers.
Regardless of the format, the schema registry workflow is the same: register the schema, get an ID, embed the ID in messages, retrieve the schema on the consumer side.
Compatibility Modes
Compatibility modes are the rules that govern what schema changes are allowed. They are the teeth of the contract. Choosing the right mode depends on how you deploy producer and consumer updates.
Backward Compatibility
New schemas can read data written with old schemas. This is the default mode in most registries and the safest choice for most teams.
Allowed changes:
- Add a new field with a default value
- Remove a field
Blocked changes:
- Add a required field without a default
- Change a field’s type
Deployment order: Upgrade consumers first, then producers.
Example: adding an optional discount_code field with a default of null to an orders schema is backward compatible. Consumers running the new schema can still read old messages that lack the field - they simply use the default.
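The mechanics of that example can be sketched as a toy version of Avro's reader-side resolution. The resolve helper and the flat field-dict layout are illustrative simplifications; real Avro resolution also handles aliases, unions, and type promotions.

```python
# Toy sketch of Avro-style reader-side resolution: fields missing from an
# old record are filled from the reader schema's defaults.
def resolve(record, reader_fields):
    out = {}
    for field in reader_fields:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            # A required field with no default cannot be resolved - this is
            # exactly the change backward compatibility blocks.
            raise ValueError(f"no value or default for field {field['name']}")
    return out

# Reader is on Version 2 (has discount_code); the record was written with Version 1.
reader_v2 = [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string"},
    {"name": "discount_code", "type": ["null", "string"], "default": None},
]
old_record = {"order_id": "o-1", "amount": 19.99, "currency": "USD"}

new_view = resolve(old_record, reader_v2)
# new_view["discount_code"] is None: the default stands in for the missing field
```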
Forward Compatibility
Old schemas can read data written with new schemas. Useful when you need to upgrade producers before consumers.
Allowed changes:
- Add a field (old consumers ignore it)
- Remove a field that has a default
Blocked changes:
- Remove a required field without a default
- Change a field’s type
Deployment order: Upgrade producers first, then consumers.
Full Compatibility
Both backward and forward compatible. Changes must be readable by both old and new schemas simultaneously.
Allowed changes:
- Add a field with a default value
- Remove a field that has a default value
This is the most restrictive but safest mode. It guarantees that producers and consumers can be upgraded in any order.
Transitive Variants
Each mode has a transitive variant (e.g., BACKWARD_TRANSITIVE, FULL_TRANSITIVE) that checks compatibility not just against the latest schema version, but against all previous versions. Use transitive modes when consumers may be reading historical data or when you cannot guarantee all consumers are on a recent schema version.
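The rules above reduce to a simple invariant over field sets, which this toy checker illustrates. Function names and the flat field-dict layout are ours; a real registry walks the full Avro type tree rather than comparing name lists. A transitive mode is just the same check run against every stored version instead of only the latest.

```python
# Toy compatibility checks over flat field lists, mirroring the registry's
# BACKWARD / FORWARD / FULL rules for added and removed fields.
def is_backward_compatible(old_fields, new_fields):
    # New readers must handle old data: every field added by the new
    # schema needs a default; removed fields are fine.
    old_names = {f["name"] for f in old_fields}
    return all(f["name"] in old_names or "default" in f for f in new_fields)

def is_forward_compatible(old_fields, new_fields):
    # Old readers must handle new data: every field removed by the new
    # schema must have had a default; added fields are simply ignored.
    new_names = {f["name"] for f in new_fields}
    return all(f["name"] in new_names or "default" in f for f in old_fields)

def is_full_compatible(old_fields, new_fields):
    return (is_backward_compatible(old_fields, new_fields)
            and is_forward_compatible(old_fields, new_fields))

v1 = [{"name": "order_id"}, {"name": "amount"}]
v2 = v1 + [{"name": "discount_code", "default": None}]  # add with default
v3 = v1 + [{"name": "tax_region"}]                      # add, no default

assert is_full_compatible(v1, v2)           # safe in any mode
assert not is_backward_compatible(v1, v3)   # new required field: rejected
assert is_forward_compatible(v1, v3)        # old readers would just ignore it
```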
Schema Evolution in Practice
Not all schema changes carry the same risk. Here is a practical breakdown.
Adding a Column (Generally Safe)
Adding a new field with a default value is the most common and safest schema change. Under backward compatibility, old consumers ignore the new field. Under forward compatibility, old consumers use the default value.
Version 1:

```json
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string"}
  ]
}
```

Version 2 - adding discount_code with default:

```json
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string"},
    {"name": "discount_code", "type": ["null", "string"], "default": null}
  ]
}
```
Removing a Column (Depends on Mode)
Removing a field is backward compatible (new consumers do not need it) but not forward compatible unless the removed field had a default value. In full compatibility mode, you can only remove fields that have defaults.
Renaming a Column (Unsafe)
Renaming is effectively a remove-and-add operation. Most registries treat it as a breaking change under any compatibility mode. The safe approach is to add the new field, run both in parallel until all consumers have migrated, then remove the old field.
Changing a Field’s Type (Usually Unsafe)
Widening a type (e.g., int to long) is allowed in Avro because the specification defines it as a type promotion. Narrowing (e.g., long to int) or changing categories (e.g., string to int) is always a breaking change.
Avro type promotions (safe):
int → long → float → double
bytes → string
Any other type change requires a new field or a new topic.
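The promotion chain can be captured in a small lookup table. This is a sketch of the rule as stated above, keyed by writer type; function and table names are illustrative.

```python
# Sketch of Avro's reader-side type promotion rules: a writer type may be
# promoted to any type to its right in the chain, never narrowed.
PROMOTIONS = {
    "int":   {"long", "float", "double"},
    "long":  {"float", "double"},
    "float": {"double"},
    "bytes": {"string"},
}

def can_read(writer_type, reader_type):
    return (writer_type == reader_type
            or reader_type in PROMOTIONS.get(writer_type, set()))

assert can_read("int", "long")        # widening: safe
assert not can_read("long", "int")    # narrowing: breaking change
assert not can_read("string", "int")  # category change: breaking change
```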
How Schema Registry Works with Kafka
The integration between a schema registry and Kafka is surprisingly simple. It relies on a convention: the first few bytes of every Kafka message value contain a schema ID rather than embedding the full schema.
Producer Flow
- The producer serializes a record using Avro (or Protobuf/JSON Schema).
- The serializer checks the registry: “Does this schema already exist for subject orders-value?”
- If yes, it gets back the schema ID. If no, it registers the schema (the compatibility check happens here) and gets a new ID.
- The serializer prepends a magic byte (0x00) and the 4-byte schema ID to the serialized payload.
- The message is sent to Kafka.
Consumer Flow
- The consumer reads a message from Kafka.
- The deserializer reads the magic byte and extracts the 4-byte schema ID.
- It fetches the corresponding schema from the registry (cached locally after the first fetch).
- It deserializes the payload using that schema and the consumer’s reader schema.
- Schema resolution handles any differences (missing fields get defaults, extra fields are ignored).
Message wire format:
```
┌──────────┬───────────┬──────────────────────┐
│   0x00   │ Schema ID │  Serialized payload  │
│ (1 byte) │ (4 bytes) │  (variable length)   │
└──────────┴───────────┴──────────────────────┘
```
This design means the schema is stored once in the registry rather than in every message, dramatically reducing message size compared to embedding the schema inline.
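The framing is small enough to sketch with Python's struct module. The frame and unframe names are illustrative; the layout (magic byte 0x00, big-endian 4-byte schema ID, then the payload) matches the diagram above.

```python
import struct

# Sketch of the Confluent wire format: 1-byte magic (0x00), 4-byte
# big-endian schema ID, then the serialized payload.
def frame(schema_id: int, payload: bytes) -> bytes:
    return struct.pack(">bI", 0, schema_id) + payload

def unframe(message: bytes):
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != 0:
        raise ValueError(f"unknown magic byte: {magic}")
    return schema_id, message[5:]

# The payload bytes here are arbitrary stand-ins for an Avro-encoded body.
msg = frame(42, b"\x02\x0aorder")
schema_id, payload = unframe(msg)
# schema_id == 42; payload is the untouched serialized body
```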
CDC and Schema Evolution
Change data capture introduces a unique schema evolution challenge. In a typical CDC pipeline, the schema is not defined by application developers writing Avro files - it is derived from the source database. When someone runs an ALTER TABLE statement, the pipeline schema must change to match.
Consider this sequence:
```sql
-- Original table
CREATE TABLE orders (
  order_id   VARCHAR(36) PRIMARY KEY,
  amount     DECIMAL(10,2),
  currency   CHAR(3),
  created_at TIMESTAMP
);

-- Six months later, a developer adds a column
ALTER TABLE orders ADD COLUMN discount_code VARCHAR(50) DEFAULT NULL;

-- Three months after that, a type gets widened
ALTER TABLE orders MODIFY COLUMN amount DECIMAL(12,4);
```
Each of these DDL statements changes the schema of the CDC events flowing into your Kafka topics. A well-designed CDC pipeline detects these changes automatically by monitoring the database’s transaction log or metadata, registers the new schema version with the registry, and continues streaming without interruption.
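One illustrative way to surface such drift is to diff the column metadata captured when the pipeline was configured against what the database reports now. This is a toy sketch with names of our choosing; real CDC connectors (Debezium, for example) typically read DDL events out of the transaction log rather than polling metadata.

```python
# Toy drift detector: compare a known column map against the current one.
def diff_columns(known: dict, current: dict):
    added = {c: t for c, t in current.items() if c not in known}
    changed = {c: (known[c], t) for c, t in current.items()
               if c in known and known[c] != t}
    return added, changed

known = {"order_id": "VARCHAR(36)", "amount": "DECIMAL(10,2)",
         "currency": "CHAR(3)", "created_at": "TIMESTAMP"}
# After the two ALTER TABLE statements above:
current = dict(known, amount="DECIMAL(12,4)", discount_code="VARCHAR(50)")

added, changed = diff_columns(known, current)
# added == {"discount_code": "VARCHAR(50)"}
# changed == {"amount": ("DECIMAL(10,2)", "DECIMAL(12,4)")}
```

Once a change is detected, the pipeline registers the corresponding new schema version with the registry and propagates the change downstream.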
Streamkap handles this entire flow automatically. When a source database schema changes - a new column appears, a type widens - Streamkap detects the change, updates the pipeline schema, and propagates the modification to destination tables. New columns are added, types are adjusted, and the pipeline continues without manual intervention or restarts. This eliminates the most common operational burden in CDC: the on-call engineer who gets paged at 2 a.m. because someone added a column and the pipeline does not know about it.
Practical Example: E-Commerce Orders Schema Over Time
Let us trace a realistic example of an orders schema evolving over the life of a product.
Version 1 (Launch day):
```json
{
  "type": "record",
  "name": "Order",
  "namespace": "com.example.ecommerce",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "amount", "type": "int", "doc": "Amount in cents"},
    {"name": "currency", "type": "string"},
    {"name": "created_at", "type": "long", "doc": "Unix timestamp millis"}
  ]
}
```
Version 2 (Month 3 - adding discount support):
The team adds a nullable discount_code field. This is backward compatible because it has a default value. Consumers on Version 1 simply see null for the new field.
```json
{"name": "discount_code", "type": ["null", "string"], "default": null}
```
Version 3 (Month 8 - widening amount type):
Order values start exceeding the int range. The team widens amount from int to long. In Avro, int to long is a safe promotion. The schema registry accepts this change under backward compatibility because Avro readers can automatically promote int values to long.
```json
{"name": "amount", "type": "long", "doc": "Amount in cents"}
```
Version 4 (Month 14 - adding shipping address):
The team adds a nested shipping_address record with a default of null. All existing consumers continue to work. New consumers that care about shipping addresses can start reading the field immediately.
```json
{
  "name": "shipping_address",
  "type": ["null", {
    "type": "record",
    "name": "Address",
    "fields": [
      {"name": "street", "type": "string"},
      {"name": "city", "type": "string"},
      {"name": "country", "type": "string"},
      {"name": "postal_code", "type": "string"}
    ]
  }],
  "default": null
}
```
Each of these changes was registered with the schema registry, checked for compatibility, and deployed without any consumer downtime. That is the value of the contract.
Schema Registry Options
Several schema registry implementations are available, each with different trade-offs.
Confluent Schema Registry is the original and most widely deployed. It supports Avro, Protobuf, and JSON Schema, integrates natively with Confluent Platform and Confluent Cloud, and provides a REST API for schema management. It uses Kafka itself as its storage backend, which simplifies operations.
AWS Glue Schema Registry is a managed option for teams running on AWS. It integrates with Amazon MSK (Managed Streaming for Apache Kafka), Kinesis Data Streams, and the broader AWS ecosystem, and supports Avro, JSON Schema, and Protobuf. The managed nature eliminates operational overhead, but it ties you to the AWS ecosystem.
Apicurio Registry is an open-source alternative from Red Hat. It supports Avro, Protobuf, JSON Schema, GraphQL, and more. It can use Kafka, SQL databases, or in-memory storage as its backend. It is a strong choice for teams that want registry functionality without the Confluent licensing.
Karapace is another open-source option, originally developed by Aiven. It provides API compatibility with the Confluent Schema Registry, meaning you can swap it in as a drop-in replacement. It stores schemas in Kafka topics, just like the Confluent registry.
For most teams starting out, the Confluent Schema Registry - either self-hosted or via Confluent Cloud - is the safest default. It has the largest community, the most documentation, and the deepest integration with Kafka tooling.
Best Practices
Default to backward compatibility. It is the right choice for the vast majority of use cases. It lets you upgrade consumers first, which is the natural deployment order for most teams. Only switch to forward or full compatibility if you have a specific operational reason.
Test schema changes in staging. Before registering a new schema version in production, test it against your staging registry. Verify that consumers on the previous version can still deserialize data written with the new version. Automate this in CI.
Use a subject naming strategy. The default TopicNameStrategy (subject = topic name + -value or -key) works well for most cases. Use RecordNameStrategy if multiple record types share a topic, or TopicRecordNameStrategy for multi-type topics where you still want per-topic governance.
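The three strategies differ only in how the subject string is derived, which a quick sketch makes concrete. The function is illustrative; the subject formats follow the Confluent conventions for value schemas.

```python
# Sketch of the three subject naming strategies for value schemas.
def subject_name(strategy: str, topic: str, record_fqn: str) -> str:
    if strategy == "TopicNameStrategy":
        return f"{topic}-value"          # one schema lineage per topic
    if strategy == "RecordNameStrategy":
        return record_fqn                # one lineage per record type, any topic
    if strategy == "TopicRecordNameStrategy":
        return f"{topic}-{record_fqn}"   # per record type, scoped to a topic
    raise ValueError(f"unknown strategy: {strategy}")

fqn = "com.example.ecommerce.Order"
assert subject_name("TopicNameStrategy", "orders", fqn) == "orders-value"
assert subject_name("RecordNameStrategy", "orders", fqn) == fqn
assert subject_name("TopicRecordNameStrategy", "orders", fqn) == \
       "orders-com.example.ecommerce.Order"
```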
Never remove the compatibility check. It is tempting to set compatibility to NONE when you want to make a breaking change quickly. Resist this. Instead, create a new topic with the new schema and migrate consumers. The short-term convenience of disabling compatibility creates long-term operational debt.
Document your schemas. Avro supports a doc field on both records and individual fields. Use it. Six months from now, when someone is debugging a pipeline issue at midnight, that documentation is the difference between a 10-minute fix and a 2-hour investigation.
Version your schema files in source control. Treat .avsc, .proto, and .json schema files the same way you treat database migration files - checked into version control, reviewed in pull requests, and deployed through a pipeline. The schema registry is the runtime enforcement; source control is the audit trail.
Set up schema change alerts. Configure your registry to notify your team (via Slack, PagerDuty, or your alerting tool of choice) whenever a new schema version is registered. This creates visibility into schema evolution across the organization and prevents surprises.
These practices, combined with a schema registry, transform schema management from a reactive firefighting exercise into a proactive, governed process. The result is streaming pipelines that handle change gracefully instead of breaking under it.