Streaming Data Catalog: Documenting and Discovering Real-Time Data Assets
How to build a data catalog for streaming systems. Covers topic registries, schema registries, lineage metadata, discovery tools, and practical patterns for documenting real-time data assets in Kafka and Flink pipelines.
You have 200 Kafka topics. A new engineer joins the team. They need to find the topic that contains customer order events with payment information. What do they do?
If you do not have a streaming data catalog, the answer is: they search Confluent Control Center (or Redpanda Console, or whatever UI you have) for topic names that look promising. They find orders, order-events, payments.orders.v2, and commerce.public.orders. They do not know which is the canonical source vs. a derived topic vs. a deprecated legacy topic. They check the schema registry and find Avro schemas, but the field names are cryptic (amt_cents, cust_ref, ts_ms). They ask in Slack. Someone who joined three years ago remembers that commerce.public.orders is the CDC topic from the production database, order-events is a Flink-derived enriched stream, and payments.orders.v2 was created for a project that got cancelled but never cleaned up.
This is the data discovery problem, and it gets worse as your streaming platform grows. A data catalog solves it by providing a single place to find, understand, and trust your streaming data assets.
What a Streaming Data Catalog Contains
A data catalog for streaming systems needs to capture metadata at three levels:
1. Topic-Level Metadata
Each Kafka topic should have:
- Name and description: What events does this topic contain? In plain English.
- Owner: Which team is responsible for the topic’s content and availability?
- Schema: Link to the schema registry subject and current version.
- Partitioning: What key is used, how many partitions, and why that key was chosen.
- Retention: How long are events kept? Is compaction enabled?
- Data classification: Does the topic contain PII? PHI? Financial data? This determines who can access it and what compliance requirements apply.
- SLA: What is the expected freshness? What latency guarantees exist?
- Creation date and last modified date.
2. Schema-Level Metadata
The schema registry provides structural metadata (field names, types, compatibility mode), but not semantic metadata. A field called ts_ms is a BIGINT, but the catalog should document that it represents “the timestamp in milliseconds when the event occurred in the source database, in UTC.”
For each schema field:
- Description: What does this field mean in business terms?
- Source: Where does this value originate?
- Sensitivity: Is this field PII, and does it require masking for certain consumers?
- Nullability semantics: A null discount_code means “no discount applied,” not “unknown.”
3. Pipeline-Level Metadata (Lineage)
How do topics relate to each other? Which Flink jobs read from which topics and write to which others? This is lineage metadata, and without it, you cannot answer questions like:
- “If I change the schema of topic A, what downstream systems will break?”
- “Where does the data in this dashboard ultimately come from?”
- “This topic has stale data. Which upstream system is the source of the problem?”
We will cover lineage in more detail in our companion article on data lineage in streaming pipelines, but the catalog should at minimum show upstream and downstream connections for each topic.
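Even before you adopt a lineage tool, the question “what breaks downstream?” reduces to graph reachability over the catalog’s consumer entries. A minimal sketch, assuming a hypothetical edge list (the EDGES dict and topic names are illustrative, derived from the per-topic consumer lists described above):

```python
from collections import deque

# Hypothetical lineage edges derived from the catalog: for each topic,
# the topics produced from it by downstream jobs.
EDGES = {
    "commerce.public.orders": ["order-events-enriched"],
    "commerce.public.customers": ["order-events-enriched"],
    "order-events-enriched": ["analytics.orders-daily"],
}

def downstream_of(topic):
    """Every topic reachable from `topic` via lineage edges --
    i.e. everything a schema change could break."""
    seen, queue = set(), deque([topic])
    while queue:
        for child in EDGES.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# "If I change the schema of commerce.public.orders, what breaks?"
print(sorted(downstream_of("commerce.public.orders")))
# ['analytics.orders-daily', 'order-events-enriched']
```

The same traversal run in reverse answers the “which upstream system is the source of the problem?” question.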
The Schema Registry: Necessary but Not Sufficient
If you are using Kafka with Avro, Protobuf, or JSON Schema, you likely already have a schema registry (Confluent Schema Registry, Apicurio, or Karapace). This is the foundation of your catalog, but it is not the catalog itself.
What the Schema Registry Gives You
- Schema evolution management: Backward, forward, and full compatibility checks prevent producers from publishing breaking changes.
- Schema versioning: Every schema change creates a new version, and you can see the history.
- Type safety: Consumers can deserialize records against a known schema instead of parsing raw bytes.
- Cross-language support: Avro and Protobuf schemas work across Java, Python, Go, and other languages.
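All of this is exposed through the registry’s REST API. A hedged sketch of querying it with only the Python standard library (the localhost address and subject name are assumptions for illustration; the /subjects and /versions/latest endpoints are standard Confluent Schema Registry routes):

```python
import json
import urllib.request

REGISTRY = "http://localhost:8081"  # assumed Schema Registry address

def fetch_json(path):
    """GET a Schema Registry REST endpoint and decode the JSON body."""
    with urllib.request.urlopen(REGISTRY + path) as resp:
        return json.load(resp)

def list_subjects():
    # GET /subjects returns every registered subject name.
    return fetch_json("/subjects")

def latest_schema(subject):
    # GET /subjects/{subject}/versions/latest returns the current
    # version number, schema id, and schema definition.
    return fetch_json(f"/subjects/{subject}/versions/latest")

def avro_field_names(schema_json):
    """Top-level field names of an Avro record schema string."""
    return [f["name"] for f in json.loads(schema_json).get("fields", [])]
```

For example, latest_schema("commerce.public.orders-value")["schema"] yields the Avro definition, from which avro_field_names recovers the structural field list; note that nothing here tells you what those fields mean.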
What the Schema Registry Does Not Give You
- Business context: The schema says cust_ref is a STRING. The catalog says “this is the customer’s external reference ID, assigned by the billing system, unique per tenant.”
- Ownership: Who do you contact when this schema needs to change?
- Data classification: The schema does not flag which fields contain PII.
- Lineage: The schema registry does not know that topic-A is consumed by Flink job X to produce topic-B.
- Discoverability: Schema registries are searchable by subject name, not by “I need customer payment data.”
This is why you need a catalog layer on top of the schema registry.
Building a Catalog: Start Simple
You do not need a dedicated catalog tool on day one. Start with a documentation-as-code approach that scales with your team.
The YAML-per-Topic Approach
Create a directory in your pipeline repository where each topic has a metadata file:
streaming-catalog/
├── topics/
│ ├── commerce.public.orders.yaml
│ ├── commerce.public.customers.yaml
│ ├── order-events-enriched.yaml
│ └── payments.orders.v2.yaml # marked deprecated
└── schemas/
└── field-descriptions/
├── orders.yaml
└── customers.yaml
A topic metadata file:
# commerce.public.orders.yaml
name: commerce.public.orders
description: >
  CDC stream from the production orders table in the commerce PostgreSQL
  database. Contains every insert, update, and delete as a Debezium-formatted
  event. This is the canonical source of truth for order data.
owner: data-platform-team
contact: "#data-platform-support"
source:
  type: cdc
  database: commerce-production
  table: public.orders
  connector: streamkap-postgres-commerce
schema:
  registry_subject: commerce.public.orders-value
  format: avro
  compatibility: backward
partitioning:
  key: order_id
  count: 24
  rationale: "Primary key partitioning for per-order ordering"
retention:
  policy: delete
  duration: 7d
  compaction: false
classification:
  pii_fields: [customer_email, shipping_address]
  data_tier: critical
sla:
  max_latency: 30s
  availability: 99.9%
consumers:
  - name: order-enrichment-flink-job
    team: data-platform-team
    purpose: "Enrich orders with customer data, write to order-events-enriched"
  - name: snowflake-orders-sync
    team: analytics-team
    purpose: "Load orders into Snowflake for analytics"
status: active
created: 2025-03-15
last_reviewed: 2026-01-10
A field descriptions file:
# schemas/field-descriptions/orders.yaml
subject: commerce.public.orders-value
fields:
  order_id:
    description: "Unique identifier for the order, auto-generated by PostgreSQL"
    source: "commerce.public.orders.id (BIGSERIAL)"
    pii: false
  customer_email:
    description: "Email address of the customer who placed the order"
    source: "Joined from commerce.public.customers at write time"
    pii: true
    masking: "Hash for non-production consumers"
  amt_cents:
    description: "Order total in cents (USD). Divide by 100 for dollar amount."
    source: "Calculated at checkout: sum of line items + tax - discounts"
    pii: false
  ts_ms:
    description: "Timestamp when the event occurred in the source database, in milliseconds since epoch, UTC"
    source: "Debezium source.ts_ms field"
    pii: false
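With field documentation in version control, you can also detect drift between the registry schema and the docs. A small sketch (undocumented_fields and the inline example data are hypothetical; in practice the schema string would come from the registry and the documented set from yaml.safe_load of the field-descriptions file):

```python
import json

def undocumented_fields(avro_schema_json, documented_fields):
    """Schema fields that have no entry in the field-descriptions file."""
    schema = json.loads(avro_schema_json)
    return [f["name"] for f in schema.get("fields", [])
            if f["name"] not in documented_fields]

# Example data mirroring the files above.
ORDERS_SCHEMA = json.dumps({
    "type": "record", "name": "Order",
    "fields": [{"name": "order_id", "type": "long"},
               {"name": "customer_email", "type": "string"},
               {"name": "amt_cents", "type": "long"},
               {"name": "ts_ms", "type": "long"},
               {"name": "cust_ref", "type": "string"}],
})
DOCUMENTED = {"order_id", "customer_email", "amt_cents", "ts_ms"}

print(undocumented_fields(ORDERS_SCHEMA, DOCUMENTED))  # ['cust_ref']
```

Run on a schedule, a check like this surfaces fields that were added to a schema but never described.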
Validation in CI
Add a CI step that validates your catalog files:
# validate_catalog.py
import sys
from pathlib import Path

import yaml

REQUIRED_FIELDS = ['name', 'description', 'owner', 'source', 'schema',
                   'partitioning', 'retention', 'classification', 'status']

errors = []
topic_files = list(Path('streaming-catalog/topics').glob('*.yaml'))
for path in topic_files:
    with open(path) as f:
        topic = yaml.safe_load(f)
    for field in REQUIRED_FIELDS:
        if field not in topic:
            errors.append(f"{path.name}: missing required field '{field}'")

if errors:
    for e in errors:
        print(f"ERROR: {e}")
    sys.exit(1)

print(f"Validated {len(topic_files)} topic files")
This ensures that every topic has minimum required documentation before a pipeline change can be merged.
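The same CI step can enforce cross-file consistency, for example that every field listed under classification.pii_fields is also documented with pii: true at the field level. A hedged sketch (check_pii_consistency is a hypothetical helper operating on the parsed YAML dicts):

```python
def check_pii_consistency(topic_meta, field_docs):
    """Every field in classification.pii_fields must be documented
    with pii: true in the field-descriptions file."""
    errors = []
    documented = field_docs.get("fields", {})
    for field in topic_meta.get("classification", {}).get("pii_fields", []):
        doc = documented.get(field)
        if doc is None:
            errors.append(f"PII field '{field}' has no field-level documentation")
        elif not doc.get("pii"):
            errors.append(f"field '{field}' listed as PII but documented pii: false")
    return errors

# Example mirroring the files above: shipping_address is classified
# as PII but missing from the field-descriptions file.
topic = {"classification": {"pii_fields": ["customer_email", "shipping_address"]}}
docs = {"fields": {"customer_email": {"pii": True}}}
print(check_pii_consistency(topic, docs))
```

Failing the build on these errors keeps the topic-level classification and the field-level documentation from drifting apart.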
When to Graduate to a Catalog Tool
The YAML approach works well up to 50-100 topics with 5-10 teams. Beyond that, you hit limitations:
- Search: You need full-text search across descriptions, not just filenames.
- Lineage visualization: YAML files can list consumers, but you need a graph view to understand multi-hop dependencies.
- Automated metadata: Manually updating consumer lists is unreliable. You need automated discovery.
- Access control: Different teams need different views of the catalog.
At this point, consider a dedicated tool.
Catalog Tools for Streaming
DataHub
DataHub (by LinkedIn, open source) has first-class support for Kafka topics, schema registries, and Flink jobs. It can:
- Automatically ingest topic metadata and schemas from Kafka and the schema registry
- Display lineage between topics, Flink jobs, and downstream warehouses
- Support custom metadata aspects for fields like PII classification and SLAs
- Provide a search interface that indexes descriptions and tags
DataHub’s Kafka metadata ingestion pulls topic configuration, schema registry subjects, and consumer group information on a scheduled basis. You supplement this with manual annotations for business context.
OpenMetadata
OpenMetadata is another open-source option with Kafka and Flink connectors. It focuses on a collaborative, wiki-like experience where teams annotate assets directly in the UI. It supports:
- Topic and schema ingestion
- Data quality metadata (freshness, volume, schema drift)
- Glossary terms that map business concepts to technical assets
Confluent Stream Catalog
If you use Confluent Cloud, Stream Catalog is built in. It integrates directly with Schema Registry and provides business metadata, tagging, and search. The trade-off is vendor lock-in: it only catalogs assets within Confluent.
Practical Patterns for Catalog Maintenance
Topic Naming Conventions
A consistent naming convention is the cheapest form of documentation. If every CDC topic follows the pattern {database}.{schema}.{table} and every derived topic follows {domain}.{processing-step}.{version}, engineers can parse topic names without looking anything up.
Examples:
# CDC topics (Streamkap convention)
commerce.public.orders
commerce.public.customers
billing.public.invoices
# Derived topics
analytics.orders-enriched.v2
ml.customer-features.v1
notifications.order-events.v1
Document the convention in your catalog’s README, and enforce it with a topic creation policy (Kafka topic creation ACLs or a self-service portal that validates names).
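Such a policy check can be a few lines of code. A sketch, assuming the two conventions above (the regexes are illustrative and deliberately strict; adapt the character classes to your own naming rules):

```python
import re

# Assumed conventions: {database}.{schema}.{table} for CDC topics,
# {domain}.{processing-step}.{version} for derived topics.
CDC = re.compile(r"^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*$")
DERIVED = re.compile(r"^[a-z][a-z0-9]*\.[a-z][a-z0-9-]*\.v\d+$")

def valid_topic_name(name):
    """True if the name matches either the CDC or the derived-topic convention."""
    return bool(CDC.match(name) or DERIVED.match(name))

print(valid_topic_name("commerce.public.orders"))       # True
print(valid_topic_name("analytics.orders-enriched.v2")) # True
print(valid_topic_name("OrderEvents"))                  # False
```

Wired into a self-service topic-creation portal, this rejects nonconforming names before they reach the cluster.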
Ownership Assignment
Every topic must have an owner. The owner is responsible for:
- Responding to questions about the topic’s content and semantics
- Reviewing and approving schema changes
- Maintaining freshness SLAs
- Deciding on deprecation and removal
Use team ownership, not individual ownership. People leave, change teams, and go on vacation. A team-owned topic always has someone available.
Deprecation Workflow
When a topic needs to be retired:
- Mark it as status: deprecated in the catalog with a deprecation date and migration path.
- Add a deprecation notice to the topic description: “Deprecated as of 2026-02-01. Use commerce.orders-enriched.v2 instead.”
- Notify all listed consumers via their team channels.
- Set a sunset date. On that date, reduce retention to 1 hour, then delete.
- Remove the catalog entry (or move it to an archive section).
Without this workflow, deprecated topics linger indefinitely, confusing new engineers who discover them and spend time understanding data that nobody uses.
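The sunset-date step is easy to automate: a scheduled job can scan the catalog for deprecated entries whose date has passed. A minimal sketch, assuming each entry carries a hypothetical sunset field in ISO format:

```python
from datetime import date

def topics_past_sunset(entries, today):
    """Deprecated catalog entries whose sunset date has already passed."""
    return [e["name"] for e in entries
            if e.get("status") == "deprecated"
            and date.fromisoformat(e.get("sunset", "9999-12-31")) <= today]

# Example catalog entries (illustrative).
entries = [
    {"name": "payments.orders.v2", "status": "deprecated", "sunset": "2026-03-01"},
    {"name": "commerce.public.orders", "status": "active"},
]
print(topics_past_sunset(entries, date(2026, 4, 1)))  # ['payments.orders.v2']
```

The job's output can open a ticket or page the owning team, so retirement actually happens instead of lingering as an intention.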
Automated Freshness Monitoring
A catalog entry that says “SLA: max 30s latency” is only useful if you verify it. Connect your monitoring system to the catalog:
# Check freshness SLA for each catalog topic.
# Illustrative sketch: catalog, kafka_admin, now, and alert are
# stand-ins for your own catalog client, Kafka admin wrapper, and pager.
for topic in catalog.get_active_topics():
    latest_offset_time = kafka_admin.get_latest_timestamp(topic.name)
    age_seconds = (now() - latest_offset_time).total_seconds()
    if age_seconds > topic.sla.max_latency_seconds:
        alert(f"Topic {topic.name} exceeds SLA: "
              f"{age_seconds}s > {topic.sla.max_latency_seconds}s")
Streamkap provides pipeline monitoring that tracks ingestion latency and consumer lag for each connector. This data can feed into your catalog’s freshness metrics, giving you a single view of whether each data asset meets its SLA.
Making the Catalog Useful
A catalog that nobody reads is worse than no catalog because it creates a false sense of documentation. To make it useful:
Make it the first stop for onboarding. When a new engineer joins, their onboarding checklist should include “browse the streaming catalog and identify the topics your team owns and consumes.”
Link from code to catalog. In your Flink job configurations and Kafka consumer configs, add comments with catalog URLs:
// Catalog: https://catalog.internal/topics/commerce.public.orders
// Owner: data-platform-team (#data-platform-support)
Properties props = new Properties();
props.setProperty("topic", "commerce.public.orders");
Make it searchable. Engineers should be able to type “customer payment” and find the relevant topic, even if the topic name is commerce.public.orders. This requires indexed descriptions, tags, and field-level documentation.
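Even the YAML-per-topic approach can support this with a naive keyword search over loaded entries. A sketch (search_catalog and the tags field are assumptions for illustration; dedicated catalog tools use a proper search index):

```python
def search_catalog(entries, query):
    """Naive AND-of-terms substring search over names, descriptions, and tags."""
    terms = query.lower().split()
    hits = []
    for e in entries:
        haystack = " ".join([e.get("name", ""), e.get("description", ""),
                             *e.get("tags", [])]).lower()
        if all(term in haystack for term in terms):
            hits.append(e["name"])
    return hits

# Illustrative entries loaded from the catalog YAML files.
entries = [
    {"name": "commerce.public.orders",
     "description": "Canonical CDC stream of customer orders",
     "tags": ["payments", "pii"]},
    {"name": "ml.customer-features.v1",
     "description": "Feature vectors per customer",
     "tags": []},
]
print(search_catalog(entries, "customer payment"))  # ['commerce.public.orders']
```

The query matches the orders topic through its description and tags, not its name, which is exactly the behavior engineers need.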
Review quarterly. Schedule a quarterly review where each team verifies their catalog entries are still accurate. Update descriptions, consumer lists, and SLAs. Archive deprecated entries. This is 30 minutes per team per quarter and prevents catalog rot.
The streaming data catalog is not glamorous infrastructure. It does not improve throughput or reduce latency. But it solves the single biggest scaling bottleneck in data-intensive organizations: the ability for engineers to find, understand, and trust the data flowing through your platform. Build it incrementally, automate what you can, and make it a living document that evolves with your streaming infrastructure.