Technology

PostgreSQL CDC Multi-Tenant Setups Done Right

AUTHOR BIO
Ricky has 20+ years of experience in data, DevOps, databases, and startups.

September 18, 2025

If you're building a modern SaaS platform, you know that delivering real-time, isolated analytics for each customer is a huge competitive advantage. Moving beyond slow, overnight batch jobs to event-driven updates is no longer a "nice-to-have"—it's essential for a great user experience. This is precisely where setting up PostgreSQL CDC in a multi-tenant environment comes in, solving the massive challenge of scaling data services for thousands of tenants efficiently.

Why Multi-Tenant CDC Is Essential for Modern SaaS

In the cutthroat SaaS world, real-time data is what fuels growth and keeps users from churning. Customers expect their dashboards, reports, and synced data to be instant. The old-school approach of running nightly ETL (Extract, Transform, Load) jobs just doesn't cut it anymore. That data lag frustrates users and gets in the way of smart decisions.

This is where Change Data Capture (CDC) becomes your secret weapon.

By using CDC with your multi-tenant PostgreSQL database, your application can react to data changes the moment they happen. Instead of constantly hammering your database with queries, CDC taps into the transaction log, catching every single insert, update, and delete in real-time.

This switch pays off immediately in a few key ways:

  • Real-Time Analytics: Your tenants see their data update on dashboards instantly, not 24 hours later.
  • Improved Performance: Reading from the transaction log means you avoid bogging down your primary production database with heavy, repetitive queries.
  • Data Synchronization: It's the engine that powers a seamless flow of data between microservices, keeping different parts of your application perfectly in sync for every single tenant.

The Problem with Traditional Methods

Let's be honest: traditional batch processing is a nightmare in a multi-tenant world. A single, monolithic batch job for all tenants is painfully slow, hogs resources, and creates massive data delays. It's an approach that completely falls apart as you start onboarding hundreds or thousands of new customers.

The rapid move toward CDC is a direct response to this problem and is now a core part of the streaming ETL market. To get a better handle on this shift, it's worth understanding the fundamentals of change data capture for streaming ETL.

The industry is voting with its wallet. The ETL market, which includes CDC, is on track to explode from $8.85 billion in 2025 to $18.60 billion by 2030. This isn't just a trend; it's a fundamental move away from outdated batch processing.

Addressing Multi-Tenant Complexities

Of course, building a PostgreSQL CDC multi-tenant pipeline isn't without its own set of puzzles. The biggest one? Ensuring absolute data isolation. You can never let one tenant see another's data.

You also have to tackle the classic "noisy neighbor" problem, where one hyper-active tenant could flood the data pipeline and degrade performance for everyone else. A well-designed CDC architecture is built from the ground up to manage these issues, giving you a scalable, secure way to deliver real-time value to every single customer.

Picking the Right PostgreSQL Multi-Tenant Schema

How you structure your multi-tenant setup in PostgreSQL is more than just a database design choice—it's the foundation of your entire CDC pipeline. Get it right, and you'll have a system that isolates data cleanly and scales beautifully. Get it wrong, and you're in for a world of performance headaches, security vulnerabilities, and operational nightmares.

A little forethought here goes a long way in building a stable multi-tenant CDC environment.

[Image: high-level server architecture for the three multi-tenancy models]

The image above gives you a high-level look at the server architecture behind these models. It really drives home the point that your infrastructure needs to be planned around how you separate tenant data, whether that's through distinct databases, schemas, or a shared-table approach.

When it comes to a PostgreSQL CDC multi-tenant setup, you’re generally looking at three main models. Each one forces you to make some tough trade-offs. For instance, giving every tenant their own database offers fortress-like isolation but ramps up engineering complexity and costs. On the other side of the spectrum, piling tenants into shared tables is efficient but opens the door to "noisy neighbor" problems.

Let's unpack these options.

Comparison of PostgreSQL Multi-Tenancy Models

The decision really comes down to a classic balancing act: how much are you willing to trade operational cost and scalability for complete data isolation? A startup catering to a few enterprise clients might lean toward the total separation of a database-per-tenant model. In contrast, a SaaS platform with thousands of users will almost certainly gravitate toward a more efficient, shared architecture.

This table breaks down the core trade-offs to help guide your decision.

  • Database per Tenant: Excellent (highest) data isolation, but low scalability (hard to scale) and low cost efficiency (high overhead). Best for applications with a few high-value tenants needing maximum security and customization.
  • Schema per Tenant: Good (strong) data isolation with medium scalability and medium cost efficiency. Best for platforms that need strong logical separation without the overhead of managing many databases.
  • Shared Schema: Fair data isolation (requires RLS), but excellent (highest) scalability and high cost efficiency, making it the most efficient option. Best for high-volume SaaS applications that require massive scalability and cost-effective operations.

Ultimately, choosing the right model is about aligning your architecture with your business needs. Each path has its own set of challenges and benefits, so it's critical to understand them before you commit.

A Closer Look at the Shared-Schema Model

For the vast majority of modern SaaS applications, the shared-schema model using a tenant_id column is the way to go. It provides the clearest path to scalability and cost savings, making it the most practical choice for any growing business. Its success, however, depends entirely on how well you enforce data isolation at the application and database levels.

This is where your table design becomes absolutely critical. Any table containing tenant-specific data must include a tenant_id column.

This isn't just another column; it's the linchpin of your entire multi-tenant strategy.

  • Filtering Data: Every single query must have a WHERE tenant_id = 'current_tenant' clause. No exceptions. This is your first line of defense against data leakage.
  • Enforcing Security: It’s the essential building block for PostgreSQL's Row-Level Security (RLS), a powerful feature we'll get into later.
  • Boosting Performance: You absolutely need to create a B-tree index on the tenant_id column. If you have composite primary keys, tenant_id should almost always be the first column in that key.

Pro Tip: I can't stress this enough: make foreign keys your best friend in a shared-schema model. For example, a projects table should have a composite foreign key that includes tenant_id, referencing the users table on both tenant_id and the user's ID. This creates a rock-solid, database-level guarantee that a user from Tenant A can never be accidentally linked to a project from Tenant B.
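To make that concrete, here's a minimal sketch of the pattern. The table and column names (users, projects, owner_id) are purely illustrative, not prescriptive:

-- Illustrative shared-schema layout: every tenant-owned table carries tenant_id.
CREATE TABLE users (
    tenant_id  uuid   NOT NULL,
    id         bigint GENERATED ALWAYS AS IDENTITY,
    email      text   NOT NULL,
    PRIMARY KEY (tenant_id, id)  -- tenant_id leads the composite key
);

CREATE TABLE projects (
    tenant_id  uuid   NOT NULL,
    id         bigint GENERATED ALWAYS AS IDENTITY,
    owner_id   bigint NOT NULL,
    name       text   NOT NULL,
    PRIMARY KEY (tenant_id, id),
    -- Composite FK: a project can only reference a user from the same tenant.
    FOREIGN KEY (tenant_id, owner_id) REFERENCES users (tenant_id, id)
);

-- Speeds up tenant-scoped joins from projects back to users.
CREATE INDEX idx_projects_tenant_owner ON projects (tenant_id, owner_id);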

If you’re dealing with massive tables, you can take this even further with PostgreSQL's declarative partitioning. By partitioning tables by tenant_id (using either list or hash partitioning), you can physically group a tenant's data on disk. This can lead to huge performance gains for queries and makes maintenance tasks, like tenant-specific backups or data archival, much simpler. Getting this architecture right from the start will set you up for a scalable and secure PostgreSQL CDC multi-tenant environment.
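As a rough sketch, hash partitioning a large shared table by tenant_id looks like this. The events table, the UUID keys, and the partition count of 8 are all illustrative, and gen_random_uuid() assumes PostgreSQL 13 or newer:

-- Illustrative: hash-partition a large shared table by tenant_id.
CREATE TABLE events (
    tenant_id   uuid        NOT NULL,
    id          uuid        NOT NULL DEFAULT gen_random_uuid(),
    payload     jsonb,
    created_at  timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant_id, id)  -- the partition key must be part of the primary key
) PARTITION BY HASH (tenant_id);

-- Create a fixed number of hash partitions (8 here, purely for illustration).
CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 8, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 8, REMAINDER 1);
-- ...and so on, up to events_p7.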

Getting PostgreSQL Ready for Logical Replication

Before you can get data flowing for your multi-tenant app, you need to get your PostgreSQL instance ready to share its transaction log. This is done through logical replication, which is really the engine behind modern Change Data Capture (CDC). It’s how tools like Streamkap can tap into your database’s change stream safely and without bogging down performance.


The setup comes down to a few key tweaks in your postgresql.conf file. These settings dictate what PostgreSQL writes to its Write-Ahead Log (WAL) and how many clients can connect to stream those changes. Nailing these values is absolutely essential for building a PostgreSQL CDC multi-tenant pipeline that’s both stable and can handle the load.

Fine-Tuning Your Configuration File

First things first, you'll need to find and edit your postgresql.conf file. Its location can differ based on your setup, but it’s usually hanging out in your PostgreSQL data directory. Once you’ve got it open, there are three parameters that need your attention.

  • wal_level: This one is non-negotiable. The default is often replica, which is fine for physical replication but won't cut it for CDC. You must change this to logical. This setting tells PostgreSQL to pack all the rich, row-level details of every change into the WAL.
  • max_wal_senders: Think of this as the number of open lanes for streaming data out of your database. Each CDC pipeline needs a connection. For a single pipeline, setting this to 10 is a solid, safe place to start.
  • max_replication_slots: A replication slot is your safety net. It guarantees the primary server won't delete WAL files before your CDC tool has processed them, which is critical for preventing data loss. A value of 10 here also provides a comfortable buffer.

A quick but important heads-up: after you save these changes, you have to do a full restart of the PostgreSQL server. A simple reload won't apply these particular settings.
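If you'd rather not hunt down the file by hand, the same three settings can be applied from a SQL session with ALTER SYSTEM; they're written to postgresql.auto.conf and still require a restart. On managed services like RDS or Aurora you'd change them through a parameter group instead. A minimal sketch:

-- Run as a superuser; these are written to postgresql.auto.conf.
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET max_replication_slots = 10;

-- After restarting PostgreSQL, confirm the values took effect:
SELECT name, setting
FROM pg_settings
WHERE name IN ('wal_level', 'max_wal_senders', 'max_replication_slots');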

Creating a Dedicated Replication User

Your next move is to create a specific user just for your CDC process. It's tempting to use a superuser account, but that's a major security no-go. A dedicated user with just enough permissions is the way to go—it’s all about following the principle of least privilege to keep your production database locked down.

A couple of straightforward SQL commands are all you need to create the user and assign the right roles. This user really only needs REPLICATION and LOGIN privileges.

CREATE ROLE streamkap_user WITH REPLICATION LOGIN PASSWORD 'a-very-strong-password';
GRANT rds_replication TO streamkap_user; -- Only needed on RDS/Aurora

After the user is created, you’ll also want to grant it SELECT privileges on the tables you'll be streaming. This lets the CDC tool grab an initial snapshot of the data when it first connects.
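What that grant looks like depends on your schema layout. Here's a sketch for a shared public schema; swap in your tenant schemas if you use a schema-per-tenant model:

-- Let the CDC user read existing tables for the initial snapshot.
GRANT USAGE ON SCHEMA public TO streamkap_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO streamkap_user;

-- Keep that access for tables created later (new tenants, new features).
-- Note: default privileges apply to objects created by the role running this
-- statement; add FOR ROLE if your migrations run as a different role.
ALTER DEFAULT PRIVILEGES IN SCHEMA public
    GRANT SELECT ON TABLES TO streamkap_user;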

If you're on a managed database service like Amazon RDS, the steps are a bit different but the core idea is the same. For a fantastic step-by-step guide, check out this article on how to set up AWS RDS PostgreSQL for CDC.

Once these configurations are locked in, your database is officially ready. It can now broadcast all data changes, setting the stage for a powerful, real-time multi-tenant data architecture.

Setting Up Your Streamkap CDC Pipeline

Alright, with your PostgreSQL database ready for logical replication, it's time to hook it up to a streaming platform and bring that multi-tenant design to life. This is where the rubber meets the road—translating your schema into a real, flowing data pipeline. We'll use Streamkap for this, and while it's pretty intuitive, a few specific tweaks for multi-tenancy will save you a world of headaches down the line.

Building a pipeline isn't just about punching in credentials. You need to tell the system exactly what to watch and how to handle it, especially in a PostgreSQL CDC multi-tenant setup where new tenants and their tables will pop up all the time.

Connecting to Your PostgreSQL Source

First things first, let's get connected. Inside the Streamkap UI, you'll create a new PostgreSQL Source Connector. This part is standard fare: you'll need the hostname, port, database name, and the dedicated streamkap_user you created earlier.

For a full rundown of every option, our Streamkap PostgreSQL source connector guide has you covered.

Once you plug in the details, Streamkap runs a quick check to make sure the user has the right replication permissions. A green light here means you're ready for the fun part: telling it what data to capture.

Nailing Your Table Inclusion Strategy

This is the most important step for a scalable multi-tenant architecture. Manually adding tables every time a new customer signs up? That’s a recipe for disaster. Instead, you'll want to use regular expressions in the "Tables & Schemas" configuration to create a dynamic "allow list."

Let's say you're using a schema-per-tenant model, where each schema is named something like tenant_<tenant_id>. Your regex could be as simple as:

tenant_.*\.orders

Boom. Just like that, the orders table from any new tenant schema you create will be captured automatically. No manual intervention needed. This is how you build a system that scales with you. You can also use the "Exclude" list to filter out noisy internal schemas you don't care about, like information_schema or pg_catalog.

Real-World Tip: I've seen a lot of teams use a shared schema but prefix tables with a tenant ID (e.g., t1_users, t2_users). A simple regex like t[0-9]+_.* is perfect for this. It grabs all the tenant-specific tables and ignores shared ones like plans or features. This one little trick can save you hundreds of hours of manual config work over the life of your product.

Choosing the Right Snapshot Mode

Finally, you have to decide how to handle the initial data load. This is called the snapshot mode. For a live multi-tenant database, you’ll likely be looking at two main options in Streamkap:

  • initial: This takes a complete snapshot of every included table the first time the pipeline runs. It’s the go-to choice for new projects where you need to backfill all historical data into your destination. Just a heads-up: if you have thousands of tenants, this initial snapshot can take a while and will hit your database resources, so plan accordingly.
  • never: Pick this if you only want to capture changes from this moment forward. It’s a great option if you've already migrated the historical data through a different process and just need to start streaming new changes.

Most people start with initial. After the snapshot is done, the connector flips over to streaming live changes directly from the WAL. From that point on, your multi-tenant data will be flowing in real time, just as you planned.

Using Row-Level Security for True Data Isolation

Your application logic is the first line of defense for keeping tenant data separate, but you can't stop there. Let's be realistic: a single bug or a misconfigured CDC connector could accidentally bleed one tenant's data into another's stream. In a multi-tenant world, that’s not just an error; it's a catastrophe.

This is exactly why you need PostgreSQL's native Row-Level Security (RLS). It's your ultimate safety net.

Think of RLS as a firewall built right into your database, operating at the row level. It guarantees that no matter how a user or service connects, they only see the data they're explicitly allowed to see. For a PostgreSQL CDC multi-tenant pipeline, this is a complete game-changer. It means that even if your CDC tool's queries (the initial snapshot, for example) try to pull the entire table, PostgreSQL itself steps in and filters the results based on the policies that apply to that session. Data leakage through those reads becomes virtually impossible.

How to Create a Tenant-Aware Security Policy

Getting RLS up and running is a two-step dance. First, you flip the switch to enable it on a table. Then, you write the rules—the policies—that dictate who sees what.

Let's walk through a common scenario. Imagine a shared orders table with a tenant_id column. Our goal is to lock down our dedicated streamkap_user so it can only see rows for one specific tenant. The trick is to set a session variable for that user that tells PostgreSQL which tenant_id it's associated with.

Here’s how you’d actually do it:

  • First, enable RLS on the table. This tells PostgreSQL to start enforcing policies for any query hitting the orders table.

ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

  • Next, create the access policy. This is the magic part. The policy links the tenant_id in the table to a session variable we'll define.

CREATE POLICY tenant_isolation_policy ON orders
FOR SELECT
USING (tenant_id::text = current_setting('app.tenant_id'));

Once that's in place, the streamkap_user can only read rows where orders.tenant_id matches the app.tenant_id variable set for its session. If that variable isn't set, the query errors out instead of quietly returning another tenant's rows. Simple, effective, and secure. You've just created an isolated data stream right at the database level.
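How you set app.tenant_id is up to you. One simple pattern, assuming you run a dedicated replication role per tenant pipeline, is to pin the variable at the role level so every session that role opens inherits it:

-- Hypothetical: one replication role per tenant pipeline.
ALTER ROLE streamkap_user SET app.tenant_id = 'tenant_a';

-- In any new session opened by streamkap_user:
-- SELECT current_setting('app.tenant_id');  --> 'tenant_a'
-- SELECT * FROM orders;                     --> only Tenant A's rows

One caveat worth remembering: RLS policies don't apply to the table owner or to roles with BYPASSRLS, so keep streamkap_user a plain, non-owning role.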

The Power of Database-Enforced Isolation

Enforcing security this deep in the stack is an incredibly powerful concept. Row-Level Security has become a cornerstone of modern multi-tenancy in PostgreSQL because it bakes access policies directly into the data itself. It often removes the need for complex and brittle database partitioning schemes, which is a huge win for developer productivity and simpler operations.

By putting the security burden on the database layer, RLS acts as your final backstop. It’s your guarantee that even with a buggy application or a misconfigured CDC tool, tenant data remains strictly siloed. This is absolutely critical for maintaining compliance with regulations like GDPR.

For even more robust privacy in your multi-tenant setup, you might also want to look into advanced data anonymization strategies for PostgreSQL. But make no mistake: implementing RLS isn't just a "nice-to-have." It's a non-negotiable step for building a secure, trustworthy, and scalable multi-tenant CDC architecture.

Common Questions on Multi-Tenant CDC

When you start architecting a PostgreSQL CDC multi-tenant setup, you quickly move past the high-level concepts and into the nitty-gritty details. Getting this right in a live production environment means thinking hard about performance, schema evolution, and how you'll route all that data. Let's dig into the questions that come up time and time again.


Will CDC Affect My Database Performance?

This is always the first question, and for good reason. No one wants to bring their production database to its knees.

The good news is that logical replication is designed to be incredibly lightweight. It works by reading directly from the Write-Ahead Log (WAL)—the same journal PostgreSQL uses for crash recovery—not by querying your live tables. This means you avoid putting any extra query load on your primary database.

But, there's a catch, especially in a busy multi-tenant world. You have to keep an eye on WAL generation. If a replication slot becomes inactive or starts lagging significantly, it can stop PostgreSQL from cleaning up old WAL files. This backlog can grow and, in a worst-case scenario, fill up your disk.

The trick is to make sure your CDC consumer is always keeping up. Proactive monitoring of replication lag and disk usage isn't just a good idea; it's absolutely essential for a healthy system.
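A simple way to keep tabs on this is to ask PostgreSQL how much WAL each replication slot is holding back. A sketch (slot names will depend on how your connector is configured):

-- How much WAL each replication slot is retaining, in human-readable form.
SELECT
    slot_name,
    active,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
    ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) DESC;

Alert on this number growing steadily, and on any slot that shows active = false for longer than expected.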

How Should I Handle Tenant Schema Changes?

Managing schema changes (DDL) is where many multi-tenant strategies live or die. Modern CDC connectors, often powered by tools like Debezium, are built to detect these schema changes and propagate the new structure alongside your data. When you add a new column to a tenant's table, subsequent change events carry the updated schema, letting downstream systems know about the new structure.

The best practice here is to have a fully automated and well-tested migration workflow. Your consumers, whether they are microservices or your data warehouse, need to be designed for graceful schema evolution. This means they can handle new fields without crashing, ensuring your data pipelines don't break every time a developer ships a feature.

In a multi-tenant SaaS application, you could easily be dealing with hundreds of tables across dozens or even hundreds of schemas. Building pipelines that can adapt to schema changes without someone having to intervene manually is a game-changer. This is where an ELT approach really shines, as you can land the raw data and handle transformations in the destination.
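If you're on a schema-per-tenant layout, that automation can be as simple as looping one migration over every tenant schema. A rough sketch, assuming schemas named tenant_<id>, an orders table in each, and a hypothetical discount_code column being added:

-- Apply one DDL change to every tenant schema in a single pass.
DO $$
DECLARE
    s text;
BEGIN
    FOR s IN
        SELECT schema_name
        FROM information_schema.schemata
        WHERE schema_name LIKE 'tenant\_%'
    LOOP
        EXECUTE format(
            'ALTER TABLE %I.orders ADD COLUMN IF NOT EXISTS discount_code text',
            s
        );
    END LOOP;
END $$;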

Can I Filter Data for Specific Tenants?

You might think filtering should happen right at the source, but with PostgreSQL logical replication, that's not the typical approach. The real power comes from handling the routing and filtering within your streaming platform. This is exactly where a tool like Streamkap becomes invaluable.

You can set up transforms directly in the pipeline to route events based on a tenant_id (or whatever you call it) in the data payload.

  • Send all events for tenant-A to its own dedicated Kafka topic.
  • Route data for tenant-B to a completely separate Kinesis stream.
  • Filter out any internal or test tenants so that junk data never even makes it to your data warehouse.

This strategy gives you strict data isolation not just at the database level, but all the way through your data infrastructure. It's the key to building a secure, organized, and truly event-driven architecture.