
Engineering

February 25, 2026

10 min read

Data Masking in Streaming Pipelines: PII Protection in Real Time

Learn how to mask, hash, and redact PII in real-time streaming pipelines. Implement data masking for GDPR, HIPAA, and SOC 2 compliance without slowing down your data flow.

TL;DR:

• Data masking in streaming applies transformations like hashing, redaction, and tokenization to PII fields as data flows through the pipeline, before it reaches the destination.
• Common techniques include SHA-256 hashing for pseudonymization, pattern-based redaction for emails and phone numbers, and field-level encryption for reversible masking.
• Implementing masking at the pipeline level (not the application or destination) ensures consistent protection regardless of who or what consumes the data.

Every record that flows through a streaming pipeline is a potential liability. A customer’s email address, a patient’s diagnosis code, a user’s Social Security number - each field of personally identifiable information (PII) that reaches a destination unmasked is a compliance violation waiting to happen and a breach that could cost your organization millions.

The challenge is not whether to protect sensitive data - every regulation from GDPR to HIPAA to SOC 2 demands it. The challenge is protecting it in motion, as records flow through real-time pipelines at thousands of events per second, without introducing latency or gaps that leave data exposed in some systems but not others.

This guide covers the techniques, architecture patterns, and practical details for masking PII in streaming pipelines.

Masking Techniques for Streaming Data

The right technique depends on your compliance requirements, whether the masking needs to be reversible, and how the downstream data will be used.

SHA-256 Hashing

Hashing applies a one-way cryptographic function to a field, producing a fixed-length output that cannot be reversed to recover the original value. SHA-256 is the standard choice.

-- Pseudonymize an email address
SELECT SHA2(email, 256) AS email_hash FROM users;
-- Input:  john.doe@example.com
-- Output: 836f82db99121b3481011f16b49dfa5fbc714a0d1b1b9f784a1ebbbf5b39577f

Hashing is ideal for pseudonymization - you can still join records on the hashed value (the same input always produces the same output), but you cannot recover the original email. This satisfies GDPR’s pseudonymization requirements and is the most common technique for streaming pipelines because it is fast, deterministic, and requires no external dependencies.

Limitation: Hashing is vulnerable to rainbow table attacks on low-cardinality fields. If you hash a field with a small set of possible values (e.g., gender or boolean flags), an attacker can precompute all hashes and reverse them. Use HMAC with a secret key for low-cardinality fields, or add a salt.
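A keyed hash can be sketched with Python's standard hmac module. The key name and its hard-coding here are illustrative only; in production the key would come from a KMS or secrets manager.

```python
import hashlib
import hmac

# Illustrative key; in production, load this from a KMS or secrets manager.
SECRET_KEY = b"rotate-me-via-your-kms"

def hmac_mask(value: str) -> str:
    """Keyed hash: still deterministic (joins work), but an attacker
    cannot precompute a rainbow table without the key."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Same input, same output, so joins on the masked value still work,
# but hmac_mask("F") cannot be precomputed without SECRET_KEY,
# unlike plain SHA-256 over a two-value field.
```

The output is a 64-character hex digest, just like SHA-256, so downstream schemas do not change.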

Pattern-Based Redaction

Redaction replaces characters in a field while preserving partial structure, useful when downstream systems need to verify format without seeing the full value.

import re

def redact_email(email: str) -> str:
    """john.doe@example.com → j*******@e******.com"""
    local, domain = email.split('@', 1)
    domain_name, _, tld = domain.rpartition('.')
    return (f"{local[0]}{'*' * (len(local) - 1)}"
            f"@{domain_name[0]}{'*' * (len(domain_name) - 1)}.{tld}")

def redact_ssn(ssn: str) -> str:
    """123-45-6789 → ***-**-6789"""
    return re.sub(r'^\d{3}-\d{2}', '***-**', ssn)

Redaction is human-readable and makes troubleshooting easier than opaque hashes, but it is not suitable when you need to join on the masked value.

Field Nullification

The simplest approach: replace the entire value with NULL. This is appropriate when the field has no analytical value and should never appear in the destination at all.

-- Nullify sensitive fields during ingestion
SELECT
    order_id,
    product_id,
    quantity,
    NULL AS customer_ssn,
    NULL AS credit_card_number
FROM orders;

Nullification is zero-cost computationally and eliminates any risk of PII leakage. The trade-off is total data loss for that field - you cannot join on it, filter by it, or recover it.

Tokenization

Tokenization replaces a sensitive value with a randomly generated token and stores the mapping in a secure vault. Unlike hashing, tokenization is reversible - if you have access to the vault, you can retrieve the original value.

import uuid

token_vault = {}  # In practice: AWS DynamoDB, HashiCorp Vault, etc.

def tokenize(value: str) -> str:
    if value not in token_vault:
        token_vault[value] = str(uuid.uuid4())
    return token_vault[value]

# tokenize("john.doe@example.com") → "a3f8c1d2-7e4b-4a9f-b6c1-8d2e3f4a5b6c"

Tokenization is the right choice when some authorized systems need to recover the original value while analytical systems work with the token. The trade-off is latency: every operation requires a vault lookup.

Format-Preserving Encryption (FPE)

FPE encrypts a value while preserving its original format and length. A 16-digit credit card number encrypts to a different 16-digit number. A 9-digit SSN encrypts to a different 9-digit string.

-- Format-preserving encryption examples
Input:  4532-1234-5678-9012  →  Output: 8291-7463-0285-1947
Input:  123-45-6789          →  Output: 847-29-3156

FPE is useful when downstream systems have strict format validation (e.g., a field must be exactly 16 digits) and cannot accept hashes or tokens. It is reversible with the encryption key, making it suitable for environments where authorized decryption is required. The downside is higher CPU cost compared to hashing or redaction.

Where to Apply Masking: Source, Pipeline, or Destination

You have three options, and only one is consistently correct.

Source-Level Masking

Masking at the source means the application writes pre-masked data to the database. Problem: Your transactional application probably needs the real data. You cannot hash a customer’s email in your production database - your application needs it to send emails.

Destination-Level Masking

Masking at the destination means raw PII flows through the pipeline and is masked by the destination system (e.g., a Snowflake masking policy or BigQuery column-level security). Problem: PII is exposed during transit and in intermediate systems (Kafka topics, staging tables, logs). You also have to configure masking independently in every destination, creating consistency gaps.

Pipeline-Level Masking

Pipeline-level masking is the best of both worlds. The source retains the original data for application use. The pipeline transforms PII fields inline before they reach any destination. Every downstream system receives consistently masked data.

This is the approach Streamkap takes: field-level transformations are configured in the pipeline, and masking is applied as records flow through, before data touches Kafka, Snowflake, BigQuery, or any other system.

Implementation Patterns

Field-Level Transforms

The most common pattern is declaring masking rules per field in your pipeline configuration:

# Pipeline masking configuration
transforms:
  - field: "customers.email"
    type: "sha256_hash"
  - field: "customers.ssn"
    type: "redact"
    pattern: "***-**-{{last4}}"
  - field: "customers.phone"
    type: "regex_replace"
    match: '(\d{3})-(\d{4})$'
    replace: "***-****"
  - field: "orders.credit_card"
    type: "nullify"

The transform engine processes fields before serialization to the destination, so the masked value is what gets written.
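The core of such an engine can be sketched as a dispatch from field name to masking function. This is an illustrative Python sketch mirroring the YAML rules above, not Streamkap's actual implementation; the field names and rule functions are assumptions for the example.

```python
import hashlib
import re

# Illustrative rule set mirroring the YAML configuration above.
RULES = {
    "email": lambda v: hashlib.sha256(v.encode()).hexdigest(),      # sha256_hash
    "ssn": lambda v: "***-**-" + v[-4:],                            # redact, keep last 4
    "phone": lambda v: re.sub(r"(\d{3})-(\d{4})$", "***-****", v),  # regex_replace
    "credit_card": lambda v: None,                                  # nullify
}

def apply_transforms(record: dict) -> dict:
    """Return a copy of the record with every configured field masked."""
    return {k: RULES[k](v) if k in RULES and v is not None else v
            for k, v in record.items()}

masked = apply_transforms({
    "order_id": 42,
    "email": "john.doe@example.com",
    "ssn": "123-45-6789",
    "phone": "415-555-0134",
    "credit_card": "4532123456789012",
})
# masked["ssn"] == "***-**-6789"; masked["credit_card"] is None
```

Because the rules run before serialization, non-sensitive fields like order_id pass through untouched.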

Regex-Based Masking

For unstructured text fields (support tickets, user comments, log messages) where PII can appear anywhere in the string, regex-based masking scans for patterns and redacts them:

import re
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def mask_emails_in_text(text: str) -> str:
    return EMAIL_PATTERN.sub('[REDACTED_EMAIL]', text)

# Input:  "Contact john@example.com or support@company.org for help"
# Output: "Contact [REDACTED_EMAIL] or [REDACTED_EMAIL] for help"

Lookup-Based Tokenization

For tokenization at scale, batch vault lookups to minimize latency. Buffer records into micro-batches and perform a single batch lookup to amortize the network round-trip. A naive per-record implementation will bottleneck the entire pipeline on vault latency.
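The batching pattern can be sketched as follows. The vault_batch_lookup function here is a hypothetical stand-in for a real batch call (for example a DynamoDB BatchGetItem/BatchWriteItem pair); it generates fresh tokens per call, whereas a real vault persists mappings across batches.

```python
import uuid

def vault_batch_lookup(values):
    """Stand-in for one network round-trip to a token vault.
    Returns a value -> token mapping for the whole batch."""
    return {v: str(uuid.uuid4()) for v in values}

def tokenize_batch(records, field, batch_size=500):
    """Tokenize one field across micro-batches, one vault call per batch."""
    out = []
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        # Deduplicate values so each distinct value is looked up once.
        tokens = vault_batch_lookup({r[field] for r in batch})
        for r in batch:
            out.append({**r, field: tokens[r[field]]})
    return out
```

With batch_size=500, a pipeline doing 50,000 records per second makes roughly 100 vault calls per second instead of 50,000.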

Masking in CDC Pipelines

CDC introduces specific challenges for data masking that batch pipelines do not face.

Masking Change Events

In a CDC pipeline, every record is a change event: an insert, update, or delete. Masking must be applied consistently across all event types. When a customer updates their email, the CDC event contains both old and new values, and both must be masked.

{
  "op": "u",
  "before": {
    "id": 12345,
    "email": "836f82db..."
  },
  "after": {
    "id": 12345,
    "email": "a1b2c3d4..."
  }
}

The masking transform runs on the before and after payloads independently, ensuring that neither the old nor new value leaks through in plaintext.
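A minimal sketch of that transform, assuming a Debezium-style event shape with before and after keys and SHA-256 as the masking function:

```python
import hashlib

def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode()).hexdigest()

def mask_change_event(event: dict, fields=("email",)) -> dict:
    """Apply the same masking to the 'before' and 'after' payloads of a
    Debezium-style change event so neither side leaks plaintext."""
    masked = dict(event)
    for payload_key in ("before", "after"):
        payload = event.get(payload_key)
        if payload is not None:  # inserts have no 'before'; deletes no 'after'
            masked[payload_key] = {
                k: sha256_hex(v) if k in fields else v
                for k, v in payload.items()
            }
    return masked
```

The None check matters: insert events carry no before image and delete events no after image, and the transform must tolerate both.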

Handling Re-Masking on Updates

With deterministic masking (SHA-256), the same input always produces the same output. An update from john@example.com to jane@example.com produces two different hashes, correctly reflecting the change. But with non-deterministic masking (tokenization), the same input must map to the same token across events. Otherwise, a no-op update appears as a change. Use a persistent vault with consistent mappings.

Delete Events and Tombstones

CDC delete events must also be masked. If your pipeline emits the deleted record’s full payload before a tombstone, that payload must be masked just like any other event.

Compliance Considerations

GDPR: Right to Erasure and Hashing

GDPR’s Article 17 grants individuals the right to erasure. Pseudonymized data (including hashes) is still personal data if it can be linked back to an individual. If you retain a mapping table connecting hashes to identities, the hashes must be deletable.

Practical approach: Use a keyed HMAC hash with per-user key material. To “erase” a user, destroy the key used for that user’s hashes. Without the key, the stored hashes can no longer be recomputed or linked back to the individual, and under most interpretations they no longer qualify as personal data.
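This pattern is often called crypto-shredding. A minimal sketch, assuming per-user keys held in a dict (in production they would live in a KMS):

```python
import hashlib
import hmac
import os

# Per-user key material; in production this lives in a KMS, not a dict.
user_keys = {}

def user_hash(user_id: str, value: str) -> str:
    """Hash a value under a key unique to this user (crypto-shredding)."""
    key = user_keys.setdefault(user_id, os.urandom(32))
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()

def erase_user(user_id: str) -> None:
    """'Erase' the user: destroy their key. Every hash ever produced for
    them becomes unlinkable, without rewriting historical records."""
    user_keys.pop(user_id, None)
```

The appeal is operational: erasure is a single key deletion rather than a scan-and-rewrite of every destination table.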

HIPAA De-Identification

HIPAA’s Safe Harbor method requires removing 18 specific identifier types from protected health information (PHI), including names, geographic data smaller than a state, dates (except year), phone numbers, email addresses, SSNs, medical record numbers, IP addresses, and biometric identifiers. Configure your pipeline masking to cover all 18 categories. Err on the side of over-masking - it is far cheaper to unmask a field later than to explain a HIPAA violation to OCR.

Audit Trails

SOC 2 and similar frameworks require demonstrating that masking controls are in place and functioning. Your pipeline should log every masking operation: which fields were masked, which technique was used, and when. Streamkap’s pipeline observability provides this audit trail out of the box, logging transformation operations as part of its standard monitoring.

Performance Impact

Not all masking operations cost the same. Here is a performance hierarchy, from fastest to slowest:

| Technique | Latency per Record | CPU Impact | External Dependencies |
| --- | --- | --- | --- |
| Nullification | ~0 | None | None |
| Regex redaction | 1-5 microseconds | Minimal | None |
| SHA-256 hashing | 2-10 microseconds | Low | None |
| HMAC hashing | 5-15 microseconds | Low | Key management |
| Format-preserving encryption | 50-200 microseconds | Moderate | Key management |
| Tokenization (cached) | 100-500 microseconds | Low | Token vault |
| Tokenization (uncached) | 1-10 milliseconds | Low | Token vault (network I/O) |

For a pipeline processing 50,000 records per second, SHA-256 hashing adds approximately 0.5 seconds of cumulative CPU time per second (negligible). Uncached tokenization, however, could add 50-500 seconds of wait time per second, which is unsustainable without batching and caching.

Batching Strategies

For techniques involving external lookups, buffer records into micro-batches of 100-1,000 and perform a single batch lookup. Smaller batches mean lower latency but more overhead; larger batches mean higher throughput but more latency.

Testing Masked Data

Masking that you have not verified is masking you cannot trust.

Verify No PII Leakage

Run automated scans on destination tables to detect PII patterns that should have been masked:

-- Check for unmasked email patterns in a Snowflake destination
SELECT COUNT(*) AS unmasked_emails
FROM customers
WHERE email RLIKE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}';
-- Expected result: 0

If this returns anything greater than zero, your masking has a gap. Run similar checks for SSN patterns, phone numbers, and any other PII format you are masking.

Verify Masking Consistency and Reversibility

For deterministic masking (hashing), confirm that the same input always produces the same output by checking for unexpected hash collisions in destination tables. For tokenization, verify that authorized detokenization produces the correct original value by running a sample of known inputs through the round-trip path.
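The tokenization round-trip check can be sketched as follows, with a simplified in-memory vault standing in for the real token service:

```python
import uuid

# Simplified in-memory vault with forward and reverse maps;
# stands in for the real token service.
_forward, _reverse = {}, {}

def tokenize(value: str) -> str:
    if value not in _forward:
        token = str(uuid.uuid4())
        _forward[value], _reverse[token] = token, value
    return _forward[value]

def detokenize(token: str) -> str:
    return _reverse[token]

def verify_round_trip(samples) -> bool:
    """Confirm tokenize -> detokenize recovers every known input."""
    return all(detokenize(tokenize(s)) == s for s in samples)
```

Running this against a sample of known production values on a schedule turns reversibility from an assumption into a monitored property.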

Common Pitfalls

Masking Only the Primary Path

You mask customers.email, but the same email also lives in orders.billing_email, support_tickets.reporter_email, and a JSON blob inside events.metadata. PII propagates across tables and fields. Audit every column and nested structure, not just the obvious ones.

Forgetting Nested and Semi-Structured Fields

JSON columns, VARIANT fields in Snowflake, and nested structs in Avro/Parquet are particularly dangerous. A masking rule that targets customers.email will not catch an email buried inside events.payload.user.contact.email. Your masking engine must support explicit nested field paths:

transforms:
  - field: "events.payload.user.contact.email"
    type: "sha256_hash"
  - field: "events.payload.user.contact.phone"
    type: "regex_replace"
    match: '(\d{3})-(\d{4})$'
    replace: "***-****"
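Resolving such a dotted path against a nested record can be sketched like this; the event shape and path are assumptions for the example:

```python
import hashlib

def mask_nested(record: dict, path: str, fn) -> None:
    """Apply a masking function at a dotted path like
    'payload.user.contact.email', in place. Missing intermediate
    keys are left untouched rather than raising."""
    *parents, leaf = path.split(".")
    node = record
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return
    if leaf in node:
        node[leaf] = fn(node[leaf])

event = {"payload": {"user": {"contact": {"email": "john@example.com"}}}}
mask_nested(event, "payload.user.contact.email",
            lambda v: hashlib.sha256(v.encode()).hexdigest())
```

Tolerating missing intermediate keys matters in practice: semi-structured payloads rarely have a uniform shape, and a masking rule must not crash the pipeline on records that lack the field.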

Assuming Hashing Is Irreversible

SHA-256 is computationally irreversible, but practically reversible for predictable inputs. If you hash a 10-digit phone number without a salt, an attacker can hash all 10 billion possible phone numbers in hours. Always salt or use HMAC for fields with limited cardinality.
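The attack is easy to demonstrate at toy scale. A 4-digit PIN stands in here for any low-cardinality field; a 10-digit phone number is the same attack with more compute.

```python
import hashlib

def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode()).hexdigest()

# An attacker who knows the input space can precompute every hash.
rainbow = {sha256_hex(f"{n:04d}"): f"{n:04d}" for n in range(10_000)}

leaked_hash = sha256_hex("7431")   # "masked" value found in a destination
recovered = rainbow[leaked_hash]   # original recovered by a dict lookup
# recovered == "7431"
```

The table takes milliseconds to build; the same precomputation over every 10-digit phone number is well within reach of commodity hardware, which is why unsalted hashes of such fields are not real protection.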

Inconsistent Masking Across Destinations

If you mask at the destination level, you are maintaining separate configurations for Snowflake, BigQuery, your data lake, and every other consumer. Inevitably, one drifts and PII leaks. Pipeline-level masking eliminates this risk with a single, consistent set of rules.

Masking in Development but Not Production

Test environments are often set up carefully, with masking enabled or synthetic data in place, while production gets missed when masking was added as an afterthought. Treat masking as a pipeline-level configuration that is promoted through environments, from development to staging to production, with validation at each stage.

Logging Unmasked Data

Your pipeline logs might capture the raw record payload for debugging. If those logs include PII, you have bypassed your entire masking strategy. Ensure that logging and error-handling paths apply the same masking transforms as the main data path, or configure logging to exclude sensitive fields entirely.
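One way to enforce this is a logging filter that redacts sensitive fields before any handler sees them. A minimal sketch using Python's standard logging module; the record_payload attribute name and field list are assumptions for the example:

```python
import logging

SENSITIVE_FIELDS = {"email", "ssn", "credit_card"}

class MaskingFilter(logging.Filter):
    """Redact sensitive fields from any dict attached to a log record as
    'record_payload', before the record is formatted or shipped."""
    def filter(self, record: logging.LogRecord) -> bool:
        payload = getattr(record, "record_payload", None)
        if isinstance(payload, dict):
            record.record_payload = {
                k: ("[REDACTED]" if k in SENSITIVE_FIELDS else v)
                for k, v in payload.items()
            }
        return True

logger = logging.getLogger("pipeline")
logger.addFilter(MaskingFilter())
# Even debug-path logging now sees "[REDACTED]" instead of the raw email:
logger.error("write failed", extra={"record_payload": {"email": "a@b.com"}})
```

Because filters run before formatters and handlers, the raw value never reaches a log file or a log aggregation service.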


Data masking in streaming pipelines is not a feature you bolt on after the fact. It is an architectural decision that should be made when the pipeline is designed, applied consistently at the pipeline level, and validated continuously. The hard part is not the cryptography - it is the discipline of applying masking to every field, in every path, across every destination, and verifying that it works.

Start with a PII audit of your source data. Choose the masking technique that matches your compliance requirements. Apply it at the pipeline level. Test it. Monitor it. Treat any unmasked PII in a destination as a bug, not an oversight.

Ready to protect PII in your streaming pipelines? Streamkap provides built-in field-level masking that applies as data flows from source to destination - no custom code, no destination-specific configuration, and no gaps. Start a free trial and configure your first masking rule in minutes.