
Engineering

February 25, 2026

9 min read

Computed Columns in Streaming: Deriving New Fields On-the-Fly

Learn how to add computed columns to streaming data - derived fields, calculations, lookups, and business logic applied in real time as data flows through your pipeline.

TL;DR:

  • Computed columns are new fields derived from existing data - calculations, concatenations, lookups, and business logic applied in real time as records flow through the pipeline.
  • Common use cases include adding full_name from first_name + last_name, calculating order_total from quantity * price, and deriving region from country codes.
  • Computing columns in the pipeline (not the destination) ensures every consumer gets the enriched data without duplicating transformation logic.

Every data team has been there: you write a query that concatenates first_name and last_name into full_name, multiplies quantity by unit_price to get line_total, and wraps a country code in a CASE statement to derive a region. Then you copy that same logic into the next query, and the next dashboard, and the next microservice. Before long, the same derivation lives in dozens of places, each one a potential source of drift.

Computed columns solve this problem by moving derived field logic into the streaming pipeline itself. Instead of repeating expressions at the point of consumption, you define them once where data is in motion. Every downstream consumer - a warehouse, a real-time dashboard, an operational application - receives the enriched record with no additional work.

This guide covers the types of computed columns you can add, how to express them in Flink SQL, common use cases, and how to decide whether a computation belongs in the pipeline or in the destination.

Types of Computed Columns

Computed columns fall into several categories based on the kind of expression they evaluate. Understanding each type helps you choose the right approach for a given field.

Arithmetic

The simplest computed columns apply math to numeric source fields. Multiplying, dividing, adding, and subtracting are all valid operations.

SELECT
  order_id,
  quantity,
  unit_price,
  quantity * unit_price AS line_total,
  quantity * unit_price * 0.08 AS estimated_tax
FROM orders_stream;

String Manipulation

Concatenation, trimming, case conversion, and substring extraction let you reshape text fields. This is especially common when source systems store names, addresses, or identifiers in fragments.

SELECT
  customer_id,
  UPPER(TRIM(first_name)) || ' ' || UPPER(TRIM(last_name)) AS full_name,
  LOWER(email) AS normalized_email,
  SUBSTRING(phone_number FROM 1 FOR 3) AS area_code
FROM customers_stream;

Date and Time Extraction

Temporal fields often need to be broken into components - year, month, day of week - for downstream grouping and filtering.

SELECT
  order_id,
  order_timestamp,
  EXTRACT(YEAR FROM order_timestamp) AS order_year,
  EXTRACT(MONTH FROM order_timestamp) AS order_month,
  DAYOFWEEK(order_timestamp) AS order_day_of_week,
  TIMESTAMPDIFF(DAY, customer_since, order_timestamp) AS customer_tenure_days
FROM orders_stream;

Conditional Logic (CASE)

CASE expressions map raw values to human-readable labels or business categories. They are the streaming equivalent of a lookup table baked into the query.

SELECT
  order_id,
  status_code,
  CASE status_code
    WHEN 1 THEN 'pending'
    WHEN 2 THEN 'processing'
    WHEN 3 THEN 'shipped'
    WHEN 4 THEN 'delivered'
    WHEN 5 THEN 'cancelled'
    ELSE 'unknown'
  END AS status_label
FROM orders_stream;

Type Conversion

Source systems sometimes emit numeric IDs as strings, or timestamps as epoch integers. CAST expressions and conversion functions such as TO_TIMESTAMP_LTZ normalize these before they reach the destination.

SELECT
  CAST(user_id AS BIGINT) AS user_id,
  TO_TIMESTAMP_LTZ(event_timestamp_ms, 3) AS event_timestamp,
  CAST(amount_cents AS DECIMAL(10, 2)) / 100 AS amount_dollars
FROM events_stream;

Common Use Cases

Computed columns show up in nearly every streaming pipeline. Here are the patterns that appear most often.

Full name concatenation. Combine first_name and last_name so downstream reports never need to reconstruct the value.

Order totals. Multiply quantity by unit_price and optionally apply tax or discount rates to produce line_total and net_total in a single pass.

Age from birthdate. Calculate TIMESTAMPDIFF(YEAR, birth_date, CURRENT_DATE) to produce an integer age field that updates as records flow.

Category mapping. Translate numeric codes or short abbreviations into readable labels using CASE. Product category IDs, status codes, and priority levels are all good candidates.

Region derivation. Map country_code to a geographic region so analysts can group by region without maintaining a separate lookup table in every BI tool.

SELECT
  customer_id,
  country_code,
  CASE
    WHEN country_code IN ('US', 'CA', 'MX') THEN 'North America'
    WHEN country_code IN ('GB', 'DE', 'FR', 'ES', 'IT') THEN 'Europe'
    WHEN country_code IN ('JP', 'KR', 'SG', 'AU') THEN 'APAC'
    ELSE 'Other'
  END AS region
FROM customers_stream;

Computed Columns from CDC Data

Change Data Capture streams carry operational metadata that is often useful to preserve or transform. Adding computed columns to CDC records lets you annotate each row with context about how it changed, not just what it contains.

SELECT
  *,
  CASE __op
    WHEN 'c' THEN 'insert'
    WHEN 'u' THEN 'update'
    WHEN 'd' THEN 'delete'
    WHEN 'r' THEN 'snapshot'
    ELSE 'unknown'
  END AS _operation_type,
  TO_TIMESTAMP_LTZ(__source_ts_ms, 3) AS _change_timestamp,
  __table AS _source_table,
  __db AS _source_database
FROM cdc_orders_stream;

These metadata columns make it straightforward to build audit logs, detect deletes in append-only destinations, and filter records by operation type - all without additional processing after the data lands.

Complex Expressions

Simple expressions cover the majority of use cases, but some business logic requires nesting and fallback handling.

Nested CASE

When categorization depends on multiple fields, nest your CASE statements or combine conditions with AND/OR.

SELECT
  order_id,
  order_total,
  customer_type,
  CASE
    WHEN order_total >= 1000 AND customer_type = 'enterprise' THEN 'high_value_enterprise'
    WHEN order_total >= 1000 AND customer_type = 'smb' THEN 'high_value_smb'
    WHEN order_total >= 100 THEN 'mid_value'
    ELSE 'low_value'
  END AS order_segment
FROM orders_stream;

COALESCE for Defaults

Null fields are a constant source of downstream errors. COALESCE lets you provide a sensible default without losing the original value when it is present.

SELECT
  customer_id,
  COALESCE(preferred_name, first_name, 'Unknown') AS display_name,
  COALESCE(shipping_address, billing_address) AS delivery_address,
  COALESCE(discount_pct, 0) AS effective_discount_pct
FROM customers_stream;

Regex Extraction

When source fields embed structured data inside free-text strings - email domains, URL paths, log messages - regex extraction pulls out the relevant piece.

SELECT
  user_id,
  email,
  REGEXP_EXTRACT(email, '@(.+)$', 1) AS email_domain,
  REGEXP_EXTRACT(page_url, '/products/([^/]+)', 1) AS product_slug
FROM clickstream;

Performance Considerations

Computed columns that use built-in scalar functions - arithmetic, string manipulation, CASE, CAST - add negligible overhead. These are CPU-bound operations that complete in microseconds per record, and modern streaming engines process millions of such evaluations per second without measurable latency impact.

Things to watch for:

  • UDFs (User-Defined Functions): Custom functions that call external services, perform complex parsing, or allocate significant memory can become a bottleneck. Profile UDFs separately and consider caching results if the function is deterministic.
  • Regex complexity: Simple patterns are fast. Patterns with excessive backtracking (nested quantifiers, ambiguous alternations) can slow down processing. Test regex performance on representative data before deploying.
  • Heavy operations on large string fields: Casting very long strings to other types or running repeated substring operations on large text fields consumes more CPU than the same operations on short fields.

As a general rule: if an expression is cheap enough to evaluate per row in an ordinary SQL SELECT, it will be just as cheap per record in a streaming context.
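
The caching suggestion for deterministic UDFs can be sketched independently of any particular engine. This is a minimal illustration in Python, using a hypothetical parse_device_type function as the expensive derivation; in Flink you would wrap the same memoization idea inside a ScalarFunction:

```python
from functools import lru_cache

# Call counter exists only to demonstrate that the cache is working.
calls = {"count": 0}

@lru_cache(maxsize=10_000)
def parse_device_type(user_agent: str) -> str:
    """Hypothetical expensive, deterministic derivation (e.g. UA parsing)."""
    calls["count"] += 1
    # Stand-in for real parsing logic.
    return "mobile" if "Mobile" in user_agent else "desktop"

# Streams often repeat a small set of distinct values, so the cache turns
# one evaluation per record into one evaluation per distinct input.
stream = ["Mozilla/5.0 Mobile", "Mozilla/5.0", "Mozilla/5.0 Mobile"] * 1000
results = [parse_device_type(ua) for ua in stream]

print(calls["count"])  # -> 2 distinct inputs evaluated, not 3000 records
```

This only pays off when the function is deterministic and the input domain is much smaller than the record volume; a UDF keyed on a unique ID gains nothing from a cache.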

Practical Example: Enriching an Order Stream

Consider an orders table captured via CDC. The raw record contains order_id, customer_id, status_code, quantity, unit_price, tax_rate, country_code, and order_timestamp. A single computed-column query can produce a fully enriched record.

SELECT
  order_id,
  customer_id,
  quantity,
  unit_price,
  quantity * unit_price AS line_total,
  quantity * unit_price * (1 + COALESCE(tax_rate, 0)) AS total_with_tax,

  CASE status_code
    WHEN 1 THEN 'pending'
    WHEN 2 THEN 'processing'
    WHEN 3 THEN 'shipped'
    WHEN 4 THEN 'delivered'
    WHEN 5 THEN 'cancelled'
    ELSE 'unknown'
  END AS order_status_label,

  CASE
    WHEN country_code IN ('US', 'CA', 'MX') THEN 'North America'
    WHEN country_code IN ('GB', 'DE', 'FR', 'ES', 'IT') THEN 'Europe'
    WHEN country_code IN ('JP', 'KR', 'SG', 'AU') THEN 'APAC'
    ELSE 'Other'
  END AS customer_region,

  EXTRACT(MONTH FROM order_timestamp) AS order_month,
  EXTRACT(YEAR FROM order_timestamp) AS order_year,

  CASE __op
    WHEN 'c' THEN 'insert'
    WHEN 'u' THEN 'update'
    WHEN 'd' THEN 'delete'
    WHEN 'r' THEN 'snapshot'
    ELSE 'unknown'
  END AS _operation_type
FROM orders_cdc_stream;

Every record that arrives at the destination now carries line_total, total_with_tax, order_status_label, customer_region, order_month, order_year, and _operation_type - all without a single transformation in the warehouse.

Testing Computed Columns

Streaming transformations deserve the same test discipline as application code. A few practical techniques help catch errors before they reach production.

Sample-based validation. Run the query against a small batch of known records and compare the computed output to expected values. This catches off-by-one errors in CASE mappings and incorrect arithmetic.
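
As a concrete sketch of sample-based validation, the derivation logic can be replayed against a handful of known rows before deployment. This example uses Python's sqlite3 as a stand-in engine (Flink SQL syntax differs in places), with an invented orders_sample table and expected values:

```python
import sqlite3

# In-memory stand-in for a few known source records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_sample (order_id, quantity, unit_price, status_code)")
conn.executemany(
    "INSERT INTO orders_sample VALUES (?, ?, ?, ?)",
    [(1, 2, 10.0, 3), (2, 1, 99.5, 5), (3, 4, 0.25, 9)],  # 9 = unmapped code
)

# The computed columns under test, mirroring the pipeline query.
rows = conn.execute("""
    SELECT order_id,
           quantity * unit_price AS line_total,
           CASE status_code
             WHEN 3 THEN 'shipped'
             WHEN 5 THEN 'cancelled'
             ELSE 'unknown'
           END AS status_label
    FROM orders_sample
    ORDER BY order_id
""").fetchall()

expected = [(1, 20.0, "shipped"), (2, 99.5, "cancelled"), (3, 1.0, "unknown")]
assert rows == expected, f"computed columns drifted: {rows}"
print("sample validation passed")
```

The unmapped status code (9) is included deliberately: it exercises the ELSE branch, which is exactly where off-by-one CASE errors hide.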

Null handling tests. Explicitly send records with null values in every source field referenced by a computed column. Verify that COALESCE defaults apply correctly and that arithmetic on nulls does not silently produce null totals.

Boundary conditions. Test edge cases: zero quantities, negative prices, empty strings, timestamps at epoch zero, and country codes not covered by any CASE branch. Confirm that the ELSE clause or default logic produces a sensible result.

Schema evolution. If a source field is renamed or removed, the computed column expression will fail. Include a check in your deployment pipeline that validates computed column expressions against the current source schema before promoting changes.
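
The schema-evolution check can be as simple as verifying that every field an expression references still exists in the source schema. A minimal sketch under stated assumptions: expressions are plain strings, the schema is a known set of column names, and the crude regex scan stands in for a real SQL parser; all names here are hypothetical:

```python
import re

KEYWORDS = {"case", "when", "then", "else", "end", "and", "or", "in", "coalesce"}

def referenced_fields(expression: str) -> set:
    """Crude identifier scan -- strips string literals, drops SQL keywords."""
    no_literals = re.sub(r"'[^']*'", "", expression.lower())
    tokens = set(re.findall(r"[a-z_][a-z0-9_]*", no_literals))
    return tokens - KEYWORDS

def validate(expressions: dict, schema: set) -> list:
    """Return the computed columns that reference missing source fields."""
    broken = []
    for name, expr in expressions.items():
        missing = referenced_fields(expr) - schema
        if missing:
            broken.append(f"{name}: missing {sorted(missing)}")
    return broken

# Simulate a source where tax_rate was dropped by a schema change.
schema = {"quantity", "unit_price", "status_code"}
exprs = {
    "line_total": "quantity * unit_price",
    "status_label": "CASE status_code WHEN 1 THEN 'pending' ELSE 'unknown' END",
    "total_with_tax": "quantity * unit_price * (1 + COALESCE(tax_rate, 0))",
}

print(validate(exprs, schema))  # -> ["total_with_tax: missing ['tax_rate']"]
```

Running a check like this in the deployment pipeline turns a runtime job failure into a pre-merge error.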

Platforms like Streamkap let you preview transformation output before activating a pipeline, which shortens the feedback loop considerably.

When to Compute in Pipeline vs Destination

Not every derived field belongs in the pipeline. Here is a simple decision framework.

Compute in the pipeline when:

  • Multiple consumers need the same derived field.
  • The expression depends only on fields already present in the record (no joins to external tables).
  • You want to reduce query complexity and cost in the destination.
  • The derived field is part of a CDC audit trail (_operation_type, _change_timestamp).

Compute in the destination when:

  • The derivation requires joining to a large, frequently changing dimension table that is impractical to embed in the stream.
  • The logic is exploratory and changes often - analysts experimenting with new segments or scores.
  • The computation is destination-specific (e.g., a Snowflake-specific UDF that uses proprietary functionality).

Hybrid approach. Compute stable, widely used fields in the pipeline. Leave experimental or destination-specific derivations for views or materialized columns in the warehouse. This balances consistency with flexibility.

The goal is to enrich data as early as practical without coupling the pipeline to business logic that changes weekly. Stable derivations - name concatenation, status labels, region mapping, line totals - almost always belong in the stream. Volatile derivations - scoring models, A/B test bucketing, ad-hoc segmentation - are better left closer to the consumer.

By treating computed columns as a first-class part of your streaming architecture, you reduce duplication, improve consistency, and give every downstream system a richer, more useful record from the moment it arrives.