
Engineering

February 25, 2026

11 min read

Clickstream Analytics with Flink: Real-Time User Behavior Tracking

Learn how to build real-time clickstream analytics with Apache Flink. Covers sessionization, funnel analysis, page-view aggregations, and powering live dashboards.

TL;DR:

  • Flink session windows automatically group user clickstream events into logical sessions, replacing fragile batch-based sessionization with a continuous, event-driven approach.
  • Funnel analysis in Flink SQL uses MATCH_RECOGNIZE to detect multi-step conversion sequences in real time, giving you live conversion rates instead of next-day reports.
  • Page-view and engagement aggregations with tumbling and sliding windows feed real-time dashboards that update every few seconds.
  • Streamkap streams CDC data from your application databases into Kafka, so Flink can enrich clickstream events with user profiles, product data, and account information in real time.

Every click, scroll, page view, and button press on your website or application tells a story about what your users want. The problem is that most teams read that story a day late. They run a nightly batch job that sessionizes yesterday’s events, calculates yesterday’s funnels, and produces a dashboard that shows what happened 12 to 36 hours ago.

For product teams trying to react to a broken checkout flow, marketing teams running a time-sensitive campaign, or engineering teams debugging a performance regression, yesterday’s data is not good enough. You need to know what users are doing right now.

Apache Flink processes clickstream events as they arrive, sessionizes them in real time, computes funnel metrics continuously, and feeds live dashboards that update every few seconds. In this guide, we will build a clickstream analytics pipeline from the ground up using Flink SQL, covering event ingestion, sessionization, funnel analysis, page-view aggregations, and dashboard integration.

Clickstream Event Schema

A clickstream pipeline starts with well-structured events. Every user interaction should carry enough context to be useful on its own, without requiring a join to reconstruct meaning:

CREATE TABLE clickstream_events (
  user_id        STRING,
  anonymous_id   STRING,   -- for logged-out users
  event_type     STRING,   -- 'page_view', 'click', 'scroll', 'form_submit', 'purchase'
  page_url       STRING,
  page_title     STRING,
  referrer_url   STRING,
  utm_source     STRING,
  utm_medium     STRING,
  utm_campaign   STRING,
  device_type    STRING,   -- 'desktop', 'mobile', 'tablet'
  browser        STRING,
  country        STRING,
  event_time     TIMESTAMP(3),
  event_properties MAP<STRING, STRING>,  -- flexible key-value pairs
  WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'clickstream-events',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json',
  'scan.startup.mode' = 'latest-offset'
);

The event_properties map gives you flexibility to include event-specific data (product ID for a product view, button label for a click) without expanding the schema for every new event type. The 10-second watermark delay is appropriate for web traffic where events arrive with minimal network delay.
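Downstream queries read event-specific values straight out of the map. A minimal sketch, assuming a hypothetical product_id key that your tracking SDK sets on product page views:

```sql
-- 'product_id' is a hypothetical key set by the tracking SDK on product
-- page views; map lookups for missing keys simply return NULL.
SELECT
  COALESCE(user_id, anonymous_id) AS effective_user_id,
  event_properties['product_id']  AS product_id,
  event_time
FROM clickstream_events
WHERE event_type = 'page_view'
  AND page_url LIKE '%/product/%';
```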

For enrichment, you need a user profile table that Streamkap keeps in sync from your application database:

CREATE TABLE user_profiles (
  user_id       STRING,
  email         STRING,
  plan_type     STRING,     -- 'free', 'pro', 'enterprise'
  signup_date   DATE,
  company_name  STRING,
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'users-cdc',
  'properties.bootstrap.servers' = 'kafka:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);

Sessionization

Sessionization is the process of grouping a sequence of user events into logical visits. A session starts when a user arrives and ends when they leave or go inactive for a defined period. Most analytics tools define the session gap as 30 minutes of inactivity, but you should choose a gap that reflects your product’s usage patterns.

Flink’s session windows handle this natively:

CREATE VIEW user_sessions AS
SELECT
  COALESCE(user_id, anonymous_id) AS effective_user_id,
  SESSION_START(event_time, INTERVAL '30' MINUTE) AS session_start,
  SESSION_END(event_time, INTERVAL '30' MINUTE)   AS session_end,
  COUNT(*)                                         AS event_count,
  COUNT(DISTINCT page_url)                         AS pages_viewed,
  MIN(event_time)                                  AS first_event,
  MAX(event_time)                                  AS last_event,
  TIMESTAMPDIFF(SECOND,
    MIN(event_time),
    MAX(event_time))                               AS session_duration_seconds,
  FIRST_VALUE(referrer_url)                        AS landing_referrer,
  FIRST_VALUE(page_url)                            AS landing_page,
  LAST_VALUE(page_url)                             AS exit_page,
  FIRST_VALUE(utm_source)                          AS utm_source,
  FIRST_VALUE(utm_campaign)                        AS utm_campaign,
  FIRST_VALUE(device_type)                         AS device_type
FROM clickstream_events
GROUP BY
  COALESCE(user_id, anonymous_id),
  SESSION(event_time, INTERVAL '30' MINUTE);

This single query produces a rich session object that includes: the number of events and distinct pages viewed, session duration, the landing page and referrer (how the user arrived), the exit page (where they left), UTM attribution parameters, and device type. Each of these fields is useful for a different analytics question. One caveat: Flink's FIRST_VALUE and LAST_VALUE aggregates operate in arrival order rather than event-time order, so the landing and exit pages are best-effort when events arrive out of order.

The COALESCE(user_id, anonymous_id) expression handles both logged-in and anonymous users. For anonymous users, you track by the anonymous identifier assigned by your analytics SDK. When a user logs in mid-session, you would need a separate identity resolution step to stitch the anonymous and authenticated identities together.
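Once the session view exists, downstream rollups stay short. As an illustrative sketch (not part of the pipeline above), landing pages ranked by session volume and engagement:

```sql
-- Continuously updating rollup over the user_sessions view: which landing
-- pages start the most sessions, and how engaged those sessions are.
SELECT
  landing_page,
  COUNT(*)                      AS sessions,
  AVG(session_duration_seconds) AS avg_duration_seconds,
  AVG(pages_viewed)             AS avg_pages_per_session
FROM user_sessions
GROUP BY landing_page;
```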

Page-View Aggregations

Product and marketing teams want to know which pages are popular, which are growing, and which are losing traffic. Tumbling windows give you fixed-interval aggregates:

CREATE VIEW page_views_per_minute AS
SELECT
  page_url,
  page_title,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
  COUNT(*) AS view_count,
  COUNT(DISTINCT COALESCE(user_id, anonymous_id)) AS unique_visitors
FROM clickstream_events
WHERE event_type = 'page_view'
GROUP BY
  page_url,
  page_title,
  TUMBLE(event_time, INTERVAL '1' MINUTE);

For dashboard trend lines, sliding windows produce smoother output. This query computes a rolling 15-minute view count, updated every minute:

CREATE VIEW page_views_rolling AS
SELECT
  page_url,
  HOP_START(event_time, INTERVAL '1' MINUTE, INTERVAL '15' MINUTE) AS window_start,
  COUNT(*) AS view_count,
  COUNT(DISTINCT COALESCE(user_id, anonymous_id)) AS unique_visitors
FROM clickstream_events
WHERE event_type = 'page_view'
GROUP BY
  page_url,
  HOP(event_time, INTERVAL '1' MINUTE, INTERVAL '15' MINUTE);

You can break these aggregations down further by device type, country, or UTM source to answer questions like “Is our mobile traffic growing?” or “How much traffic is this campaign driving right now?”

CREATE VIEW traffic_by_source AS
SELECT
  utm_source,
  utm_campaign,
  device_type,
  TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start,
  COUNT(*) AS event_count,
  COUNT(DISTINCT COALESCE(user_id, anonymous_id)) AS unique_users
FROM clickstream_events
WHERE event_type = 'page_view'
GROUP BY
  utm_source,
  utm_campaign,
  device_type,
  TUMBLE(event_time, INTERVAL '5' MINUTE);

Funnel Analysis

Funnel analysis measures how users progress through a defined sequence of steps. An e-commerce funnel might be: homepage, product page, add to cart, checkout, and purchase. A SaaS signup funnel might be: landing page, pricing page, signup form, email verification, and first login.

In batch systems, funnel analysis requires scanning all of a user’s events, ordering them by time, and checking whether the steps occurred in sequence. Flink’s MATCH_RECOGNIZE clause does this on the stream:

CREATE VIEW checkout_funnel AS
SELECT
  user_id,
  product_view_time,
  add_to_cart_time,
  checkout_time,
  purchase_time,
  TIMESTAMPDIFF(SECOND, product_view_time, purchase_time) AS funnel_duration_seconds
FROM clickstream_events
MATCH_RECOGNIZE (
  PARTITION BY user_id
  ORDER BY event_time
  MEASURES
    A.event_time AS product_view_time,
    B.event_time AS add_to_cart_time,
    C.event_time AS checkout_time,
    D.event_time AS purchase_time
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  -- The reluctant G*? variables (no DEFINE entry, so they match any row)
  -- absorb unrelated events such as scrolls between funnel steps. A strict
  -- A B C D pattern would only match if the four events arrived back to back.
  PATTERN (A G1*? B G2*? C G3*? D) WITHIN INTERVAL '1' HOUR
  DEFINE
    A AS A.event_type = 'page_view'
         AND A.page_url LIKE '%/product/%',
    B AS B.event_type = 'click'
         AND B.event_properties['action'] = 'add_to_cart',
    C AS C.event_type = 'page_view'
         AND C.page_url LIKE '%/checkout%',
    D AS D.event_type = 'purchase'
) AS funnel;

This pattern looks for four events in sequence: a product page view, an add-to-cart click, a checkout page view, and a purchase event, all within one hour. When a user completes the full sequence, Flink emits a match with the timestamps of each step.
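The matches can feed simple rollups of their own. A sketch of a continuously updating summary built on the checkout_funnel view above:

```sql
-- Updating aggregate over matched funnels: volume and time to purchase.
SELECT
  COUNT(*)                     AS completed_funnels,
  AVG(funnel_duration_seconds) AS avg_seconds_to_purchase,
  MAX(funnel_duration_seconds) AS slowest_seconds_to_purchase
FROM checkout_funnel;
```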

To compute conversion rates at each step, you need to count how many users reached each step. Here is a simpler approach using windowed aggregation:

CREATE VIEW funnel_step_counts AS
SELECT
  TUMBLE_START(event_time, INTERVAL '1' HOUR) AS window_start,
  COUNT(DISTINCT CASE
    WHEN event_type = 'page_view' AND page_url LIKE '%/product/%'
    THEN user_id END) AS product_viewers,
  COUNT(DISTINCT CASE
    WHEN event_type = 'click'
     AND event_properties['action'] = 'add_to_cart'
    THEN user_id END) AS cart_adders,
  COUNT(DISTINCT CASE
    WHEN event_type = 'page_view' AND page_url LIKE '%/checkout%'
    THEN user_id END) AS checkout_starters,
  COUNT(DISTINCT CASE
    WHEN event_type = 'purchase'
    THEN user_id END) AS purchasers
FROM clickstream_events
GROUP BY TUMBLE(event_time, INTERVAL '1' HOUR);

This gives you hourly counts of users at each funnel step. The ratio between consecutive steps is your conversion rate. You can visualize this as a funnel chart on a dashboard that updates every hour.
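The step-to-step ratios can themselves be computed in Flink SQL. A sketch of a derived view, with NULLIF guarding against division by zero in windows where a step had no users:

```sql
CREATE VIEW funnel_conversion_rates AS
SELECT
  window_start,
  CAST(cart_adders       AS DOUBLE) / NULLIF(product_viewers, 0)   AS view_to_cart_rate,
  CAST(checkout_starters AS DOUBLE) / NULLIF(cart_adders, 0)       AS cart_to_checkout_rate,
  CAST(purchasers        AS DOUBLE) / NULLIF(checkout_starters, 0) AS checkout_to_purchase_rate,
  CAST(purchasers        AS DOUBLE) / NULLIF(product_viewers, 0)   AS overall_conversion_rate
FROM funnel_step_counts;
```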

Enriching Clickstream with User Data

Raw clickstream events tell you what happened. Enriched events tell you who did it and why it matters. Joining clickstream data with user profiles lets you segment your analytics by plan type, company size, or signup cohort:

CREATE VIEW enriched_clickstream AS
SELECT
  c.*,
  u.plan_type,
  u.signup_date,
  u.company_name,
  TIMESTAMPDIFF(DAY, CAST(u.signup_date AS TIMESTAMP(3)), c.event_time) AS days_since_signup
FROM clickstream_events c
LEFT JOIN user_profiles u
  ON c.user_id = u.user_id;

With enriched data, you can answer questions that raw clickstream cannot:

  • “Are enterprise users engaging with our new feature page more than free users?”
  • “Do users who signed up in the last 7 days have higher bounce rates?”
  • “Which companies are spending the most time on our pricing page?”

The LEFT JOIN ensures you keep events from anonymous users who have no profile match. Streamkap’s CDC streaming keeps the user_profiles table current, so this join always reflects the latest user data without requiring a batch sync.
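The first question above, for instance, becomes a continuously updating aggregate over the enriched view. A sketch, where the /features/ URL pattern is a hypothetical path for the feature page:

```sql
-- Feature-page engagement by plan tier; counts update as events arrive.
-- Anonymous visitors fall into a NULL plan_type group.
SELECT
  plan_type,
  COUNT(*)                                        AS feature_page_views,
  COUNT(DISTINCT COALESCE(user_id, anonymous_id)) AS unique_viewers
FROM enriched_clickstream
WHERE event_type = 'page_view'
  AND page_url LIKE '%/features/%'
GROUP BY plan_type;
```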

Real-Time Dashboard Sinks

The final step is getting aggregated metrics to a place where your dashboards can read them. There are two main approaches:

Direct database writes. Write aggregates to PostgreSQL or ClickHouse, and point your dashboard tool (Grafana, Metabase, Superset) at the database:

CREATE TABLE session_metrics_sink (
  window_start         TIMESTAMP(3),
  total_sessions       BIGINT,
  avg_session_duration DOUBLE,
  avg_pages_per_session DOUBLE,
  bounce_rate          DOUBLE,
  PRIMARY KEY (window_start) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:postgresql://analytics-db:5432/clickstream',
  'table-name' = 'session_metrics',
  'driver' = 'org.postgresql.Driver'
);

-- Bucket sessions into hourly rows; grouping by the raw session_start
-- timestamp would emit a near-unique row per session instead of a metric.
INSERT INTO session_metrics_sink
SELECT
  FLOOR(session_start TO HOUR) AS window_start,
  COUNT(*) AS total_sessions,
  AVG(session_duration_seconds) AS avg_session_duration,
  AVG(pages_viewed) AS avg_pages_per_session,
  CAST(SUM(CASE WHEN pages_viewed = 1 THEN 1 ELSE 0 END) AS DOUBLE)
    / COUNT(*) AS bounce_rate
FROM user_sessions
GROUP BY FLOOR(session_start TO HOUR);

Kafka topic output. Write to a Kafka topic and have your dashboard tool consume from it directly. This works well with tools that support streaming data sources:

CREATE TABLE funnel_metrics_sink (
  window_start      TIMESTAMP(3),
  product_viewers   BIGINT,
  cart_adders       BIGINT,
  checkout_starters BIGINT,
  purchasers        BIGINT
) WITH (
  'connector' = 'kafka',
  'topic' = 'funnel-metrics',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json'
);

INSERT INTO funnel_metrics_sink
SELECT * FROM funnel_step_counts;

Handling Scale

Clickstream data is high-volume by nature. A website with 10 million daily active users, each averaging one session of 50 events per day, produces 500 million events per day, or roughly 6,000 events per second sustained, with spikes much higher.

Several Flink configuration choices matter at this scale:

Parallelism. Set your Kafka consumer parallelism to match the number of Kafka partitions. If your clickstream topic has 32 partitions, run 32 parallel consumer instances.

Key distribution. Partition your Kafka topic by user ID to ensure all events for a given user land on the same partition. This is required for sessionization and funnel analysis to work correctly, since those operations need to see all of a user’s events in order.

State TTL. Session windows and pattern matching accumulate state for every active user. Configure a state TTL (time-to-live) that matches your session gap plus some buffer. For a 30-minute session gap, a 2-hour state TTL ensures completed sessions are cleaned up without losing in-progress ones.

Checkpoint storage. Use an S3 or GCS-compatible object store for checkpoints. For high-volume clickstream workloads, checkpoint sizes can reach tens of gigabytes, and local disk will not cut it.
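In Flink's SQL client, these knobs map to a handful of configuration options. A sketch with illustrative values matching the numbers above (the checkpoint bucket path is a placeholder):

```sql
-- Match source parallelism to the Kafka topic's 32 partitions.
SET 'parallelism.default' = '32';

-- Expire idle per-key state: 30-minute session gap plus buffer.
SET 'table.exec.state.ttl' = '2 h';

-- Checkpoint frequently, and to durable object storage rather than local disk.
SET 'execution.checkpointing.interval' = '1 min';
SET 'state.checkpoints.dir' = 's3://my-bucket/flink-checkpoints';
```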

Streamkap’s managed Flink handles the provisioning and tuning of these infrastructure concerns. You specify your parallelism and state requirements, and Streamkap manages the cluster, checkpoint storage, and resource allocation.

From Clicks to Decisions

Building clickstream analytics on Flink gives your product and marketing teams a live view of user behavior instead of a historical report. Sessionization happens as users browse. Funnel conversion rates update as users complete (or abandon) checkout flows. Page popularity shifts show up in minutes, not in tomorrow’s dashboard.

The technical pattern is consistent across all of these use cases: events flow from your website or application into Kafka, Flink processes them with session windows, pattern matching, and aggregations, and results land in a serving layer that powers dashboards and alerts.

If your team is currently waiting for a batch pipeline to produce yesterday’s clickstream report, start by streaming one metric to a live dashboard. Session counts per minute, or conversion rate per hour, or active users right now. Once your product team sees live data for the first time, the batch pipeline will feel impossibly slow, and you will have the buy-in to move the rest of your analytics to the stream.

Streamkap provides the managed Kafka and Flink infrastructure to run this pipeline at production scale. CDC streaming from your application databases keeps enrichment data current, so your clickstream analytics always reflect the true state of your users, products, and accounts.