Clickstream Analytics with Flink: Real-Time User Behavior Tracking
Learn how to build real-time clickstream analytics with Apache Flink. Covers sessionization, funnel analysis, page-view aggregations, and powering live dashboards.
Every click, scroll, page view, and button press on your website or application tells a story about what your users want. The problem is that most teams read that story a day late. They run a nightly batch job that sessionizes yesterday’s events, calculates yesterday’s funnels, and produces a dashboard that shows what happened 12 to 36 hours ago.
For product teams trying to react to a broken checkout flow, marketing teams running a time-sensitive campaign, or engineering teams debugging a performance regression, yesterday’s data is not good enough. You need to know what users are doing right now.
Apache Flink processes clickstream events as they arrive, sessionizes them in real time, computes funnel metrics continuously, and feeds live dashboards that update every few seconds. In this guide, we will build a clickstream analytics pipeline from the ground up using Flink SQL, covering event ingestion, sessionization, funnel analysis, page-view aggregations, and dashboard integration.
Clickstream Event Schema
A clickstream pipeline starts with well-structured events. Every user interaction should carry enough context to be useful on its own, without requiring a join to reconstruct meaning:
CREATE TABLE clickstream_events (
  user_id STRING,
  anonymous_id STRING,      -- for logged-out users
  event_type STRING,        -- 'page_view', 'click', 'scroll', 'form_submit', 'purchase'
  page_url STRING,
  page_title STRING,
  referrer_url STRING,
  utm_source STRING,
  utm_medium STRING,
  utm_campaign STRING,
  device_type STRING,       -- 'desktop', 'mobile', 'tablet'
  browser STRING,
  country STRING,
  event_time TIMESTAMP(3),
  event_properties MAP<STRING, STRING>,  -- flexible key-value pairs
  WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'clickstream-events',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json',
  'scan.startup.mode' = 'latest-offset'
);
The event_properties map gives you flexibility to include event-specific data (product ID for a product view, button label for a click) without expanding the schema for every new event type. The 10-second watermark delay is appropriate for web traffic where events arrive with minimal network delay.
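To make the schema concrete, here is what a single page-view event might look like on the wire, shown as a Python dict (all field values are illustrative, not from a real system):

```python
# A hypothetical 'page_view' event matching the schema above.
event = {
    "user_id": "u_1842",
    "anonymous_id": None,  # set instead of user_id for logged-out users
    "event_type": "page_view",
    "page_url": "https://example.com/product/42",
    "page_title": "Product 42",
    "referrer_url": "https://www.google.com/",
    "utm_source": "google",
    "utm_medium": "cpc",
    "utm_campaign": "spring_sale",
    "device_type": "mobile",
    "browser": "Safari",
    "country": "DE",
    "event_time": "2024-05-01T09:30:15.123Z",
    # event-specific context rides in the flexible properties map
    "event_properties": {"product_id": "42", "category": "shoes"},
}
```

The fixed columns carry the context every event needs; anything specific to one event type goes in `event_properties`.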
For enrichment, you need a user profile table that Streamkap keeps in sync from your application database:
CREATE TABLE user_profiles (
  user_id STRING,
  email STRING,
  plan_type STRING,    -- 'free', 'pro', 'enterprise'
  signup_date DATE,
  company_name STRING,
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'users-cdc',
  'properties.bootstrap.servers' = 'kafka:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);
Sessionization
Sessionization is the process of grouping a sequence of user events into logical visits. A session starts when a user arrives and ends when they leave or go inactive for a defined period. Most analytics tools define the session gap as 30 minutes of inactivity, but you should choose a gap that reflects your product’s usage patterns.
Flink’s session windows handle this natively:
CREATE VIEW user_sessions AS
SELECT
  COALESCE(user_id, anonymous_id) AS effective_user_id,
  SESSION_START(event_time, INTERVAL '30' MINUTE) AS session_start,
  SESSION_END(event_time, INTERVAL '30' MINUTE) AS session_end,
  COUNT(*) AS event_count,
  COUNT(DISTINCT page_url) AS pages_viewed,
  MIN(event_time) AS first_event,
  MAX(event_time) AS last_event,
  TIMESTAMPDIFF(SECOND,
    MIN(event_time),
    MAX(event_time)) AS session_duration_seconds,
  FIRST_VALUE(referrer_url) AS landing_referrer,
  FIRST_VALUE(page_url) AS landing_page,
  LAST_VALUE(page_url) AS exit_page,
  FIRST_VALUE(utm_source) AS utm_source,
  FIRST_VALUE(utm_campaign) AS utm_campaign,
  FIRST_VALUE(device_type) AS device_type
FROM clickstream_events
GROUP BY
  COALESCE(user_id, anonymous_id),
  SESSION(event_time, INTERVAL '30' MINUTE);
This single query produces a rich session object that includes: the number of events and distinct pages viewed, session duration, the landing page and referrer (how the user arrived), the exit page (where they left), UTM attribution parameters, and device type. Each of these fields is useful for a different analytics question.
The COALESCE(user_id, anonymous_id) expression handles both logged-in and anonymous users. For anonymous users, you track by the anonymous identifier assigned by your analytics SDK. When a user logs in mid-session, you would need a separate identity resolution step to stitch the anonymous and authenticated identities together.
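Outside Flink, the gap-based grouping itself is simple to see. This Python sketch (event values are hypothetical) splits one user's time-ordered events into sessions wherever the gap between consecutive events exceeds 30 minutes, which is exactly what the session window does per key:

```python
from datetime import datetime, timedelta

# One user's events, ordered by time (values are illustrative).
events = [
    {"page_url": "/home",       "ts": datetime(2024, 1, 1, 9, 0)},
    {"page_url": "/product/42", "ts": datetime(2024, 1, 1, 9, 5)},
    # 45-minute gap of inactivity -> a new session starts here
    {"page_url": "/pricing",    "ts": datetime(2024, 1, 1, 9, 50)},
]

def sessionize(events, gap=timedelta(minutes=30)):
    """Group one user's time-ordered events into sessions split on inactivity gaps."""
    sessions, current = [], []
    for event in events:
        if current and event["ts"] - current[-1]["ts"] > gap:
            sessions.append(current)  # gap exceeded: close the session
            current = []
        current.append(event)
    if current:
        sessions.append(current)
    return sessions

for s in sessionize(events):
    duration = (s[-1]["ts"] - s[0]["ts"]).total_seconds()
    print(len(s), s[0]["page_url"], s[-1]["page_url"], duration)
```

The example events above produce two sessions: a two-event visit landing on /home, and a later single-event visit to /pricing. Flink maintains this same logic as keyed state, per user, across the whole stream.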
Page-View Aggregations
Product and marketing teams want to know which pages are popular, which are growing, and which are losing traffic. Tumbling windows give you fixed-interval aggregates:
CREATE VIEW page_views_per_minute AS
SELECT
  page_url,
  page_title,
  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
  COUNT(*) AS view_count,
  COUNT(DISTINCT COALESCE(user_id, anonymous_id)) AS unique_visitors
FROM clickstream_events
WHERE event_type = 'page_view'
GROUP BY
  page_url,
  page_title,
  TUMBLE(event_time, INTERVAL '1' MINUTE);
For dashboard trend lines, sliding windows produce smoother output. This query computes a rolling 15-minute view count, updated every minute:
CREATE VIEW page_views_rolling AS
SELECT
  page_url,
  HOP_START(event_time, INTERVAL '1' MINUTE, INTERVAL '15' MINUTE) AS window_start,
  COUNT(*) AS view_count,
  COUNT(DISTINCT COALESCE(user_id, anonymous_id)) AS unique_visitors
FROM clickstream_events
WHERE event_type = 'page_view'
GROUP BY
  page_url,
  HOP(event_time, INTERVAL '1' MINUTE, INTERVAL '15' MINUTE);
You can break these aggregations down further by device type, country, or UTM source to answer questions like “Is our mobile traffic growing?” or “How much traffic is this campaign driving right now?”
CREATE VIEW traffic_by_source AS
SELECT
  utm_source,
  utm_campaign,
  device_type,
  TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start,
  COUNT(*) AS event_count,
  COUNT(DISTINCT COALESCE(user_id, anonymous_id)) AS unique_users
FROM clickstream_events
WHERE event_type = 'page_view'
GROUP BY
  utm_source,
  utm_campaign,
  device_type,
  TUMBLE(event_time, INTERVAL '5' MINUTE);
Funnel Analysis
Funnel analysis measures how users progress through a defined sequence of steps. An e-commerce funnel might be: homepage, product page, add to cart, checkout, and purchase. A SaaS signup funnel might be: landing page, pricing page, signup form, email verification, and first login.
In batch systems, funnel analysis requires scanning all of a user’s events, ordering them by time, and checking whether the steps occurred in sequence. Flink’s MATCH_RECOGNIZE clause does this on the stream:
CREATE VIEW checkout_funnel AS
SELECT
  user_id,
  product_view_time,
  add_to_cart_time,
  checkout_time,
  purchase_time,
  TIMESTAMPDIFF(SECOND, product_view_time, purchase_time) AS funnel_duration_seconds
FROM clickstream_events
MATCH_RECOGNIZE (
  PARTITION BY user_id
  ORDER BY event_time
  MEASURES
    A.event_time AS product_view_time,
    B.event_time AS add_to_cart_time,
    C.event_time AS checkout_time,
    D.event_time AS purchase_time
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (A B C D) WITHIN INTERVAL '1' HOUR
  DEFINE
    A AS A.event_type = 'page_view'
         AND A.page_url LIKE '%/product/%',
    B AS B.event_type = 'click'
         AND B.event_properties['action'] = 'add_to_cart',
    C AS C.event_type = 'page_view'
         AND C.page_url LIKE '%/checkout%',
    D AS D.event_type = 'purchase'
) AS funnel;
This pattern looks for four events in sequence: a product page view, an add-to-cart click, a checkout page view, and a purchase event, all within one hour. When a user completes the full sequence, Flink emits a match with the timestamps of each step. One caveat: PATTERN (A B C D) uses strict contiguity, so the four events must be consecutive rows for that user, and an intervening scroll or unrelated page view breaks the match. If your stream carries such events, relax the pattern with permissive gap variables between steps (for example, PATTERN (A G1* B G2* C G3* D), leaving the G variables out of DEFINE so they match any row).
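The matching logic is easier to see outside SQL. This Python sketch (event shapes and values are hypothetical) advances through the four step predicates in order, skipping unrelated events in between, and enforces the one-hour bound; it mirrors the intent of the funnel rather than Flink's exact strict-contiguity semantics:

```python
from datetime import datetime, timedelta

# One user's events, ordered by time (shapes and values are illustrative).
events = [
    {"event_type": "page_view", "page_url": "/product/42",   "ts": datetime(2024, 1, 1, 10, 0)},
    {"event_type": "scroll",    "page_url": "/product/42",   "ts": datetime(2024, 1, 1, 10, 1)},
    {"event_type": "click",     "page_url": "/product/42",
     "properties": {"action": "add_to_cart"},                "ts": datetime(2024, 1, 1, 10, 2)},
    {"event_type": "page_view", "page_url": "/checkout",     "ts": datetime(2024, 1, 1, 10, 5)},
    {"event_type": "purchase",  "page_url": "/checkout/done","ts": datetime(2024, 1, 1, 10, 9)},
]

# Predicates mirroring the four DEFINE clauses in the SQL above.
STEPS = [
    lambda e: e["event_type"] == "page_view" and "/product/" in e["page_url"],
    lambda e: e["event_type"] == "click"
              and e.get("properties", {}).get("action") == "add_to_cart",
    lambda e: e["event_type"] == "page_view" and "/checkout" in e["page_url"],
    lambda e: e["event_type"] == "purchase",
]

def match_funnel(events, within=timedelta(hours=1)):
    """Return the step timestamps of the first complete A->B->C->D match, else None."""
    times = []
    for e in events:  # assumed already ordered by timestamp
        if len(times) < len(STEPS) and STEPS[len(times)](e):
            times.append(e["ts"])
            if len(times) == len(STEPS):
                # full sequence found; enforce the time bound
                return times if times[-1] - times[0] <= within else None
    return None

print(match_funnel(events) is not None)  # True: all four steps within 9 minutes
```

Flink runs this per user, as managed state, with the WITHIN clause pruning partial matches that age out.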
To compute conversion rates at each step, you need to count how many users reached each step. A simpler approach uses windowed aggregation, with the caveat that it counts users who performed each action within the window without enforcing that the steps happened in order:
CREATE VIEW funnel_step_counts AS
SELECT
  TUMBLE_START(event_time, INTERVAL '1' HOUR) AS window_start,
  COUNT(DISTINCT CASE
    WHEN event_type = 'page_view' AND page_url LIKE '%/product/%'
    THEN user_id END) AS product_viewers,
  COUNT(DISTINCT CASE
    WHEN event_type = 'click' AND event_properties['action'] = 'add_to_cart'
    THEN user_id END) AS cart_adders,
  COUNT(DISTINCT CASE
    WHEN event_type = 'page_view' AND page_url LIKE '%/checkout%'
    THEN user_id END) AS checkout_starters,
  COUNT(DISTINCT CASE
    WHEN event_type = 'purchase'
    THEN user_id END) AS purchasers
FROM clickstream_events
GROUP BY TUMBLE(event_time, INTERVAL '1' HOUR);
This gives you hourly counts of users at each funnel step. The ratio between consecutive steps is your conversion rate. You can visualize this as a funnel chart on a dashboard that updates every hour.
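The ratio arithmetic is straightforward. Here it is on a hypothetical hour of step counts (numbers invented for illustration):

```python
# Hypothetical hourly step counts, in funnel order, as produced by funnel_step_counts.
steps = [
    ("product_viewers",   1200),
    ("cart_adders",        480),
    ("checkout_starters",  300),
    ("purchasers",         210),
]

# Step-to-step conversion: each count divided by the previous step's count.
for (prev_name, prev_count), (name, count) in zip(steps, steps[1:]):
    print(f"{prev_name} -> {name}: {count / prev_count:.1%}")
# product_viewers -> cart_adders: 40.0%
# cart_adders -> checkout_starters: 62.5%
# checkout_starters -> purchasers: 70.0%

# Overall conversion: last step divided by first step.
print(f"overall: {steps[-1][1] / steps[0][1]:.1%}")  # overall: 17.5%
```

The biggest drop between consecutive steps is where your funnel is leaking, which is usually the first place to investigate.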
Enriching Clickstream with User Data
Raw clickstream events tell you what happened. Enriched events tell you who did it and why it matters. Joining clickstream data with user profiles lets you segment your analytics by plan type, company size, or signup cohort:
CREATE VIEW enriched_clickstream AS
SELECT
  c.*,
  u.plan_type,
  u.signup_date,
  u.company_name,
  TIMESTAMPDIFF(DAY, u.signup_date, CAST(c.event_time AS DATE)) AS days_since_signup
FROM clickstream_events c
LEFT JOIN user_profiles u
  ON c.user_id = u.user_id;
With enriched data, you can answer questions that raw clickstream cannot:
- “Are enterprise users engaging with our new feature page more than free users?”
- “Do users who signed up in the last 7 days have higher bounce rates?”
- “Which companies are spending the most time on our pricing page?”
The LEFT JOIN ensures you keep events from anonymous users who have no profile match. Streamkap’s CDC streaming keeps the user_profiles table current, so this join always reflects the latest user data without requiring a batch sync.
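Conceptually, this enrichment is a keyed left join. The following Python sketch (profile values are hypothetical) mirrors the query's behavior for one event at a time, including the pass-through for anonymous events:

```python
from datetime import date

# Hypothetical profile table, keyed by user_id, kept current by CDC.
profiles = {"u1": {"plan_type": "pro", "signup_date": date(2024, 1, 1)}}

def enrich(event, profiles):
    """Left-join one event against the profile table; anonymous events pass through."""
    profile = profiles.get(event.get("user_id"))
    if profile is None:
        # no match (anonymous or unknown user): keep the event, null the profile fields
        return {**event, "plan_type": None, "days_since_signup": None}
    days = (event["event_date"] - profile["signup_date"]).days
    return {**event, "plan_type": profile["plan_type"], "days_since_signup": days}

print(enrich({"user_id": "u1", "event_date": date(2024, 1, 11)}, profiles))
print(enrich({"anonymous_id": "a9", "event_date": date(2024, 1, 11)}, profiles))
```

The difference in Flink is that the "table" side is itself a stream: when a profile row changes upstream, subsequent events join against the new value automatically.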
Real-Time Dashboard Sinks
The final step is getting aggregated metrics to a place where your dashboards can read them. There are two main approaches:
Direct database writes. Write aggregates to PostgreSQL or ClickHouse, and point your dashboard tool (Grafana, Metabase, Superset) at the database:
CREATE TABLE session_metrics_sink (
  window_start TIMESTAMP(3),
  total_sessions BIGINT,
  avg_session_duration DOUBLE,
  avg_pages_per_session DOUBLE,
  bounce_rate DOUBLE,
  PRIMARY KEY (window_start) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:postgresql://analytics-db:5432/clickstream',
  'table-name' = 'session_metrics',
  'driver' = 'org.postgresql.Driver'
);
INSERT INTO session_metrics_sink
SELECT
  FLOOR(session_start TO HOUR) AS window_start,
  COUNT(*) AS total_sessions,
  AVG(session_duration_seconds) AS avg_session_duration,
  AVG(pages_viewed) AS avg_pages_per_session,
  CAST(SUM(CASE WHEN pages_viewed = 1 THEN 1 ELSE 0 END) AS DOUBLE)
    / COUNT(*) AS bounce_rate
FROM user_sessions
GROUP BY FLOOR(session_start TO HOUR);
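The bounce-rate expression is just the share of sessions that viewed a single page. On a hypothetical batch of per-session page counts (numbers invented for illustration):

```python
# Hypothetical pages_viewed values for one hour of sessions,
# as emitted by the user_sessions view.
pages_viewed = [1, 4, 1, 2, 7, 1, 3, 1]

# A "bounce" is a session that viewed exactly one page.
bounces = sum(1 for p in pages_viewed if p == 1)
bounce_rate = bounces / len(pages_viewed)
print(f"{bounce_rate:.1%}")  # 4 of 8 sessions viewed a single page -> 50.0%
```

The CAST to DOUBLE in the SQL serves the same purpose as Python's float division: without it, an integer division would round the rate down to zero.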
Kafka topic output. Write to a Kafka topic and have your dashboard tool consume from it directly. This works well with tools that support streaming data sources:
CREATE TABLE funnel_metrics_sink (
  window_start TIMESTAMP(3),
  product_viewers BIGINT,
  cart_adders BIGINT,
  checkout_starters BIGINT,
  purchasers BIGINT
) WITH (
  'connector' = 'kafka',
  'topic' = 'funnel-metrics',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json'
);
INSERT INTO funnel_metrics_sink
SELECT * FROM funnel_step_counts;
Handling Scale
Clickstream data is high-volume by nature. A website with 10 million daily active users, each averaging one session of about 50 events per day, produces 500 million events per day, or roughly 6,000 events per second sustained, with spikes much higher.
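A quick back-of-envelope check of that figure (inputs are the assumptions stated above):

```python
# Sizing assumptions from the text, not measured values.
daily_active_users = 10_000_000
events_per_user_per_day = 50          # roughly one 50-event session per user

events_per_day = daily_active_users * events_per_user_per_day
sustained_eps = events_per_day / 86_400   # seconds per day

print(events_per_day, round(sustained_eps))  # 500000000 5787
```

Peak traffic is typically a small multiple of this sustained rate, so you should capacity-plan for the spikes, not the average.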
Several Flink configuration choices matter at this scale:
Parallelism. Set your Kafka consumer parallelism to match the number of Kafka partitions. If your clickstream topic has 32 partitions, run 32 parallel consumer instances.
Key distribution. Partition your Kafka topic by user ID to ensure all events for a given user land on the same partition. This is required for sessionization and funnel analysis to work correctly, since those operations need to see all of a user’s events in order.
State TTL. Session windows and pattern matching accumulate state for every active user. Configure a state TTL (time-to-live) that matches your session gap plus some buffer. For a 30-minute session gap, a 2-hour state TTL ensures completed sessions are cleaned up without losing in-progress ones.
Checkpoint storage. Use an S3 or GCS-compatible object store for checkpoints. For high-volume clickstream workloads, checkpoint sizes can reach tens of gigabytes, and local disk will not cut it.
Streamkap’s managed Flink handles the provisioning and tuning of these infrastructure concerns. You specify your parallelism and state requirements, and Streamkap manages the cluster, checkpoint storage, and resource allocation.
From Clicks to Decisions
Building clickstream analytics on Flink gives your product and marketing teams a live view of user behavior instead of a historical report. Sessionization happens as users browse. Funnel conversion rates update as users complete (or abandon) checkout flows. Page popularity shifts show up in minutes, not in tomorrow’s dashboard.
The technical pattern is consistent across all of these use cases: events flow from your website or application into Kafka, Flink processes them with session windows, pattern matching, and aggregations, and results land in a serving layer that powers dashboards and alerts.
If your team is currently waiting for a batch pipeline to produce yesterday’s clickstream report, start by streaming one metric to a live dashboard. Session counts per minute, or conversion rate per hour, or active users right now. Once your product team sees live data for the first time, the batch pipeline will feel impossibly slow, and you will have the buy-in to move the rest of your analytics to the stream.
Streamkap provides the managed Kafka and Flink infrastructure to run this pipeline at production scale. CDC streaming from your application databases keeps enrichment data current, so your clickstream analytics always reflect the true state of your users, products, and accounts.