/ SQL

Quick guide: Calculate Cohort Retention Analysis with SQL

Cohort Retention Analysis is a powerful thing that most business owners need to look at.

Retention: If first-time user A goes to the store on Week 1, and returns to the store the next week, he is a returned user. If user B also goes to the store on Week 1 and does not return the next week, he's a bounced user, that basically means you lose him as a user (despite all the marketing money you spent to lure him to your store).

Cohort Analysis is a technique to see how variables change in different groups given different starting conditions (a cohort).

Defining Requirements

In this post, we'll show you how to build the below Cohort Retention:

The above chart tells you a few things:

  • We cohort users by their signup month. This is the users table below.
  • For each cohort, we show the population (how many people) in that cohort in the second column
  • An activity is an action user did to the site (this could be simply view the page, or as specific as making a purchase). We denote this as table activities below.

We need the following tables and fields:

users:
- id
- timestamp

activities:
- user_id
- timestamp

Bucketing Users Into Cohort

First we bucket them into different cohort by their sign up month, and store into cohort_items

-- (user_id, cohort_month), each 
with cohort_items as (
  select
    date_trunc('month', U.timestamp)::date as cohort_month,
    id as user_id
  from users U
  order by 1, 2
)

After that, we build user_activities which

-- (user_id, month_number): user X has activity in month number X
WITH user_activities as (
  select
    A.user_id,
    MONTH_DIFF(
      date_trunc('month', A.timestamp)::date,
      C.cohort_month
    ) as month_number
  from public.activities A
  left join cohort_items C ON A.user_id = C.user_id
  group by 1, 2
)

The above would return all the pairs of (user_id, month_number) that indicates if a user is active in that month after their original signup date.

| user | month_number |
| Alex | 0            |
| Alex | 1            |
| Bob  | 0            |
| Bob  | 2            |

The MONTH_DIFF is a user-defined function that takes in 2 dates, and return the number of months between them.

Cohort Size: is simply how many users are in each group

-- (cohort_month, size)
with cohort_size as (
  select cohort_month, count(1) as num_users
  from cohort_items
  group by 1
  order by 1
),

And finally, putting them together with the below:

-- (cohort_month, month_number, cnt)
with retention_table as (
  select
    C.cohort_month,
    A.month_number,
    count(1) as num_users
  from user_activities A
  left join cohort_items C ON A.user_id = C.user_id
  group by 1, 2
)
-- our final value: (cohort_month, size, month_number, percentage)
select
  B.cohort_month,
  S.num_users as total_users,
  B.month_number,
  B.num_users::float * 100 / S.num_users as percentage
from retention_table B
left join cohort_size S ON B.cohort_month = S.cohort_month
order by 1, 3
where B.cohort_month IS NOT NULL

And you would get:

With Holistics, we can automatically turn them into beautiful cohort table like so:

The Full SQL

Here's the full SQL for your reference. Note: you just need to change the table names and column names inside the
first 2 WITH clauses, and the rest should work as per normal. [View the full SQL (https://gist.github.com/nvquanghuy/bd0fda7b88e5b2fd8e46e047e391d25f)

— — — — — — — — — — — —

Having problems finding a simple and affordable data reporting system for your startups? Check us out at holistics.io.

Huy Nguyen

Huy Nguyen

Original creator and cofounder of Holistics, a data platform for tech companies. Holistics’ customers are tech startups like Grab, Traveloka, ShopBack, 99co, Tech In Asia and alike.

Read More