Creating a Journey Map for Visitor Cohorts--a recipe


#1

Summary
Based off of this original Snowplow team recipe, http://snowplowanalytics.com/guides/recipes/catalog-analytics/measuring-and-comparing-content-page-performance.html, which plots out a journey map for a single user, I’ve created a similar visualization that illustrates an ‘average journey’ for a cohort of visitors. In my example, I’ve filtered the data so that only data from a group of people that purchased a certain product on their first visit was analyzed. Feel free to reuse and tweak as you like!

Sample Visualization

Understanding the Visualization:
Here, each URL is represented by a row (y-axis) on the Gantt chart. The start of each bar represents the average start time that the cohort began viewing this URL (in seconds) from the time their session started. The X axis represents the amount of seconds that has passed since the session start. The opacity of each bar represents the number of users that traversed that URL path. That is, bars that are darker were more heavily traversed than URLs with lighter bars.

Methodology:
Using Redshift, I identified the cohort of users of interest based on the products they purchased, and by using the session start time joined up the pageview and ping events for all of the URLs that this cohort viewed. In addition, I also calculate the number of cohort users visited a specific URL path. After an export into R, I do some simple data cleaning and create the visualization that’s shown above.

Redshift SQL:

--The following query calcualtes the average start and end times for pageviews and ping events for URLs of a specifically defined user cohort
--In the case when a user has many visits, the time between those visits is effectively removed.  That is, if there was a 1 hour gap between session 1 and session 2, the last event of session 1 will appear directly before the first event of session 2, without a gap
select
  userSessionURLTotals.domain_sessionidx,
  lower(userSessionURLTotals.page_urlpath) as page_urlpath,
      --USE this in place of calculated page_urlpath above if you'd like to only consider html files and not paths  substring(userSessionURLTotals.page_urlpath, (len(userSessionURLTotals.page_urlpath))- position('/' in reverse(lower(userSessionURLTotals.page_urlpath))) + 1,len(userSessionURLTotals.page_urlpath)) as page_urlpath4,
  avg(userSessionURLTotals.AdjmintimeFromSessionStart) as avgMinTimeFromStartSession,
  avg(userSessionURLTotals.AdjmaxtimeFromSessionStart) as avgMinTimeFromEndSession,
  avg(AdjTotalTime) as avgTotalTime,
  count(distinct userSessionURLTotals.domain_userid) as urlUserCount,
  totalUserCount,
  cast(count(distinct userSessionURLTotals.domain_userid)as float) / cast(totalUserCount as float) as userRatio
from
(
     select
        sessions.domain_userid,
        sessions.domain_sessionidx,
        userEnteranceStartTime,
        sessions.sessionStartTime,
        sessions.sessionEndTime,
        sessions.totalSessionTime,
        previousSessions.runTotal,
        urls.page_urlpath,
        urls.mintstamp,
        urls.maxtstamp,
        urls.totaltime,
        datediff(second, sessionStartTime, urls.mintstamp) mintimeFromSessionStart,
        datediff(second, sessionStartTime, urls.maxtstamp) maxtimeFromSessionStart,
        datediff(second, sessionStartTime, urls.mintstamp) + case when previousSessions.runTotal is null then 0 else previousSessions.runTotal end as AdjmintimeFromSessionStart,
        datediff(second, sessionStartTime, urls.maxtstamp) + case when previousSessions.runTotal is null then 0 else previousSessions.runTotal end as  AdjmaxtimeFromSessionStart,
        (datediff(second, sessionStartTime, urls.maxtstamp) + case when previousSessions.runTotal is null then 0 else previousSessions.runTotal end) - (datediff(second, sessionStartTime, urls.mintstamp) + case when previousSessions.runTotal is null then 0 else previousSessions.runTotal end) as AdjTotalTime,
        datediff(second, userEnteranceStartTime, urls.mintstamp) mintimeFromUserEnter,
        datediff(second, userEnteranceStartTime, urls.maxtstamp) maxtimeFromUserEnter,
        custCohort.totalUserCount
    from 
    (
        select 
          user_id,
          domain_userid,
          purchaseTstamp,
          UserSeqOrderNumber,
          domain_sessionidx,
          sum(count(distinct domain_userid)) over () as totalUserCount
        from
        (
             select 
                user_id,
                domain_userid,
                tr_orderid,
                domain_sessionidx,
                collector_tstamp as purchaseTstamp,
                rank() over (partition by user_id order by collector_tstamp) as UserSeqOrderNumber
              from atomic.events
              where
                event_name='transaction'
                and user_id is not null
                and tr_orderid is not null
                --and domain_sessionidx = 1 --uncomment this line if you'd like to filter for a specific number of visits
              group by
                user_id,
                domain_userid,
                tr_orderid,
                domain_sessionidx,
                collector_tstamp
        ) CustBase
        join 
        (--selects all items that were a part of the order
               select
                ti_orderid,
                LISTAGG(ti_sku,';') as SKUsPurchased,
                LISTAGG(ti_quantity,';') as SKUQuantities,
                LISTAGG(ti_price,';') as SKUListPrices,
                LISTAGG(replace(ti_name,',',''),';') as SKUNames,
                sum(ti_quantity) as totalQuantity,
                sum(ti_price*ti_quantity) as totalSKUsPrice
              from
              (
                select *
                from atomic.events e
                where event_name='transaction_item'
                order by ti_sku
              )

              group by
                ti_orderid
        )items on items.ti_orderid = CustBase.tr_orderid
        where 
           UserSeqOrderNumber =1 --limiting users to their first purchase only
        --FILTERING FOR PRODUCT to define COHORT
        and
        (
          SKUsPurchased like '%PART_NUM1%' 
          or SKUsPurchased like '%PART_NUM2%' 
          --add additional product filters when necessary
          )
    group by
      user_id,
      domain_userid,
      purchaseTstamp,
      UserSeqOrderNumber,
      domain_sessionidx
    )custCohort
    join
    --get the session information
    (
      select distinct
        domain_userid,
        domain_sessionidx,
        min(collector_tstamp) over (partition by domain_userid, domain_sessionidx ) as sessionStartTime,
        max(collector_tstamp) over (partition by domain_userid, domain_sessionidx ) as sessionEndTime,
        min(collector_tstamp) over (partition by domain_userid ) as userEnteranceStartTime,
        datediff(second,   min(collector_tstamp) over (partition by domain_userid, domain_sessionidx) ,  max(collector_tstamp) over (partition by domain_userid, domain_sessionidx )) totalSessionTime
      FROM "atomic".events
    )sessions on sessions.domain_userid = custCohort.domain_userid and sessions.domain_sessionidx <= custCohort.domain_sessionidx
    left join
    (
    --getting previous session's time in seconds
        select distinct
          domain_userid,
          domain_sessionidx as PreviousSessionID,
          domain_sessionidx +1 as nextSessionIDX,
          sum(totalSessionTime) over (partition by domain_userid order by domain_sessionidx ROWS UNBOUNDED PRECEDING  ) runTotal
      from
      (
      select distinct
          domain_userid,
          domain_sessionidx,
          min(collector_tstamp) over (partition by domain_userid, domain_sessionidx ) as sessionStartTime,
          max(collector_tstamp) over (partition by domain_userid, domain_sessionidx ) as sessionEndTime,
          min(collector_tstamp) over (partition by domain_userid ) as userEnteranceStartTime,
          datediff(second,   min(collector_tstamp) over (partition by domain_userid, domain_sessionidx) ,  max(collector_tstamp) over (partition by domain_userid, domain_sessionidx )) totalSessionTime
        FROM "atomic".events
      ) a
    ) previousSessions on previousSessions.domain_userid = sessions.domain_userid and previousSessions.nextSessionIDX = sessions.domain_sessionidx
    --get the URL visit and times for each URL of each session
    join
    (
         SELECT
            domain_userid,
            domain_sessionidx,
            page_urlpath,
            min(collector_tstamp) minTstamp,
            max(collector_tstamp) maxTstamp,
            datediff(second,min(collector_tstamp),  max(collector_tstamp)) totalTime
        FROM "atomic".events
        WHERE
         (event = 'page_ping' OR event = 'page_view')
         and 
        
        group by
          domain_userid,
          domain_sessionidx,
          page_urlpath
        having
          datediff(second,    min(collector_tstamp),  max(collector_tstamp)) > 0
        order by 
        domain_sessionidx,  
        min(collector_tstamp)
    )urls on urls.domain_sessionidx = sessions.domain_sessionidx and sessions.domain_userid = urls.domain_userid
    order by sessions.domain_sessionidx, mintimefromsessionstart
  )userSessionURLTotals
group by
  userSessionURLTotals.domain_sessionidx,
  lower(userSessionURLTotals.page_urlpath),
  userSessionURLTotals.totalUserCount,
  --substring(userSessionURLTotals.page_urlpath, (len(userSessionURLTotals.page_urlpath))- position('/' in reverse(lower(userSessionURLTotals.page_urlpath)))+1,len(userSessionURLTotals.page_urlpath))
having count(distinct userSessionURLTotals.domain_userid) > 1  --filtering out URLS that only have a single cohort user visiting it
order by 
  domain_sessionidx, 
  urlUserCount,
  avgMinTimeFromStartSession

R CODE:

    library(DiagrammeR)
    library(tidyr)
    library(dplyr)
    library(ggplot2)

    #creating z-scores for each time column
    rawData$scaleMin = scale(rawData$avgmintimefromstartsession)
    rawData$scaleMax = scale(rawData$avgmintimefromendsession)

    #removing outliers for start or end times (zscores +- 2 stdev)
    scaledData = subset(rawData, abs(scaleMax) <=2 & abs(scaleMin) <=2)

    #calculating alpha/opacity based on scaled data set
    scaledData$alphaRatio = scaledData$urlusercount / max(scaledData$urlusercount)

    #converting domain_sessionidx to a factor data type
    scaledData$visitCount = factor(scaledData$domain_sessionidx)


    par(bg = "transparent")

    ##plotting multiple user User Journey, using average start from session start and average end from session start metrics
    ggplot(scaledData, aes(colour=visitCount)) + 
      geom_segment(aes(x=avgmintimefromstartsession, xend=avgmintimefromendsession, y=page_urlpath, yend=page_urlpath, alpha=userratio), size=3) +
      xlab("Duration in Seconds from Session Start") + ylab("URL Path")

#2

Thanks a lot for sharing this recipe!