Page_id crosses sessions and over inflates time

Hello,

We found something odd when trying to calculate time spent. It came about when calculating the temporal length of a page_id, from its page_view event to the max derived_tstamp for all subsequent events within that page_id.

Scenario: A visitor comes to our site for the first time, views 1 page and then starts their work day. 3 hrs later, at lunch, they click around the existing page (filters or something like that) but they don’t refresh the page. What you get in this scenario is 2 sessionidx values with 1 page_view_id crossing the sessions, and an engaged time of 10,800 seconds as per the Snowplow models.

Why is it that once the _SP session Id ends the page_id does not reset until the page is manually refreshed?

Outside of the persistent domain_id cookie, should the tracker not dump all current event IDs on _sp session end? No existing event ID should exist outside the session.

The issue is that the page_id then crosses multiple sessions if it is kept. Example below; I haven’t cherry-picked, as it’s not a small issue.

Screenshot 2021-07-21 at 13.43.56

Due to this, engaged time is very much artificially inflated. Then, when applying the SP models, the roll-ups to sessions and users are compounded/multiplied, as 1 page (and its engaged time) crosses many sessions.

We could spend time rewriting the models and putting rules in place: that no page_view can exceed the session length of 30 mins, and that a page_id belongs to the first session it was seen in and not to subsequent sessions. But what to do with the following interactions is the issue. Do we manually rekey the ID? Sounds messy.

I feel that the tracker should probably end the page_id on session end; then, if a new session is created on the same page, the page event ID should change too. I do understand this would create a new page_view event, or possibly need to be handled by a new event type, I don’t know, something like a page_view resume event.

Any thoughts/suggestions are very much appreciated.

Thanks
Kyle

Hi Kyle,

I believe that what you describe is related to this existing issue in the javascript tracker repo. Do feel free to drop your thoughts in there.

I think it’s an interesting one and actually your suggestion is along the lines of one of the possible solutions I think - but I think it’s an issue that needs a bit of thought as it could have a very wide impact.

Now, what you describe as happening in the web model seems very different to what we’ve seen before (and what we see in our internal testing, which includes test cases for this issue)… So I suspect there’s something else fishy afoot here.

Conceptually, the engaged time metric in our standard models isn’t the difference between timestamps; it’s based on a count of page pings. So in order to see results like this you would need to be counting many, many page ping events, since each stray ping can only make a difference of 1*{heartbeat}.

Which web model are you referring to? Is it one of the latest and greatest from this repo? Have you made any changes to the source SQL in the model?

Hey @Colm

I’ll drop my thoughts in there, thanks. I understand it has a wide impact, but I’m an absolute firm believer that no existing event ID created in the session should ever exist outside the session; it breaks hierarchy. That said, I’m not complaining, just super invested in the technology, and I think it’s fantastic.

So the first thing to be aware of when it comes to pings: I aggregated them, as per the Snowplow blog post, and used the Beacon API. It really came down to record volume. 1 page_view creates 24 pings on average, with repeating information that doesn’t change at the page level (UserAgent etc.), which was just more overhead than I wanted. Storage is cheap, but I wanted to manage the Kinesis cost. I’d love a non-aggregate ping lite method; I’ll add my thoughts on that to the repo.

A bit of a side tangent: I’d like to say that, on the face of it, for most implementers page_views, pings and Kinesis cost aren’t really a concern, and pings without aggregation are the way to go with Snowplow. It’s just that in my personal use case our global page_view footprint is a tad on the side of ridiculous, so it really adds up.

So with the Beacon API, unfortunately the trade-off is that I’m not always going to have the unload event (around 25% of the time). So we started modifying some of the models down the pseudo lines of…

SELECT PAGE_VIEW_ID, DATEDIFF(second, MIN(DERIVED_TSTAMP), MAX(DERIVED_TSTAMP))....
FROM SNOWPLOW_ATOMIC..abc..
WHERE DERIVED_TSTAMP > DATEADD('month', -1, DATE_TRUNC('month', CURRENT_TIMESTAMP()))
AND EVENT_NAME IN ('page_view', 'page_unload', 'event')
.....etc
;

In essence, look for the last event derived_tstamp for the page_view event ID to calculate time in the absence of unload. That ultimately brought me to making this post, as I then saw the page_id cross the session barrier, carried by other event IDs. The Snowflake page_views model will be OK, as I had to cap, but it’s the Sessions that has me worried now, as that page_id can sit in many sessions with >2000 seconds, for example. Fundamentally, unless the tracker gets updated at some point down the road, we are in the position of needing to add our own logic into the models to get around this, but it’s the nature of such things.

Thanks
Kyle

Definitely agree with this - but page_view_id isn’t an event_id per se, it’s a web_page_id, which is session agnostic.

This id is really just an identifier for an instance of a page as long as that page is open - I think the current behaviour is somewhat desirable as it seems more breaking to generate a new id for a page which in fact is the same page, just in a different session (as it muddies the state of events tied to that page).

I don’t think it would be overly difficult to add a method to reset this, but that feels like it should be opt-in, or a change in the data model to accommodate engaged time across the same page but for multiple sessions.

I’m with @mike here. I think there are two routes to take, but the current behaviour is correct, as it is the same page view.

  1. Add some extra context from the JS tracker, so you can tell this is the second/third/fourth session for a page view. This is a reasonable change and will need designing with the existing content and atomic event data in mind.
  2. Update the data model to find page views with multiple sessions. We assume it’s a one-to-many relationship between sessions and page views, but in reality it’s many-to-many. It’s a lot more complicated to model that, however.

I don’t think I have much to add to what Mike and Paul have mentioned re: the tracker, so I’ll focus on a practical approach to your modeling.

I can see two ways to work around the problem you’ve outlined:

  1. Adjust your query to group by session_id when you perform your DATEDIFF for engaged time, then later make sure the correct row is used when you join it to your final table (I would probably perform this aggregation separately to others due to increased risk of duplicates). Note that I would name this field absolute_time - since the engaged time metric only registers time spent active on the page (We have both absolute time and engaged time in the latest models). This will make sure that ‘stray’ pings will be disregarded.

  2. Adjust the ping aggregation along the following lines:

  • Page Ping callback updates an object which counts distinct pings triggered
  • Global context (or manually tracked context) is attached to all relevant events with the latest count of pings. This would mean that each event has a context which tells you the most up to date ping count when that event happened. Let’s call that ping_count
  • A similar query is used in modeling, except rather than difference between timestamps, the metric you now use is heartbeat*MAX(ping_count) (assuming that heartbeat == minimumVisitLength)

This way, you can count stray pings on the original page view if you like, or you can combine both.
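
A rough sketch of the modeling side of option 2, in Snowflake-style SQL. Everything here is an assumption for illustration: the ping_count column (standing in for the value carried by the hypothetical context), the sample rows, and the 10-second heartbeat.

```sql
-- Sketch only: inline sample rows stand in for events that carry a
-- hypothetical ping_count context. Names and values are assumptions.
WITH events AS (
    SELECT *
    FROM (VALUES
        ('pv-1', 's-1', 'page_view',      0),
        ('pv-1', 's-1', 'filter_applied', 3),
        ('pv-1', 's-2', 'filter_applied', 5)  -- same page, later session
    ) AS t (page_view_id, domain_sessionid, event_name, ping_count)
)
SELECT
    page_view_id,
    -- heartbeat * MAX(ping_count), with an assumed heartbeat of 10 seconds
    10 * MAX(ping_count) AS engaged_time_in_s
FROM events
GROUP BY page_view_id;
```

Because ping_count only grows while the user is active, the gap between the two sessions adds nothing. Adding domain_sessionid to the GROUP BY would give the latest count reached in each session instead.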

At least in my mind this works around the issue you’re having - wdyt?

I understand the perspective, I really do, and yes, page_id is in effect a web_page_id, with event IDs hanging off it to denote happenings in the page. We should not have a situation of 6hrs engagement, because that’s the time between events on the page id.

It’s like saying, I start a computer game up, I agree to the EULA overlay on the start screen. I put down the controller and go do something for 6hrs, I pick up the controller and press X for start. I didn’t engage with the start screen for 6hrs. My engagement was solely the length of the EULA acceptance.

So to step back from it, the concern is fundamentally a temporal one - ‘What is engagement length, and how should it be calculated?’

Option 1 in Paul’s reply (can’t quote): I think this is a happy medium between dropping the page_id at session end and spending time redesigning models.

My two cents:

We should not have a situation of 6hrs engagement, because that’s the time between events on the page id.

For me, the key point here is that I don’t believe we have that situation, the feature is just not designed for subtracting timestamps as the method of calculating engaged time. Page pings only fire when the user is active on the page, and the idea is that you count the pings and multiply by the interval.

So, in the gaming example you give, pings don’t fire for 6hrs - they fire until you put the controller down, then they stop, then they re-fire when you pick it back up. You’re engaged in two different sessions, but you’re still engaged on the same level of the game.

I think you’re right here:

So to step back from it, the concern is fundamentally a temporal one - ‘What is engagement length, and how should it be calculated?’

Total engagement time for a page - as we generally define it - is minimumVisitLength + count(DISTINCT page_ping)*heartbeat.
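
To illustrate that definition, a minimal Snowflake-style sketch. The sample rows, the column names and the 10-second values for both minimumVisitLength and heartbeat are assumptions for illustration; this is not the SQL of the standard models.

```sql
-- Sketch only: engaged time from a count of pings, not from timestamps.
WITH events AS (
    SELECT *
    FROM (VALUES
        ('pv-1', 'e-1', 'page_view', '2021-07-21 09:00:00'),
        ('pv-1', 'e-2', 'page_ping', '2021-07-21 09:00:10'),
        ('pv-1', 'e-3', 'page_ping', '2021-07-21 09:00:20'),
        ('pv-1', 'e-4', 'page_ping', '2021-07-21 15:00:10')  -- stray ping, 6 hrs later
    ) AS t (page_view_id, event_id, event_name, tstamp)
)
SELECT
    page_view_id,
    -- minimumVisitLength + COUNT(DISTINCT page_ping) * heartbeat (both assumed 10s)
    10 + 10 * COUNT(DISTINCT CASE WHEN event_name = 'page_ping' THEN event_id END)
        AS engaged_time_in_s
FROM events
GROUP BY page_view_id;
```

With these rows the result is 10 + 3 * 10 = 40 seconds: the stray ping six hours later adds one heartbeat, not six hours.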

If you want to use absolute difference between timestamps though, I think it’s still workable, you just have to aggregate by session ID as well as page view ID, and then also use session_id later when you join.
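
A sketch of what that aggregation might look like in Snowflake-style SQL. The sample rows and column names (page_view_id, domain_sessionid, derived_tstamp as flattened columns) are assumptions for illustration; adapt them to however the page view ID is extracted in your model.

```sql
-- Sketch only: absolute time per (page view, session) pair.
WITH events AS (
    SELECT
        page_view_id,
        domain_sessionid,
        TO_TIMESTAMP(tstamp) AS derived_tstamp
    FROM (VALUES
        ('pv-1', 's-1', '2021-07-21 09:00:00'),
        ('pv-1', 's-1', '2021-07-21 09:02:00'),
        ('pv-1', 's-2', '2021-07-21 12:00:00'),
        ('pv-1', 's-2', '2021-07-21 12:01:30')
    ) AS t (page_view_id, domain_sessionid, tstamp)
)
SELECT
    page_view_id,
    domain_sessionid,
    DATEDIFF(second, MIN(derived_tstamp), MAX(derived_tstamp)) AS absolute_time_in_s
FROM events
GROUP BY page_view_id, domain_sessionid;  -- join later on both keys
```

With these rows the page view gets 120 seconds in s-1 and 90 seconds in s-2, rather than the 10,890 seconds a GROUP BY on page_view_id alone would give.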

(Edit to add: As we define it, ‘engaged time’ is specifically: ‘time the user spent while active on the page’)

Hey @Colm

I did this back at the start, so I don’t get individual pings, only the aggregate on the unload event. So using pings for engagement does work, but only if I get the unload event, which is what had me go looking for the max event derived_tstamp to figure it out. So the game example would apply if I don’t get the unload of the ping.

This is probably the solution we’ll go for or I might undo ping aggregation, but I need to look at the impact.

Sure, I follow you - ping aggregation is a less well-established pattern, so I do appreciate that we’re still figuring out the most workable ways to implement it etc., and that’s a bit of a bumpy road in this case.

And for sure I do appreciate why you came to this solution - my intention isn’t to criticise that decision. :)

I think attaching a context to all events with minimal aggregated information is a good way to solve that problem (as I outlined in option 2).

That way you could use the data from an unload event if you have one, but otherwise use the latest event you do have.

I’m sure there are other ways to skin the cat though!

PS. This is an interesting discussion! Very much appreciate you offering your perspective and thoughts!

Edit to add:

It also just occurred to me that a potentially nice side-effect of aggregating pings into a global context is that important events can then also be analysed against time in a granular way. For example one might look at engaged time up to an ‘add to cart’, and whether that has an impact on propensity to purchase.

I thought of two other potential low-friction options to achieve this aggregation.

  1. Similar to the suggestion by @Colm of grouping by session ID - you could create a new column (page_view_session_id), which would be HASH(session_id + web_page_id). That would give you the ability to aggregate these pings without losing the functionality of the current behaviour.

  2. To @PaulBoocock’s suggestion - you could add a plugin that does this (using the new, splendid v3 tracker) or, as an alternative, gives you the option to regenerate web_page_id or add a page_view_id when a new session begins. This would then give you the best of both worlds, and users could easily opt into this behaviour by adding it to the tracker.
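
For the first of these options, a minimal Snowflake-style sketch of the derived key. Note the assumptions: MD5 over the two IDs joined with a delimiter stands in for the HASH(session_id + web_page_id) idea, and the sample rows and column names are made up for illustration.

```sql
-- Sketch only: a composite key so one page view can be aggregated per session.
WITH events AS (
    SELECT *
    FROM (VALUES
        ('pv-1', 's-1', 'page_view'),
        ('pv-1', 's-1', 'page_ping'),
        ('pv-1', 's-2', 'page_ping')  -- same page, new session
    ) AS t (page_view_id, domain_sessionid, event_name)
)
SELECT
    -- the delimiter avoids collisions between different ID pairs
    MD5(domain_sessionid || '|' || page_view_id) AS page_view_session_id,
    page_view_id,
    domain_sessionid,
    COUNT(*) AS event_count
FROM events
GROUP BY page_view_id, domain_sessionid;
```

This keeps web_page_id untouched, so anything relying on the current behaviour still works, while the new key gives one row per page view per session.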
