Modelling page view events as a graph

In the previous post in this series we started exploring options for modelling event data as a graph in general. We looked at three ways of modelling atomic event data.


This is a companion discussion topic for the original entry at https://snowplowanalytics.com/blog/2018/08/13/modelling-page-view-events-as-a-graph/

At Dripit we started out as a behavioral analytics company. Our first hypothesis was that there must be patterns which can be picked up in visitor data, and of course we thought that a graph representation could be an interesting starting point. The result was a really messy and slow representation. Once we did some data preparation and actually picked out sequential milestones, we were able to build a model which could predict, in real time, the likelihood of conversion and the context of a visit. At the time there were just a couple of graph databases around, and we ended up using HBase/Redis to store the behavioral data.
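
To make that concrete, here is a minimal sketch of how sequential milestones can be kept in Redis for real-time scoring. The key layout and helper names are illustrative only, not our actual schema:

```python
import json
import time

import redis  # assumes the redis-py client library

r = redis.Redis()

def record_milestone(visitor_id: str, milestone: str) -> None:
    # Append the milestone to the visitor's journey, preserving order.
    event = json.dumps({"milestone": milestone, "ts": time.time()})
    r.rpush(f"journey:{visitor_id}", event)
    # Keep journeys around for a day so stale sessions expire on their own.
    r.expire(f"journey:{visitor_id}", 60 * 60 * 24)

def journey_so_far(visitor_id: str) -> list:
    # Read back the ordered milestone sequence for real-time scoring.
    return [json.loads(e) for e in r.lrange(f"journey:{visitor_id}", 0, -1)]
```

A predictive model can then score `journey_so_far(visitor_id)` on every new event, which is what made the real-time part cheap for us compared to traversing a graph.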

In conclusion: in our case it looked like a graph would be the perfect solution, but it was much easier to use a simpler data model and NoSQL databases to solve our problem. It was a good engineering exercise, nevertheless!


That is a great observation @ernest! And it’s something we’ve been thinking about as well.

When you say you end up with a ‘messy’ graph, do you mean aesthetically or does it have performance implications as well? In the experiment described in this post, I could very quickly see that extreme denormalisation – while ensuring you cater for the large majority of use cases – results in a graph whose visual representation is unintelligible. That is why there are not a lot of pictures in this post; and in the one that I included, I had to dramatically cut down the number of represented nodes.

But I wonder: is that messiness superficial, or does it have implications for the end analysis?

We’ve definitely considered narrower use cases, with prepped data; future posts will expand on those.

Regarding the messy graph: in our heads we had the perception that there could be some sort of directed graph of how people move towards the “conversion” event, something like a Sankey diagram. But in reality there were very few overlapping tracks. There are tons of unique ways people get to that one event. And this has implications for performance and for the value of the analysis as well. You can see what it looks like in GA path analysis. At first you are pumped (oh boy, oh boy, path analysis!). Then you see it and understand that there are people who exit a page towards pages that other people have arrived from.

We had to come up with a meta journey and aggregate the nodes. For example, rather than considering each unique page, we categorised them into product pages (with parameters like price and time on page), category pages, and info pages. Now the graph actually started to look like something, but at this point we also saw that we could use a flat representation, which is more suitable for predictive models.
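
For illustration, the aggregation step might look something like this. The URL rules and feature names are hypothetical, just to show the idea of collapsing unique pages into categories and then into a flat feature vector:

```python
import re

# Illustrative URL-to-category rules; real rules were site-specific.
CATEGORY_RULES = [
    (re.compile(r"^/product/"), "product"),
    (re.compile(r"^/category/"), "category"),
    (re.compile(r"^/(about|faq|shipping)"), "info"),
]

def categorise(path: str) -> str:
    # Map a unique page path onto one of a few meta-journey node types.
    for pattern, category in CATEGORY_RULES:
        if pattern.match(path):
            return category
    return "other"

def flatten_journey(page_views: list) -> dict:
    # Collapse a sequence of (path, seconds_on_page) pairs into flat
    # features that a predictive model can consume directly.
    features = {"product": 0, "category": 0, "info": 0, "other": 0,
                "total_time": 0.0}
    for path, seconds in page_views:
        features[categorise(path)] += 1
        features["total_time"] += seconds
    return features

# Example: three page views become one flat feature vector.
views = [("/category/shoes", 12.0), ("/product/red-sneaker", 45.5),
         ("/about", 5.0)]
print(flatten_journey(views))
# {'product': 1, 'category': 1, 'info': 1, 'other': 0, 'total_time': 62.5}
```

Once pages are bucketed like this, the journey graph has far fewer node types, and the same bucketing yields the flat representation we ended up feeding to the models.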