Building a model for event data as a graph – Snowplow


#1

In recent months we’ve been busy expanding the variety of storage targets available for Snowplow users to load Snowplow enriched events. We recently launched our Snowflake Loader, and work is underway to add support for Google’s BigQuery. Thinking even further ahead, one intriguing option is to add a graph-based storage target for Snowplow.


This is a companion discussion topic for the original entry at https://snowplowanalytics.com/blog/2018/03/26/building-a-model-for-atomic-event-data-as-a-graph/

#2

Thanks to @dilyan for researching and writing this amazing post. We’ve been seeing graph databases becoming more popular, but I’d love to know what your experiences or impressions might be.

  • Have you considered using a graph database?
  • Are you interested but the barrier of entry seems too high?
  • Do they seem useful but not for your business case?

Let us know what you think, and make sure to subscribe to our email list (if you haven’t already) to be notified as Dilyan’s series continues.


#3

Some thoughts, based on experience:

  • the atomic event approach, although easy to explain, is almost unusable in a production environment (with 1.5+ million events/day, the time to build graph edges is unacceptable for a 1-2 week time window)
  • the approach with aggregation makes more sense, but in many cases it requires different graph structures for different analyses
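
To illustrate the aggregation idea above, here is a minimal sketch in plain Python rather than in any particular graph database. The event tuples, page names, and the `aggregate_edges` helper are all hypothetical; real Snowplow events carry many more fields. It collapses per-user page-to-page transitions into weighted edges, so the graph holds one edge per distinct transition instead of one per event.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Hypothetical, simplified events: (user_id, timestamp, page).
events = [
    ("u1", 1, "home"), ("u1", 2, "pricing"), ("u1", 3, "signup"),
    ("u2", 1, "home"), ("u2", 2, "pricing"),
    ("u3", 1, "home"), ("u3", 2, "blog"), ("u3", 3, "pricing"),
]

def aggregate_edges(events):
    """Collapse per-event transitions into weighted page -> page edges."""
    edges = Counter()
    # Sort by user, then by time, so each user's pages are in visit order.
    ordered = sorted(events, key=itemgetter(0, 1))
    for _, user_events in groupby(ordered, key=itemgetter(0)):
        pages = [page for _, _, page in user_events]
        # Each consecutive pair of pages is one transition (edge).
        edges.update(zip(pages, pages[1:]))
    return edges

edges = aggregate_edges(events)
for (source, target), weight in sorted(edges.items()):
    print(f"{source} -> {target}: {weight}")
```

The trade-off is the one noted in the second bullet: this edge structure answers "which transitions are common?", while a different question (for example, time between events) would need a different aggregation.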

BTW: Don’t even think of calling your graph model a Markov chain ;-)


#4

We had thought about using a graph database, but where we struggled was the architectural purpose: would this be used as a permanent data store for analysis? Or would it be used in a production app, with almost ephemeral queries/requests, for something like a recommendation engine (which is one of the most common uses of a graph database)?

We never really got to the bottom of it, and learning the languages needed to manipulate the graph data was too much of an overhead in the end.


#5

@grzegorzewald Have you tried smaller batches? Perhaps daily? And was that in Neo4j?

Is the Markov fanbase touchy about the legacy then? :wink:


#6

That’s a great question @jrpeck1989, and one we’re hoping to answer as part of this effort.


#7

Interesting point about the language overhead. I like that Cypher is SQL-inspired, so at least you have a little common ground, but because it’s specific to Neo4j, I can see why it’s a significant time and energy investment. Amazon is saying that Neptune will use “the popular graph query languages Apache TinkerPop Gremlin and W3C’s SPARQL”, so maybe there’s more utility and longevity in working with one of those?


#8

With any new/different technology there’s an initial overhead to getting going with it.

From what I’ve come across most of the graph databases’ languages themselves are relatively intuitive and easy to pick up, but there’s definitely an overhead to how you think about, shape and use the data compared to a columnar database like Redshift.

IMO that’s the real challenge as opposed to the language itself - much like there’s an initial overhead to learning to use SQL effectively if you’ve only ever used R (eg: “Why can’t I just swap the rows and columns?”), or vice-versa.


#9

Oh absolutely, there’s an inevitable overhead and learning curve, but you should be able to see the growth in productivity and value that comes as you become more familiar with something like this.

The difficulty when it comes to learning a language like Cypher is the completely different frame of mind you need to address a problem or a question. As you say, when your mind is fixed in a certain way of thinking (SQL, for instance), then addressing something in a graph takes time. Having to think of things in terms of relationships rather than attributes in a row is a shift, and like I say, in the end we never saw the value that could be added (given the time it would take to get there) over querying Redshift with SQL.


#10

@dilyan

Have you tried smaller batches? Perhaps daily?

Of course, but it does not make sense. Websites/visits have longer penetration and retention periods, so smaller batches do not make sense for this particular application.

And was that in Neo4j?

Yes sir:-)

Is the Markov fanbase touchy about the legacy then?

For me, definitely. It is like talking about white noise when you are sure you have color… And then you design a Kalman filter pretending there is no color ;-)


#11

Definitely, I can see that perspective.

There are some clear use cases where the benefit is quite obvious - for example, if you’re interested in mapping out the user journey and some kind of non-simple path analysis, that’s a huge pain in a columnar structure, but graph makes it quite manageable (especially if there’s a non-linear path through your product - eg a signup flow is usually a manageable a>b>c>d, but if you’ve got a path that’s a>b/c/d/e>b/c/d/e/f>etc. then good luck modeling it in traditional SQL).
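
As a rough illustration of the branching-path case, here is a minimal sketch in plain Python, not tied to any graph database or query language. The `graph` dictionary and the `simple_paths` helper are hypothetical; the node names only loosely mirror the a>b/c/d... example. It enumerates every route from the first step to the last without revisiting a step, which is the kind of traversal that is awkward to express as a fixed number of SQL self-joins.

```python
# Hypothetical branching flow: after "a", users can move between "b" and "c"
# in either order before reaching "d" and finishing at "f".
graph = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["b", "d"],
    "d": ["f"],
    "f": [],
}

def simple_paths(graph, start, goal, path=None):
    """Yield every path from start to goal that visits each node at most once."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for next_node in graph.get(start, []):
        if next_node not in path:  # skip nodes already on this path
            yield from simple_paths(graph, next_node, goal, path)

paths = list(simple_paths(graph, "a", "f"))
for p in paths:
    print(" > ".join(p))
```

In a graph database the same question would typically be a single variable-length path query; the depth-first search here just makes the traversal explicit.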

It’s interesting because there’s quite a lot of interest in graph DBs without widespread uptake yet (or at least without many people shouting from the rafters about their use cases). My hunch is that this is down to the fact that the first thing you do isn’t usually the use case for graph (the example above is something you wouldn’t want to do until you know a lot of other things about your product). I often wonder if that’s down to graph being more suited to late-stage analytics, or people’s experience/comfort level being more aligned to columnar.


#12

@grzegorzewald We’re definitely interested in exploring this from both ends, the atomic level as well as modelled data. The first use case is where we might run into performance issues, as you note.

With 1.5m events per day, if you were to write new data to the DB every 5 mins, what do you think the implications would be? The end goal for that would be to store the atomic.events data as a graph in its entirety – to have that be your source of truth. What are your thoughts on this?


#13

Hi @dilyan,

The issue in my case is that I have a complex web app rather than a web page/portal. Focusing on a higher level of abstraction makes more sense in my case.

I am not thinking about real-time graph building at the moment. I believe I could kill my n4j quite fast - currently I use Elasticsearch with 3 days of data to debug production data (5-node cluster) and it is not performing very well :wink:

As I am not thinking about a recommendation machine (at least for now), I have no reason to create real-time graphs.