A number of prospective Snowplow users are interested in evaluating Snowplow against Google Analytics 360, and in particular, the BigQuery integration. As part of the Google Analytics 360 proposition, Google will make available event-level (they call it ‘hit-level’) data in your own BigQuery data warehouse. This is an alternative to Snowplow, for users who are interested in owning their own event-level data, and performing their own sophisticated analytics on that data.
We have a surprisingly large number of users who have both Snowplow and Google Analytics 360. I find this surprising, because Google Analytics 360 is a very expensive product. However, I’ll put my surprise to one side and ask those users to please contribute to this thread, to help validate the differences, pros and cons that we identify between the two solutions.
Out-of-the-box data points available
The Google Analytics 360 BigQuery table definition can be found here.
The Snowplow Redshift table definitions can be found here (for the main table definition) and here (for the additional tables that you’d add depending on your setup, which 3rd party integrations you use and what trackers and tracker settings you emply).
There are a large number of out-of-the-box field and event types that are available via Snowplow trackers that are not available in Google Analytics
- IP addresses of the end user do not appear to be surfaced / available in GA 360 / BigQuery
- Link click events i.e. automatic recording of any hyperlinks clicked
- Page ping i.e. heartbeat events. These include data points on how far down and across a web page a user has scrolled, so you can compute how much content is read and how long users spend on specific content items. This can be used to generate heat maps, for example.
- Performance timing data i.e. data on how fast web pages take to load.
- When tracking events from e.g. Android or iOS SDKs Snowplow will automatically grab the following device identifiers (where available), none of which appear to be available in GA 360 BigQuery:
- IDFA (Open, Apple and Android)
- IDFV (Apple)
Google has a small number of very high value integrations with their own ad-related products
Google has very nice out-of-the-box integrations between Analytics and it’s different ad-buying products.
Whilst our partners at Poplin Data are working on plug-and-play Adwords and DFP integrations with Snowplow, it is doubtful they will be as simple to setup and use as the Google supported integrations between their own products.
Snowplow has a large number of integrations with third party marketing and data providers, that Google lacks
Snowplow has a large number of integrations with the following third party providers, for example:
- Adjust for mobile app attribution
- Augur for machine-fingerprinting
- AWS Cloudfront for log ingestion
- CallRail for call tracking
- Clearbit for enriching user information
- Mailchimp for email marketing data
- Mandrill for transactional email data
- Optimizely for A/B testing
- OpenWeatherMap for weather data
- PagerDuty for devops alerts
- Pingdom for uptime alerts
- SendGrid for transactional email tracking
- Urban Airship for mobile marketing
Approach to defining your own event and entity types
Both Snowplow and Google believe in ‘general event tracking’ i.e. tracking digital events across platforms and channels, not just your standard web and mobile events. In Google’s case this drives a lot of their messaging around “Universal Analytics”.
If you are going to track different types of events from different platforms or devices, you need the ability to schema them. That’s because the structure of an “play video” event is going to look totally different (with totally different fields and associated types) to a temperature reading event, flight check in event or an event in a massive multiplayer online game.
Google’s approach to flexible schemaing: custom events + a long list of custom dimensions + custom measures
Google Analytics has a fixed schema: every company running Google Analytics, be it a travel company, bank, massive multiplayer online game, social network etc. will have fundamentally the same single table with the same columns. This is not surprising because every GA user shares the same data pipeline.
To give users some flexibility, Google provides a large number (200) of custom dimension and (200) custom measures that an end user can assign to different data points. In addition, they support a generic custom event with four generic fields (category, action, label, and value).
Snowplow’s approach: define your own events and entity (context) types
Snowplow users can define their own event types. For each event, they can define as many fields as they’d like.
Further, they can define their own entity / context types. Entities / contexts are things that are tracked across different events (e.g. users, products, videos, articles, bank accounts, reviews etc.)
Because events and entity schemas are versioned, and the data itself is labelled with the appropriate version, Snowplow users can evolve their data schemas over time. In contrast, it is very hard in GA to change the assignation of different custom dimensions and measures without causing a huge amount of downstream analytics pain.
Modelled vs not-modelled (atomic) data
Snowplow data is atomic (i.e. not modelled). It is up to each Snowplow user to model their own data according to their own business logic.
In contrast, the data that GA delivers into BigQuery is already modeled i.e. Google has already decided which user performed each action and what session each action belonged to.
There are pros and cons of both approaches. The GA approach delivers a data set that’s simpler to get started with, because no modeling is required. However, most Snowplow users find that an enormous amount of value is created by being able to define their own modelling process (e.g. their own logic for identity stitching, and aggregating over sequences of events).
Whilst this should be possible with GA data in BigQuery, in practice I’ve not met anyone who has e.g. recomputed on their entire data set with a new identity stitching or sessionization algorithm. So it may be practically very difficult. I’d love to hear from people who have done this.
My understanding (correct me if I’m wrong) is it takes 24 hours + for GA data to hit BigQuery. (A subset of aggregate data is available via a separate real-time reporting API.
In contrast, Snowplow users can run their pipelines hourly, loading data into Redshift with a 2 hour latency. Users of the Snowplow real-time pipeline have the full event stream available to consume in Kinesis and Elastic in seconds.
BigQuery vs Redshift
Snowplow does not currently support loading data directly into BigQuery. (Although this is on our roadmap.) As a result, most Snowplow users do most of their analysis on Snowplow data in Redshift. (Although Spark on the data in S3, and Spark Streaming / AWS Lambda / Kinesis Analytics on the Kinesis streams are also increasingly popular ways to consume Snowplow data.)
BigQuery has a lot of great features that Redshift lacks - in particular elastic compute.
Both technologies struggle when it comes to big-table to big-table joins. The approach for working around this issue is different with both:
- With BigQuery users are encouraged to nest data in single tables. The GA data structure reflects this
- With Redshift big table to big table joins can be made performant by intelligent use of
I suspect this is what makes remodelled event data in BigQuery quite difficult but don’t have enough experience to date to know one way or the other.
What have we missed? What have we got wrong?
I’d love to have our users who are use both GA 360 and Snowplow in anger help us build out this working document. Feedback / comments etc. very welcome!