Sending Google Analytics events into Snowplow

As part of our drive to make the Snowplow community more collaborative and widen our network of open source contributors, we will be regularly posting our proposals for new features, apps and libraries under the Request for Comments section of our forum. You are welcome to post your own proposals too!

This Request for Comments is to allow data sent using the Google Analytics JavaScript tag to be successfully processed by the Snowplow pipeline and made available for analysis as Snowplow events and contexts.

This would enable any Google Analytics and/or Measurement Protocol user to send exactly the same HTTP(S) requests to Snowplow as to Google, so that they can:

  1. Capture and analyse their event-level data in their own data warehouse
  2. Process and act on their full event-stream in real-time

1. Why integrate Google Analytics into Snowplow?

Google Analytics is the most widely used digital analytics platform in the world. And for good reason: it’s a great product - and it’s free!

However, as all Snowplow users will be aware, there are significant limitations with Google Analytics - especially with the free product:

  • Access to your own data is mediated by Google. You can access your data via the Google Analytics UI and APIs, but there are many restrictions on what data you can fetch, in what volume and at what granularity. In addition, only a subset of data is available in real-time
  • Google Analytics applies a standard set of data processing (modeling) steps across its enormous user base; this data modeling includes sessionization and marketing attribution. These steps are not necessarily appropriate for all users
  • Google Analytics data is sampled. You can understand why Google falls back to sampling: providing a product like Google Analytics, with such an enormous user base, for free has significant cost implications. But it is a pain if you want to perform very particular analyses on very particular subsets of users, because the data becomes unreliable as the sample size drops

Many of the above reasons are motivations for Google Analytics users to set up Snowplow alongside Google Analytics. However, there is some overhead to doing this, particularly on the tracking side: for every Google tag that you create, you need to integrate a comparable Snowplow tracking tag.

By adding native support for Google Analytics and the Measurement Protocol to Snowplow, it should be straightforward for any GA user to add a single small snippet of JavaScript to their setup to push their data to Snowplow as well as GA, and thus benefit from all the opportunities that Snowplow opens up for them.

2. Existing Snowplow experience with Google Analytics

2.1 Inspired by the original GA event types

Although the Snowplow Tracker Protocol is bespoke to Snowplow, a large number of our original event types were closely modelled on equivalents found in the Google Analytics JavaScript SDK.

For example, Snowplow supports:

  • Custom structured events, with category, action, label and value fields mirroring Google Analytics events
  • Ecommerce transaction (and transaction item) tracking
  • Social interaction tracking
  • User timing tracking

All of these were closely modeled on the Google Analytics equivalents; this also means a certain level of overlap between properties in the Snowplow Tracker Protocol and the Google Analytics Measurement Protocol.
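To illustrate how close the mapping is, here is a Google Analytics custom event alongside the roughly equivalent Snowplow structured event call (assuming a standard analytics.js setup and a Snowplow JavaScript Tracker initialised with the name snowplow; the example values are made up):

// Google Analytics custom event: category, action, label, value
ga('send', 'event', 'video', 'play', 'intro.mp4', 42);

// The closely modeled Snowplow structured event:
// category, action, label, property, value
snowplow('trackStructEvent', 'video', 'play', 'intro.mp4', null, 42);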

2.2 Adding native Enhanced Ecommerce support

In 2016 we implemented support for Google Analytics’ Enhanced Ecommerce plugin in Snowplow.

Because a number of Snowplow users were coming to Snowplow from Google Analytics, having already implemented Enhanced Ecommerce, we added native support for Enhanced Ecommerce tracking to our own Snowplow JavaScript Tracker.

This allowed Google Analytics users to mirror their Enhanced Ecommerce integrations in Snowplow directly, cutting down implementation time.

3. A proposal for integrating Google Analytics events into Snowplow

3.1 On the Google Analytics side

To build this integration we can make use of Google Analytics’ support for third-party plugins.

We will build a simple open-source Google Analytics plugin, which intercepts the Measurement Protocol payloads being sent to Google Analytics, and also sends them to your Snowplow collector. We have started work on this plugin and you can follow our progress in this pull request.

Once it’s deployed, you’ll be able to leverage this plugin simply by adding the following to your existing Google Analytics setup snippet:

<script>
  /*
    ...
    Regular GA invocation code
    ...
  */
  ga('create', 'UA-XXXXX-Y', 'auto');
  ga('require', 'spGaPlugin', { endpoint: 'events.acme.net' });
  ga('send', 'pageview');
</script>
<script async src="https://d1fc8wv8zag5ca.cloudfront.net/sp-ga-plugin/0.1.0/sp-ga-plugin.js"></script>

The endpoint option is the address of your existing Snowplow collector.
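For the curious, a plugin like this can be built on the analytics.js tasks API. The sketch below is not the actual plugin code - the collector path and internal details are assumptions for illustration - but it shows the general mechanism of wrapping sendHitTask so that every Measurement Protocol payload is also sent to your Snowplow collector:

function SnowplowDuplicator(tracker, config) {
  // Hypothetical collector path - the real plugin may use a different one
  var collectorUrl = 'https://' + config.endpoint + '/com.google.analytics/v1';
  var originalSendHitTask = tracker.get('sendHitTask');

  tracker.set('sendHitTask', function (model) {
    originalSendHitTask(model);             // 1. send the hit to Google as normal
    var payload = model.get('hitPayload');  // 2. grab the raw Measurement Protocol payload
    var xhr = new XMLHttpRequest();         // 3. replay it against the Snowplow collector
    xhr.open('POST', collectorUrl, true);
    xhr.setRequestHeader('Content-Type', 'text/plain; charset=UTF-8');
    xhr.send(payload);
  });
}

// Register the constructor so that ga('require', 'spGaPlugin', ...) can find it
ga('provide', 'spGaPlugin', SnowplowDuplicator);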

3.2 On the Snowplow side

Under the hood, Snowplow is in fact broadly protocol-agnostic - alongside the Snowplow Tracker Protocol, Snowplow has integrated support for the protocols of each of its supported third-party webhooks.

To send Google Analytics events into Snowplow, we therefore need to add support for the Google Analytics Measurement Protocol into Snowplow.

Broadly this involves:

  • Defining JSON Schemas for the Measurement Protocol and associated Google Analytics entities, and hosting them in Iglu Central
  • Writing a custom adapter inside of the Snowplow Common Enrich library which can process the Google Analytics Measurement Protocol
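For reference, a Measurement Protocol payload is just a set of form-encoded key-value pairs. A pageview hit emitted by analytics.js looks roughly like this (the values here are illustrative):

v=1&_v=j67&tid=UA-XXXXX-Y&cid=1234567890.1508098273&t=pageview&dl=https%3A%2F%2Fwww.acme.net%2F&dt=Acme%20homepage&de=UTF-8&ul=en-us

The adapter’s job is to translate each of these parameters into the appropriate self-describing event or context.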

3.3 Overall architecture

Putting all of this together, we end up with a technical architecture looking like this:

4. Mapping the Google Analytics payload onto JSON Schemas

4.1 Mapping approach

Google’s Measurement Protocol is an incredibly extensive specification, representing the exhaustive list of all data points that a Google Analytics user (or direct Measurement Protocol user) can send in to the platform.

We considered three approaches to mapping all of these data points into JSON Schemas:

  1. Smallest viable entities - where we break the GA data down into a large set of tightly-defined entities
  2. Mega-model - where we create a single huge schema holding all of the data points
  3. Hybrid - in-between the first two approaches, with a handful of relatively large schemas

We discarded the hybrid approach, because we didn’t want to be responsible for interpreting or curating the Measurement Protocol; for this RFC to be successful, it is important that our Google Analytics mapping is unopinionated and doesn’t involve any “Snowplowisms”.

We then chose the “smallest viable entities” approach over the “mega-model”. Using many small, independent (if inter-connected) entities is more in line with our general thinking on instrumentation at Snowplow; it also sets us up nicely to move towards a graph representation of the data over time.

4.2 Comprehensive mapping exercise

Having decided our approach, we then compiled a Google Sheet with a row for:

  1. Every private or undocumented field that we have observed being sent by Google Analytics (e.g. _v for the SDK version number)
  2. Every field documented as part of the Measurement Protocol

We then set out to map each of those fields onto a property within a new JSON Schema that we would add to Iglu Central.

You can find this spreadsheet here:

We have configured this spreadsheet so that you can comment directly on it, if you find that more convenient than commenting on this thread.

4.3 Implementation of the smallest viable entities approach

Under our chosen approach, a single Google Analytics payload will be processed by Snowplow into a single Snowplow enriched event. This enriched event will consist of multiple self-describing JSON entities, specifically:

  1. A single self-describing event in the enriched event’s unstruct_event field. The type of self-describing event will be determined by the Google Analytics hitType
  2. Zero or more self-describing contexts, added into the array held within the enriched event’s contexts field

Let’s take the pageview hit type as an example. It will result in an event with:

  • A self-describing event based on the page_view entity
  • A list of additional self-describing contexts conforming to the schemas:
    • user
    • hit
    • system_info
    • etc
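As a rough sketch of how this could look inside the enriched event - the schema versions and property names below are illustrative, pending the final schemas in the Iglu Central pull request:

{
  "unstruct_event": {
    "schema": "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0",
    "data": {
      "schema": "iglu:com.google.analytics.measurement-protocol/page_view/jsonschema/1-0-0",
      "data": { "documentTitle": "Acme homepage" }
    }
  },
  "contexts": {
    "schema": "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1",
    "data": [
      {
        "schema": "iglu:com.google.analytics.measurement-protocol/user/jsonschema/1-0-0",
        "data": { "clientId": "1234567890.1508098273" }
      },
      {
        "schema": "iglu:com.google.analytics.measurement-protocol/system_info/jsonschema/1-0-0",
        "data": { "language": "en-us" }
      }
    ]
  }
}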

Let’s next look at some particular challenges in the mapping that we had to address.

4.4 Dealing with multi-dimensional fields

Some fields in the Measurement Protocol are “multi-dimensional”, where the field name itself is overloaded with multiple numeric indexes which precisely specify the data point being referenced. Consider the field:

il<listIndex>pi<productIndex>cm<metricIndex>

This “multi-dimensional” field name identifies a single value in the Measurement Protocol, such as:

il2pi4cm12=45

For our mapping, we will break this into four fields within a single entity:

{
  "listIndex": 2,
  "productIndex": 4,
  "customMetricIndex": 12,
  "value": 45
}
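To illustrate what this mapping involves, here is a minimal sketch of splitting such a field name into its component indexes. This is a hypothetical helper written in JavaScript purely for illustration; the real adapter lives in Snowplow Common Enrich (Scala):

// Hypothetical sketch: split a multi-dimensional Measurement Protocol field
// such as "il2pi4cm12" into its component indexes.
var MULTI_DIM_PATTERN = /^il(\d+)pi(\d+)cm(\d+)$/;

function parseProductImpressionCustomMetric(fieldName, rawValue) {
  var match = MULTI_DIM_PATTERN.exec(fieldName);
  if (match === null) {
    return null; // not an il<listIndex>pi<productIndex>cm<metricIndex> field
  }
  return {
    listIndex: parseInt(match[1], 10),
    productIndex: parseInt(match[2], 10),
    customMetricIndex: parseInt(match[3], 10),
    value: parseInt(rawValue, 10)
  };
}

// parseProductImpressionCustomMetric('il2pi4cm12', '45')
// => { listIndex: 2, productIndex: 4, customMetricIndex: 12, value: 45 }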

4.5 Dealing with currency

We’ve included the cu (currency code) parameter in all schemas that have a price field. We felt that as a practical matter, the currency should always be in the same table as the price.
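For example, a product entity carrying a price would also carry the currency (the field names here are illustrative, not final):

{
  "sku": "P12345",
  "price": 29.99,
  "currencyCode": "EUR"
}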

4.6 Schema definitions

With the mapping completed and the approach decided, the next step was to draft all of the required schemas in a branch within our Iglu Central project. Iglu Central is a central public repository of schemas deemed to be of general use to the Snowplow community (and beyond) - this should be a great home for the Google Analytics JSON Schemas.

The schemas that we drafted are as follows:

Events:

  • MP/page_view *
  • MP/event
  • MP/exception
  • MP/item
  • MP/screen_view
  • MP/social
  • MP/timing
  • MP/transaction

Contexts:

  • MP/page_view *
  • GA/private
  • GA/undocumented
  • MP/app
  • MP/content_experiment
  • MP/content_group
  • MP/custom_dimension
  • MP/custom_metric
  • MP/general
  • MP/hit
  • MP/link
  • MP/product
  • MP/product_action
  • MP/product_custom_dimension
  • MP/product_custom_metric
  • MP/product_impression
  • MP/product_impression_custom_dimension
  • MP/product_impression_custom_metric
  • MP/product_impression_list
  • MP/promotion
  • MP/promotion_action
  • MP/session
  • MP/system_info
  • MP/traffic_source
  • MP/user

* Note that page_view can be an event or a context, depending on the Google Analytics hitType.

None of these schemas have been merged into Iglu Central yet - we welcome your feedback on them! Feel free to comment directly on this pull request:

5. Integration into Snowplow

5.1 Integration principles

The next consideration is how to integrate the Google Analytics data points into Snowplow such that we:

  1. Make use of Snowplow’s own powerful features as much as possible, but also:
  2. Process the Google Analytics events as close as possible to how Google’s own systems process them

5.2 Populating fields in the Snowplow enriched event as well as the new contexts

One observation was that some parameters in the Measurement Protocol have unambiguous equivalents in the Snowplow enriched event. For example, the de or documentEncoding field in the Measurement Protocol maps directly onto the doc_charset field in Snowplow’s own enriched event.

Where these mappings are straightforward and noncontroversial, we propose populating the Google data point into the Snowplow enriched event field, as well as populating it into a dedicated context; you can see these “secondary mappings” in the blue Assignment columns on the right-hand side of the Google Sheet.

Please let us know if you disagree with any of these mappings.

5.3 Running Snowplow enrichments on the Google Analytics data

Because we are populating the fields in the Snowplow enriched event, various Snowplow enrichments will work with the Google Analytics data “for free”, including:

  • Page URL parsing
  • Referer parsing
  • MaxMind geo-location lookup
  • Both useragent parsers

Fully configurable enrichments, such as the API request enrichment and the SQL query enrichment, can be used with the Google Analytics integration, just by providing Measurement Protocol events and context schemas as part of the input data to the respective enrichments.
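For instance, assuming the standard API request enrichment configuration format, a lookup could be keyed on the Google Analytics user ID held in the user context with an input along these lines (the schema URI and the userId JSON Path are assumptions until the schemas are finalised):

{
  "key": "gaUserId",
  "json": {
    "field": "contexts",
    "schemaCriterion": "iglu:com.google.analytics.measurement-protocol/user/jsonschema/1-0-*",
    "jsonPath": "$.userId"
  }
}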

The currency conversion enrichment will not work, as it is currently hard-coded to the built-in Snowplow ecommerce events.

5.4 Thoughts on supporting other behaviors

There are some interactions between the Google Analytics data and Snowplow that we are less clear on; these are flagged in orange cells in the Snowplow Notes column.

In particular:

  • Whether we should enforce the IP address anonymization on a per-request basis, given that this is a feature that Snowplow does not support yet, although we are planning to add it, per Snowplow JavaScript Tracker issue #586
  • Whether we should map the Google Analytics uid onto Snowplow’s own user_id
  • Whether we should enforce Google Analytics’ IP address and useragent overrides - we are leaning towards enforcing these, given that we have equivalent override functionality in Snowplow

We would appreciate your input on these aspects, and all others!

5.5 Ongoing Snowplow development work

We would be remiss if we did not flag that a Snowplow data engineer has started exploratory work implementing this RFC, which you can find in this pull request:

While work in this PR is relatively advanced, development on this is paused while we wait for the community’s feedback on this RFC.

6. Request for comments

This RFC represents a significant new step for Snowplow as we expand the scope of what can be tracked with the platform. We are excited about the opportunities for opening up Snowplow to existing Google Analytics users, and are interested in the impact of fully supporting a second web analytics protocol alongside Snowplow’s own protocol.

We welcome any and all feedback from the community. As always, please do share comments, feedback, criticisms, alternative ideas in the thread below.

In particular, we would appreciate hearing from people with extensive experience working with Google Analytics tagging and the Measurement Protocol. Does our proposed integration match the way you would expect to work with Google Analytics data in Redshift, Kinesis or Elasticsearch?

8 Likes

Lots of food for thought here.

What’s the best way of capturing the labels associated with custom dimensions and metrics? This state doesn’t exist in the raw data from GA, so should it be enriched using a configuration file or an API proxy (the GA API requires OAuth2)? Should a label column be added to the Iglu schemas for MP/custom_dimension and MP/custom_metric?

This also introduces a question around the versioning of custom metrics and dimensions. If I change the value that is being sent in a custom metric (say shipping weight from pounds to kilograms) how should I version this? The data for a single metric will no longer be backwards compatible but my other metrics which I haven’t changed will be - do I create a new model schema version or do custom metrics/dimensions need some independent versioning?

This is great work so far - the reasoning behind collecting and mapping Google Analytics data makes sense to me.

But let me take a step back and ask what the business objective of this RFC is. Do we want to appeal to relatively inexperienced teams currently using Google Analytics and show them the power of owning their data? Do we want to make it easier to track web analytics data by replicating what’s already being tracked on GA?

Talking to other companies, I see a clear need/desire to do more with their web analytics data, but the decision to implement Snowplow is postponed for two main reasons:

  1. It’s not an easy task to set up and maintain the data pipeline.
    There are a lot of moving parts and people are afraid they won’t have the time and/or technical skills to fully commit to Snowplow. And since they have no idea what they’re losing, they’re also uncomfortable paying for the managed solution. My point here is that once teams are past this stage, replicating GA events to Snowplow is the easiest part of the journey. Most teams are already using Tag Manager, which makes it super easy to set up an additional tag to collect events – I don’t see why a team that has committed to implementing Snowplow on AWS would be intimidated by the JS Tracker. If that makes sense, we can probably lower the entry barriers by automating the setup process via e.g. Terraform

  2. It lacks a data visualization layer.
    People expect that once the pipeline is running correctly, they’ll be able to quickly generate insights from data, as they would with e.g. Mixpanel. But they’ll face a steep learning curve to understand how data is stored and how they can build their first analyses. What’s a domain_userid, a derived_tstamp, a page_urlpath, etc. – it all looks very intimidating at first. If that’s the case, replicating GA events to Snowplow is also the easiest part of the journey. They’re not attached to the GA measurement protocol, they’re attached to the UX of the tool - it’s simple enough for everyone to use. I think people will be comfortable fully switching from GA to Snowplow if they have a visualization layer to quickly understand what can be done with all that data. They won’t know what a domain_sessionidx is, but will be able to compare new vs. returning visitors. If this makes sense, we can probably build something cool with Metabase - it’s a great open source tool that could really complement Snowplow’s offering.

These are my 2 cents! Looking forward to hearing your thoughts

Cheers,
Bernardo

1 Like

Good question! We think, to start off with, the best way to deal with this will be for people to maintain a CASE WHEN or similar lookup in SQL that converts the index for any given custom dimension or custom metric to the associated hard-coded label.

Eg:

SELECT
  CASE
    WHEN index = 1 THEN 'Difficulty'
    ELSE 'Other' 
  END AS custom_dimension_name,
  value AS custom_dimension_value

FROM atomic.com_google_analytics_measurement_protocol_custom_dimension_1

which would produce a result along the lines of:

| custom_dimension_name  | custom_dimension_value |
| Difficulty             | Easy                   |
| Difficulty             | Medium                 |
| Difficulty             | Hard                   |

Does that make sense?

1 Like

@bernardosrulzon you raise a great point! Who are the people we see as benefiting from the RFC? How does it address the two issues that tend to prevent people setting up and running Snowplow, i.e. the complexity of setting up and maintaining the data pipeline, and the lack of a data visualization layer?

The RFC is aimed at two different user types:

  1. Existing GA users who are interested in Snowplow but need to make a business case to the rest of the organization to invest the resources in it. Often these users have one or two highly valued use cases (e.g. joining their web behavioural data with CRM data) that they want to use for a PoC. This integration makes it easy for them to quickly gather a data set that is as comprehensive as their GA data set, while opening up the possibility of going beyond what GA enables them to do and demonstrating the value of owning their own data.
  2. Existing GA users who want GA360 for the BigQuery / complete data set, but cannot afford it. (They may or may not have heard about Snowplow.)

This RFC of itself doesn’t address the two blockers you identified in your post. However, we’re taking other steps to mitigate them:

1. It’s not an easy task to set up and maintain the data pipeline.

For Snowplow Insights customers we set up and run the Snowplow pipeline as a service, so we have addressed this issue for them. We would like to do more work to make open-source setup and maintenance of the pipeline easier.

2. It lacks a data visualization layer

We’re (slowly) building out more data models that should make it easier to integrate Snowplow with different visualization platforms, including open source solutions like Redash, Superset and Metabase. The challenge is that two different Snowplow users will have two totally different data sets, so there’s a balance to be struck between making visualization easier and not discouraging users from making the most of one of the features that makes Snowplow so powerful: the ability to schema your own events and entities.

It would be relatively straightforward to build a “standard” set of visualizations for users of this integration, because they would all have the same data structure.

Does the rationale make sense?

4 Likes

That should work, as Google does recommend against reusing custom dimensions/metrics where possible, since it’s difficult to make this data backwards compatible.

What about handling custom dimensions/metrics for multiple properties, where index 1 may map to one custom dimension in Property A but a different custom dimension in Property B?

@mike The tid parameter maps on to trackingId in the general schema.

So maybe something like:

SELECT
  CASE
    WHEN cd.index = 1 AND g.trackingId = 'UA-XXXX-Y' THEN 'Difficulty'
    ELSE 'Other' 
  END AS custom_dimension_name,
  cd.value AS custom_dimension_value

FROM atomic.com_google_analytics_measurement_protocol_custom_dimension_1 AS cd
JOIN atomic.com_google_analytics_measurement_protocol_general_1 AS g
  ON cd.root_id = g.root_id

2 Likes

Yep - that would certainly work. I suppose if there are a lot of variations someone could always add a mapping table and join to that as well.

Great idea - looking forward to testing the first MVP.
Small question: will it work if GA is running through Google Tag Manager?

@ilya.kozlov - yes it would!

This looks great. While I agree with @bernardosrulzon that this feature is less relevant once the maturity to think about and model event data is there, I believe it does play an important role in increasing Snowplow Analytics adoption.

Not all companies are ready; they have a journey and a million questions about Snowplow (why is there no interface? who are these guys? open source? is it any good?). The other day someone was in a panic because they were navigating the GA interface and saw a high bounce rate for a login-protected area (people refreshing every 40 minutes to see updates). Sometimes the interface makes people get lost without the proper guidance - but I digress, this is another matter.

I think this feature can play a role in marketing and help market Snowplow to the biggest web analytics base: Google Analytics users. Some of these companies want to evolve, and the next step is event-level data. At that point they either pay $150k a year or go Snowplow.

Here are two cases for which I think this feature would be highly relevant:

  • Get clickstream data from their Google Analytics Standard account without paying $150k a year.

GA Standard does not provide clickstream data; using this feature you could create a one-click implementation and get clickstream data, while keeping GA free as your analytics maturity evolves.

  • Provide a smooth transition period for Google Analytics 360 users before discontinuing 360.

They want to save $$$ and would like a smooth transition from GA360 to Snowplow Analytics: they could phase out 360 while keeping Standard for “comfort” and Snowplow for more advanced analysis. Keeping them side by side gives them time for the transition and makes it almost seamless for their internal users.

IMHO this feature is oriented more towards positioning Snowplow in the web analytics market and taking advantage of GA’s adoption than towards current users, and it could play an important role in growing the Snowplow user base.

1 Like

I absolutely love this idea and have already experimented with “hijacking” the Google Analytics request.

What I miss in the RFC are other platforms, most importantly iOS and Android. I assume that most Snowplow users do advanced analytics, which often includes mobile platforms.

I don’t know of any plugins for the SDKs but maybe it’s possible to duplicate their requests as well.

2 Likes

Hi @ian - mobile is a great suggestion. That’s something we can look at in a phase 2 release - in the meantime if anybody in the community knows if this is possible, please share!