Proposing the Snowplow Relay initiative



We are excited to propose the Snowplow Relay initiative.

Snowplow Relay is an initiative for feeding Snowplow enriched events into third-party tools or destinations. Example destinations include SaaS marketing platforms, open-source machine learning engines or fraud detection services. We call an individual app that feeds Snowplow events into a specific destination a relay.

These relays will be open-source, cloud native and designed with the consent of data subjects at the forefront. They will operate in near real-time, running on AWS and GCP.

Depending on your background, you may be wondering how Snowplow Relay compares to the various tag management solutions widely used in our industry. Let’s take a look back at the tag management ecosystem before diving into what makes Snowplow Relay different.

Tag management originated as a tool for web analytics, so let’s start there.

1. Tag management for the web

Working in the web environment, you may well have used an in-browser tag manager, such as Google Tag Manager or Tealium, to route customer behavioral data to third-party SaaS tools.

Let’s call the service that you want to send data to Acmetrics. You would typically configure your tag manager to:

  • Initialize Acmetrics’ JavaScript library (or “SDK”) on your web pages
  • Observe the end user’s behavior
  • Send relevant data about the end user’s behavior to Acmetrics via its JavaScript library

This data flow is shown below:

In-browser tag managers represent a powerful abstraction layer between your website and your business analytics requirements; marketing teams have often used tag managers to prevent their tagging needs from being blocked or delayed by their peers in IT or Software Engineering.

The Snowplow JavaScript Tracker is very often called from a tag manager - for example, here is our guide to setting up the JS Tracker with Google Tag Manager.

2. Equivalents to tag management for mobile apps

In the mobile app environment, things evolved quite differently to the web. If you want to route in-app behavior to a third-party tool, then you typically have three distinct options:

  1. An in-app analytics manager
  2. An in-app JavaScript tag manager
  3. A software-as-a-service vendor who will route your events server-side

Let’s look at these options in turn.

2.1 In-app analytics managers

An in-app analytics manager is a client-side approach, somewhat equivalent to a browser tag manager. You add Acmetrics’ mobile SDK and any other tracking SDKs into your mobile app, and the in-app analytics manager presents a unified, abstracted interface over those SDKs. You can then instrument your analytics tracking once, and those events will be sent to Acmetrics and the rest of your SaaS tools.

The primary example of an in-app analytics manager is ARAnalytics, which is for iOS/Mac only.

2.2 In-app JavaScript tag managers

Vendors such as Tealium and Google Tag Manager (GTM) offer a “hybrid” JavaScript-powered approach for mobile apps, where:

  1. You embed an SDK into your app (the Firebase SDK in the case of GTM)
  2. You instrument your app by making calls to the tag manager library to record user behavior
  3. The tag manager library regularly fetches your latest routing rules from the tag manager’s own servers
  4. The rules are typically expressed as JavaScript and invoked in a hidden browser frame inside your app
  5. The in-app events are thus sent to whatever destinations you have configured, directly from the client device

This is a fairly complex workflow - for more details check out these links:

Note that Tealium can also operate as a SaaS analytics router (see section 2.3 below).

2.3 SaaS analytics routers

The more common approach in mobile has been to use a SaaS vendor such as Segment or mParticle to route your behavioral data to your third-party destinations from their own servers.

A tool such as Segment works like this:

  1. You add the Segment library into your mobile app
  2. You instrument your app by making calls to the Segment library to record user behavior
  3. The Segment library sends all of these in-app events to Segment’s servers
  4. From there, Segment routes the in-app events to whatever destinations you have configured

A simplified data flow for a SaaS analytics router is shown here:

3. Challenges with client-side approaches

While in-browser tag managers and in-app analytics managers have been hugely empowering tools for data and marketing teams, their limitations have become manifest over time. The two major issues for client-side approaches are:

  1. Web page or mobile app bloat and slowdown
  2. Data leakage

Let’s cover both of these briefly.

3.1 Web page or mobile app bloat and slowdown

In a browser context, pulling in multiple third-party tracking libraries has often led to significant slowdowns on initial page loads and then subsequent page performance. Tracking down a “misbehaving tag” is a common task for developers and marketers working with tag managers.

In a mobile app context, adding multiple analytics libraries or “SDKs” into a mobile app has inevitably led to increases in the app’s install size; post-install, we then see significant increases in network traffic as each of the analytics libraries transmits its own event stream to its own servers.

3.2 Data leakage

By their very nature, client-side tag and analytics managers bring third-party code, much of it proprietary and obfuscated, into the host environment of the website, web app or mobile app.

As a site or app owner, it is very difficult to limit what that code can do - after all, it is code executing in our end user’s environment, just the same as our code. Instead, we have to scrutinize the terms and conditions of our various vendors to understand how their code should behave.

One of the worst forms of misbehavior for third-party code is around “data leakage”. Data leakage is where third-party code collects identity or behavioral data from the client which goes above and beyond its reasonably-expected remit; a common end-game for data leakage is building some kind of centralized data asset which the offending third-party then monetizes.

These client-side problems have tilted the balance more recently towards server-side approaches - even major in-browser tag managers like Tealium have introduced server-side capabilities.

4. Data governance and server-side data control

Although server-side tag managers avoid the problems of client bloat and data leakage, another challenge is rapidly emerging in that field: that of data governance.

GDPR and the wider data privacy movement reinforce the importance of keeping tight control over how and when behavioral signals from individual data subjects are utilized. Simply put, the idea of multiplexing all user event data to all destinations for arbitrary further analysis and processing seems increasingly problematic in a GDPR world.

The alternative is fine-grained control of behavioral data routing, managed from the server-side. This is a complex area, and could include:

  • Performing identity resolution or stitching to map events to an underlying data subject
  • Capturing consent from data subjects for certain aspects of their digital behavior to be routed to certain third-parties, in support of specific use cases
  • Routing that data to third-party systems
  • The auditing/logging of that data routing, to ensure compliance with regulations and in support of specific data subject rights, such as the Right to be Forgotten
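To make the steps above more concrete, here is a minimal Python sketch of consent-gated routing with an audit trail. Everything in it is hypothetical: the `ConsentRegistry` class, the `route_event` function and the `subject_id` field (standing in for the output of identity resolution) are illustrative names, not part of Snowplow or of any proposed Relay API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ConsentRegistry:
    """Hypothetical store of the destinations each data subject has allowed."""
    allowed: Dict[str, Set[str]] = field(default_factory=dict)

    def permits(self, subject_id: str, destination: str) -> bool:
        return destination in self.allowed.get(subject_id, set())


def route_event(event: dict, destinations: List[str],
                consent: ConsentRegistry, audit_log: List[dict]) -> List[str]:
    """Return the destinations this event may be sent to, logging every decision."""
    subject_id = event["subject_id"]  # assumed to come from identity resolution
    permitted = []
    for destination in destinations:
        ok = consent.permits(subject_id, destination)
        # Record both routed and withheld events, so the log can support audits
        audit_log.append({"event_id": event["event_id"],
                          "subject_id": subject_id,
                          "destination": destination,
                          "routed": ok})
        if ok:
            permitted.append(destination)
    return permitted


consent = ConsentRegistry({"subject-1": {"acmetrics"}})
audit: List[dict] = []
event = {"event_id": "e-1", "subject_id": "subject-1"}
targets = route_event(event, ["acmetrics", "other-tool"], consent, audit)
```

In this example the event is routed to `acmetrics` only, while the audit log records one entry per destination considered, including the one that was withheld.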

The current crop of server-side tag managers largely pre-dates the data governance challenge; at Snowplow we believe that there is a need to take a fresh approach to routing behavioral data to third-parties, designing in data governance from the start.

5. Introducing our Relay initiative

Snowplow Relay, then, is a new initiative for feeding Snowplow enriched events into third-party tools or destinations, from SaaS marketing platforms to open-source machine learning engines to fraud detection services.

Each individual relay app will run server-side - at this point it is clear that server-side analytics routing is the way forward, for the reasons explained above. Each relay will take the Snowplow enriched event stream as its starting point, transform it into a format which is compatible with the destination and then feed that transformed event into the destination.

Individual Snowplow relays will be open-source, cloud native and designed with the consent of data subjects at the forefront. Let’s cover these values in turn.

5.1 Open source

Open source is hugely important to Snowplow in general and to the Snowplow Relay initiative specifically. We believe building this in the open will:

  • Maximize contributions - we expect that the majority of relays will be authored by others - perhaps Snowplow community members, or the third-party destinations themselves
  • Improve accountability and auditability - in a world where data privacy and governance are increasingly important, Snowplow relays must be auditable by security and data officers. “Black boxes” are untenable here

5.2 Cloud native

Snowplow runs natively on AWS (batch and real-time pipelines) and Google Cloud Platform (real-time pipeline). It is important that Snowplow relays can be run on AWS and GCP with a minimum of fuss.

5.3 Data subject consent-oriented

This is the most challenging design goal.

It is relatively easy to create a Relay which simply forwards events into a third-party system with some light structural transformation. It is much more challenging to create a Relay which deeply understands which data subject each individual event relates to, and what that data subject has permitted to be done with that event, for example in terms of routing that event.

We have some valuable building blocks for integrating data subject consent into the Relay initiative - for example, the consent tracking we recently added into our major trackers. However, there are still a lot of unanswered questions here.

6. Anatomy of a Relay

This RFC represents the “draft specification” for building a Snowplow Relay.

The conceptual architecture of a Relay looks like this:

6.1 Key constraints of a Relay

A Relay has the following constraints:

  • It should run in near-real-time
  • It should be stateless - it cannot preserve or retrieve state across multiple events
  • It will work in an at-least-once fashion - we cannot guarantee exactly-once processing in a Relay
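One practical consequence of these constraints is that a destination may receive the same event more than once, and a stateless relay cannot remember what it has already sent. A minimal Python sketch of one way to cope, assuming the destination can deduplicate on a key: the relay forwards the event's `event_id` as an idempotency key. The `IdempotentDestination` class and `relay` function are hypothetical stand-ins, not a real destination API.

```python
class IdempotentDestination:
    """Hypothetical destination that ignores events it has already stored."""

    def __init__(self):
        self.stored = {}

    def write(self, idempotency_key: str, payload: dict) -> bool:
        if idempotency_key in self.stored:
            return False  # duplicate delivery, safely ignored
        self.stored[idempotency_key] = payload
        return True


def relay(event: dict, destination: IdempotentDestination) -> bool:
    # The relay keeps no state between events; it only passes the key along.
    return destination.write(event["event_id"], {"name": event["event_name"]})


destination = IdempotentDestination()
event = {"event_id": "c6ef3124-b53a-4b13-a233-0088f79dcbcb",
         "event_name": "page_view"}
first = relay(event, destination)
second = relay(event, destination)  # redelivery, e.g. after a consumer restart
```

Here the second delivery is a no-op on the destination side. Destinations that offer no deduplication would simply see the duplicate.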

6.1 Core components of a Relay

The core components of a Relay are:

  1. Read stage, from a stream of Snowplow enriched events
  2. Transform stage, where we apply a mapping of the Snowplow enriched event properties to the data structure expected by the destination
  3. Write stage, where we feed the transformed data into the destination
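As a rough illustration of how the three stages fit together, here is a minimal Python sketch. The `run_relay` function and its parameters are assumptions made for this example rather than a proposed interface, and an in-memory list stands in for the enriched event stream and for the destination.

```python
from typing import Callable, Iterable, List


def run_relay(read: Callable[[], Iterable[dict]],
              transform: Callable[[dict], dict],
              write: Callable[[dict], None]) -> int:
    """Process events one at a time; no state is carried between events."""
    count = 0
    for event in read():
        write(transform(event))
        count += 1
    return count


# In-memory stand-ins for the enriched event stream and the destination
enriched_stream = [
    {"event_id": "e-1", "event_name": "page_view", "user_id": "u-1"},
    {"event_id": "e-2", "event_name": "link_click", "user_id": "u-2"},
]
sent: List[dict] = []

processed = run_relay(
    read=lambda: iter(enriched_stream),
    transform=lambda e: {"id": e["event_id"],
                         "type": e["event_name"],
                         "user": e["user_id"]},
    write=sent.append,
)
```

Keeping the stages as separate functions means a real relay could swap the in-memory `read` for a stream consumer and the `write` for a destination client without touching the transform.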

Let’s cover each of these in turn.

6.2 Relay: Read stage

In the Read stage, the Relay will read the event from the Snowplow enriched event stream - for example, the Amazon Kinesis stream or Google Cloud Pub/Sub topic containing the events.

To add flexibility, we would like to support filters in the Read stage: filters would let you configure the Relay to silently discard certain Snowplow event types, so that they are not relayed into the destination. The initial filters would likely be an optional whitelist or, alternatively, a blacklist of event types.
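A filter of this kind could be sketched in Python as follows. This is an illustration only: the `filter_events` function is hypothetical, and it assumes the event type can be read from the `event_name` field of the enriched event.

```python
from typing import Iterable, Iterator, Optional, Set


def filter_events(events: Iterable[dict],
                  whitelist: Optional[Set[str]] = None,
                  blacklist: Optional[Set[str]] = None) -> Iterator[dict]:
    """Yield only the events that pass the configured filter, if any."""
    if whitelist is not None and blacklist is not None:
        raise ValueError("configure a whitelist or a blacklist, not both")
    for event in events:
        event_type = event["event_name"]
        if whitelist is not None and event_type not in whitelist:
            continue  # silently discard: not on the whitelist
        if blacklist is not None and event_type in blacklist:
            continue  # silently discard: on the blacklist
        yield event


events = [
    {"event_id": "e-1", "event_name": "page_view"},
    {"event_id": "e-2", "event_name": "page_ping"},
    {"event_id": "e-3", "event_name": "link_click"},
]
only_views = list(filter_events(events, whitelist={"page_view"}))
no_pings = list(filter_events(events, blacklist={"page_ping"}))
```

With no filter configured, every event passes through unchanged.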

6.3 Relay: Transform stage

In the Transform stage, the Relay will apply a mapping of the Snowplow enriched event properties to the data structure expected by the destination. This is the most complex step, involving a deep familiarity with the data structure that the destination is expecting.

We envisage three types of mapping rule:

  • Static, where there is a fixed, universally correct mapping between a specific Snowplow event datapoint and an equivalent datapoint expected by the destination. This static mapping would be hardcoded into the Relay
  • Dynamic, where each Snowplow user would want to set up a custom mapping
  • Hybrid, where there might be a dynamic mapping with a static fallback

Our current assumption is that mapping rules will need to be relatively fixed; Turing-complete mappings (e.g. by using a scripting language like JavaScript) will be out-of-scope.
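One way to picture the three rule types is as plain lookup tables, with no scripting involved. The Python sketch below is an assumption-laden illustration: the destination field names (`insert_id`, `time`, `customer_id`, `url`) are invented, and `STATIC_MAPPING`, `HYBRID_DEFAULTS` and `transform` are hypothetical names rather than a proposed configuration format.

```python
from typing import Dict

# Static rules: hardcoded into the relay, not overridable by the user
STATIC_MAPPING: Dict[str, str] = {
    "event_id": "insert_id",
    "collector_tstamp": "time",
}

# Hybrid rules: static fallbacks that a user-supplied mapping may override
HYBRID_DEFAULTS: Dict[str, str] = {
    "user_id": "user_id",
}


def transform(event: dict, dynamic_mapping: Dict[str, str]) -> dict:
    """Rename Snowplow fields to destination fields using lookup tables only."""
    # Later tables win: dynamic rules override hybrid defaults,
    # and static rules override everything.
    mapping = {**HYBRID_DEFAULTS, **dynamic_mapping, **STATIC_MAPPING}
    return {dest: event[src] for src, dest in mapping.items() if src in event}


event = {
    "event_id": "e-1",
    "collector_tstamp": "2018-06-01 12:00:00.000",
    "user_id": "u-1",
    "page_url": "https://example.com/",
    "app_id": "web",
}
custom = transform(event, {"user_id": "customer_id", "page_url": "url"})
default = transform(event, {})
```

Fields with no rule, such as `app_id` here, are dropped. Because the rules are data rather than code, they stay easy to audit.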

6.4 Relay: Write stage

In the Write stage, the Relay will feed the transformed event into the destination.

This process will not be immune to a major outage in the destination or the destination’s APIs - a relay may support some minimal retry-on-failure, but it will not provide full guarantees that events will definitely be written to the destination.
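As an illustration of what “minimal retry-on-failure” could mean, here is a short Python sketch with a bounded number of attempts and exponential back-off. The `write_with_retry` function, its parameters and the choice of `ConnectionError` as the retryable failure are all assumptions for the example.

```python
import time
from typing import Callable


def write_with_retry(send: Callable[[dict], None], payload: dict,
                     max_attempts: int = 3, base_delay: float = 0.1,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to send a payload a bounded number of times, then give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            send(payload)
            return True
        except ConnectionError:
            if attempt < max_attempts:
                # Exponential back-off: base_delay, then 2x, 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
    return False  # gave up: delivery to the destination is not guaranteed


calls = {"n": 0}


def flaky_send(payload: dict) -> None:
    """Stand-in destination client that fails twice and then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("destination unavailable")


delays = []  # capture the back-off delays instead of actually sleeping
delivered = write_with_retry(flaky_send, {"id": "e-1"}, sleep=delays.append)
```

When every attempt fails the function returns `False` and the event is dropped, which is consistent with the limited delivery guarantee described above.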

7. Certification

We are considering implementing a lightweight certification process to help Snowplow users know which community-contributed relays they can feel comfortable adopting.

The main concerns of a certification program would be:

  • Does the Relay support the current Snowplow enriched event format?
  • Does the Relay support all of the mandatory features that make a relay a relay?
  • Does the Relay support - or worse, encourage - any bad behaviors, for example around data privacy?

We could provide Snowplow relays which pass certification with a live GitHub badge to make their status clear.

8. Released and upcoming relays

8.1 Released relays

This RFC is a little “late” - we have been experimenting with the concepts set out above with the release of two initial relays:

  1. Snowplow Piinguin Relay (release post) - a relay which takes PII transformation events from the Snowplow pipeline and feeds them into our Piinguin service
  2. Snowplow Indicative Relay (release post) - a relay which sends Snowplow enriched events into Indicative (currently AWS-only)

We are mindful that these two relays pre-date this RFC, so please don’t treat the design decisions implicit in those two relays as being set in stone; those relays can and will be revised following community feedback from this RFC.

8.2 Upcoming relays

We are currently building a prototype Relay for Amplitude, the product analytics service for mobile apps.

Other relays that our customers and community have expressed an interest in include:

  • Braze
  • Google Analytics
  • Facebook
  • Intercom
  • Vero

If you are interested in contributing to one of the above relays, please create a new thread in our Discourse.

9. Out of scope

We have no plans to support the Relay initiative for users of the Snowplow AWS batch pipeline at this time.

We have no plans to support “historic replay” of an existing Snowplow event archive through a relay at this time - although this would be achievable with some additional components.

As discussed above:

  • We have no current plans to support Turing-complete data mappings in relays
  • We have no current plans to add bulletproof back-off-and-retry to relays, for the case where a destination suffers a sustained outage. This is something we could revisit in the future

10. Request for comments

This RFC represents a hugely exciting new initiative for Snowplow, and so we welcome any and all feedback from the community. As always, please do share comments, feedback, criticisms and alternative ideas in the thread below.

In particular, we would appreciate any experience or guidance you have from working with existing tag managers in general, or ideally server-side routers and multiplexers like Segment and mParticle.

Finally, feel free to explore the Snowplow Indicative Relay and use that to provide feedback. We look forward to your thoughts!

