RFC: Snowplow webhook proxy


#1

The current pipeline allows two main methods of sending webhook data to the Snowplow collector:

  1. The Iglu webhook endpoint (via GET or POST)
  2. A custom webhook adapter written in Scala (i.e., the Mailchimp adapter)

This works well for most use cases but raises some problems when dealing with data that may meet some of the following conditions:

  • The data is not well modelled and requires transformation (such as converting epoch seconds to ISO8601 timestamps)
  • The webhook-producing service is only capable of sending multiple event types to a single URL
  • The producing service requires authentication in the form of a challenge-response system (e.g., Xero)
  • The schema for a webhook service may vary significantly between identical users of a product (e.g., a form service will only return form field values for the form created)
  • The end user wishes to authenticate the origin of the payload by comparing the signature received from the webhook provider to the expected signature (e.g., Stripe HMAC signing) or by checking against a vendor provided secret
  • The producing service sends notifications in a non-standard format (e.g., CSV row, XML document)
  • The producing service sends multiple events in batch (e.g., a subscribe webhook may send 10 user events at once rather than 10 sequential requests)
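To make the signature-verification case above concrete, here is a minimal sketch of a Stripe-style check (assuming a hex-encoded HMAC-SHA256 signature over the raw payload; the function name and header handling are illustrative, not any vendor's actual API):

```python
import hashlib
import hmac

def verify_signature(payload: bytes, received_sig: str, secret: str) -> bool:
    """Compare the vendor-supplied signature against the HMAC-SHA256 of the
    raw payload, using a constant-time comparison to avoid timing attacks."""
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_sig)
```

Requests failing this check could be rejected (or routed to a bad stream) before they ever reach the collector.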

Although it is possible to factor this into the current collector service, doing so may increase the cost of development and would tie the work to collector/enricher releases.

I’d like to propose a possible alternative to putting this logic into the collector.

A service that sits in front of the collector could act as a proxy: receiving webhook requests, performing any processing, and finally forwarding the well-formatted request to the collector endpoint.

All three major cloud platforms have some variation of this service (AWS Lambda, Azure Functions, Google Cloud Functions), though I’ll focus specifically on AWS as Snowplow currently runs on AWS.

On AWS the service would consist of a number of API Gateway endpoints (URLs), with each URL tied to a single Lambda function. Each Lambda function would contain standalone code to:

  • Accept the incoming request
  • Process/authenticate/respond to the webhook request in question
  • Forward the request (if required) to the collector endpoint
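As a rough illustration, one of these Lambda functions might look like the following (a sketch only - the collector URL, payload shape, and field names are assumptions, and real code would need error handling and dead-lettering):

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical collector endpoint - substitute your own deployment
COLLECTOR_URL = "https://collector.example.com/com.acme/v1"

def to_iso8601(epoch_seconds: int) -> str:
    """Convert epoch seconds to an ISO 8601 UTC timestamp."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%SZ"
    )

def handler(event, context):
    """API Gateway -> Lambda entry point: accept, transform, forward."""
    payload = json.loads(event["body"])
    # Example transformation: epoch seconds -> ISO 8601
    if "timestamp" in payload:
        payload["timestamp"] = to_iso8601(payload["timestamp"])
    req = urllib.request.Request(
        COLLECTOR_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # forward the cleaned payload to the collector
    return {"statusCode": 200, "body": "ok"}
```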

As a core value of Snowplow is durability, it is imperative that data that passes or fails this step is kept in semi-persistent storage (e.g., one or more Kinesis streams) to guard against possible data loss.

This approach has several advantages:

  • It is easy to update and deploy functions
  • It is possible to easily manipulate payloads
  • It is easy to serve a response (for authentication or authorization purposes)
  • Invocations of the function are “serverless” so there is less need to worry about scaling resources
  • It is possible to do extensive and isolated testing for each function
  • Errors and exceptions are easy to monitor and alert on (using Cloudwatch)
  • The approach is cloud-platform agnostic
  • Enables decoupling of the collector logic from webhook specific processing
  • Raises the possibility of allowing the webhook processor to perform schema inference (e.g., a pattern sent without a schema could be matched against Iglu repositories before being sent to the collector)

However, this approach does come with some caveats:

  • An additional (but optional) piece of the pipeline increases complexity
  • If the system does not durably store data to Kinesis/other source this increases the potential for data loss

I’d love to hear what the community thinks about this idea and if you have any other alternative approaches or methods that you’ve used!


#2

Hey @mike - thanks for raising this, it’s a really interesting idea.

As I understand it, anything that could be done across Scala Stream Collector plus Stream/Spark Enrich (let’s call this “status quo”) to implement a webhook could be done in an upstream Snowplow webhook proxy (call this “proxy”) - in other words, the two approaches are functionally equivalent. If I’ve missed something that only a proxy could do, shout!

But assuming they are functionally equivalent, it then becomes a question of which is preferable.

Here are the pros and cons I came up with:

Proxy approach

+ Allows webhook adapters to be written in more languages
+ De-couples webhook releases from snowplow/snowplow releases
+ Prevents webhook-specific functionality from bleeding into Scala Stream Collector (a subset of webhooks)
- Increases code complexity (a proxy app is more complex than the adapter files in the status quo)
- Leads to fragmentation of Snowplow webhook support (multiple competing/conflicting implementations, including non-open-source ones, of the Acme webhook for Snowplow)
- Increases fragility - because you are adding a second ack-less step between the source system and the durable raw event stream [unless you “cut out the middleman” and the proxy writes directly to Kafka/Kinesis/Pub/Sub etc]
- Makes it impossible to support webhooks out of the box in Snowplow Mini

Status quo approach

+ Very simple to deploy - vanilla Snowplow/Snowplow Mini install supports N webhooks. No proliferation of web servers unlike proxy approach
+ More robust - no additional moving parts
+ Well standardized - Snowplow supports N blessed webhooks. Every Snowplow user gets the same events from Acme
- Writing webhook adapters in Scala is intimidating/high barrier to entry
- Over time Scala Stream Collector will have to grow awareness of certain webhooks’ behavior
- Testing is more complex (especially when we add webhooks that require Scala Stream Collector changes)

I’m sure I’ve missed some strengths and weaknesses of both approaches - look forward to further thoughts on this! I think what’s interesting though is that there may be a “tipping point” around the complexity of a specific webhook provider (particularly around authentication/authorization), where it then makes sense to go the proxy route rather than the status quo.

I’d be really interested to see one of the most complex webhooks (Xero?) implemented as a proxy.

I’m also keen to consider whether there’s a “third way”, which combines some of the benefits of both approaches. For example:

  • We know that writing Scala code is a negative for a lot of possible webhook contributors, but actually webhook implementations all follow quite a fixed set of rules - so could we provide a domain-specific language to make it much easier to write these webhooks?
  • We know that testability of these webhooks is important - should we be coming up with an “Avalanche for webhooks”, which makes the roll-out and continuing integration testing of these webhooks much less painful?
  • We know that the webhook adapters are not particularly well isolated from the snowplow/snowplow mono-repo - what if we extracted each of these as a library, with its own release cadence? What if we turned the webhook adapters into full-blown Java modules?

One thing that will definitely help with the build-out and ongoing maintenance of webhook support will be the addition of schema inference - @anton and @BenFradet are working on the RFC for this and will share in due course.

Look forward to the community’s thoughts on all this!


#3

Avalanche for webhooks - or at least some easier way of testing them - definitely seems like a win.

I think including a preexisting DSL could work and is something I considered (e.g., including the jq binary and mapping schemas to a jq expression).

One thing around this that might meet most of the criteria is to embed Rhino within the collector, in a similar way to the enrichment process. A folder (or DynamoDB table) could contain arbitrary JavaScript that executes to process/format/filter a JSON payload that comes through a certain endpoint. This could conform to a generic JSON schema that also allows for things like auth/secret management, e.g.,

{
    "auth": {
        "type": "basic",
        "username": "user",
        "password": "pass"
    },
    "secret": {
        "location": "header",
        "key": "X-Authorization-Header",
        "value": "secret123"
    },
    "transform": {
        "javascript": "ZnVuY3Rpb24gcHJvY2Vzcygpew0KcmV0dXJuIDE7DQp9"
    }
}
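To make that idea concrete, a processor reading such a config might do something like this (a minimal Python sketch of the concept only - the helper names are hypothetical, and the actual execution of the decoded JavaScript would be handed off to the embedded JS engine):

```python
import base64
import json

def check_secret(config: dict, headers: dict) -> bool:
    """Validate the vendor-provided secret against the configured location/key."""
    secret = config["secret"]
    if secret["location"] == "header":
        return headers.get(secret["key"]) == secret["value"]
    return False

def load_transform(config: dict) -> str:
    """Decode the base64-encoded JavaScript transform for the JS engine to run."""
    return base64.b64decode(config["transform"]["javascript"]).decode()

# Example config matching the JSON schema sketched above
config = json.loads("""
{
    "secret": {"location": "header", "key": "X-Authorization-Header", "value": "secret123"},
    "transform": {"javascript": "ZnVuY3Rpb24gcHJvY2Vzcygpew0KcmV0dXJuIDE7DQp9"}
}
""")
```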

#4

Hey @mike - yes, potentially JavaScript, either in the collector or in enrich, or perhaps both: we want to minimize the amount of work done in the collector, but some webhooks will need a small amount of in-collector response behavior.

Another idea is something that’s not Turing complete - given we probably want to do relatively simple transformations (“drop this field”; “change this epoch to ISO8601”). There’s an approximate equivalent to XSLT for JSON called Jolt, which could be interesting.


#5

Just my two cents here:

Pretty much every webhook we’ve wanted to work with has required us to write a custom adapter to forward the data to Snowplow. We’ve always used AWS Lambda (Python or Node) / API Gateway and it works pretty well. That said, it sometimes requires getting into the nuts and bolts of API Gateway for things like custom authorization and error tracing. Worse than that, we lose the data from malformed webhooks.

IMO it might be overkill to learn a new DSL or Scala if most users are only utilizing webhooks from a handful of 3rd party services. But I still see value in a generalized adapter that could be further customized, if it could be done in a simple way or at least in a widely used language like Python or JavaScript…


#6

Right - the challenge for us at Snowplow is: how can we get the adapters that you’ve written back into the core platform, Travis, so that the whole community can benefit? That’s one of the chief challenges I want a revised webhook approach to solve for…


#7

Jolt looks quite interesting, but does it allow for use cases like converting epoch to ISO8601? The docs sound like they lean towards modifying structure exclusively.

I do like the webhook approach but agree with @alex - we’d need a way to contribute this upstream (possibly with a separate webhook function repository?) and then another method to ensure that there’s no potential for data loss.


#8

I read it as we can add our own transformers to Jolt:

Currently, all the Stock transforms just effect the “structure” of the data. To do data manipulation, you will need to write Java code. If you write your Java “data manipulation” code to implement the Transform interface, then you can insert your code in the transform chain.
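For reference, a stock Jolt shift spec looks roughly like this (illustrative only - the field names are made up). It moves and renames fields, but converting the epoch value itself would need a custom Java Transform inserted into the chain as described in the quote:

```json
[
  {
    "operation": "shift",
    "spec": {
      "user": {
        "email": "subscriber.email",
        "signup_epoch": "subscriber.signup_epoch"
      }
    }
  }
]
```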


#9

That sounds like it would work. Would we just then implement a bunch of generic transforms that can be referenced from the Jolt spec?


#10

Yep, we’d have a library of them that we could use…

What do you reckon @knservis - we’ve been talking around this topic separately as well?