Snowplow Serverless


#1

Hi Snowplowers,

I’m excited to introduce you to a project I’ve been working on recently, which I am tentatively naming Snowplow Serverless: an implementation of (a minimal subset of features of) the Snowplow Collector and Enrich components entirely as functions for AWS Lambda, using the Serverless framework.

To give a bit of background, most of my posts on here are based on my work leading the data architecture at Property Finder Group, where we are heavy users of the Snowplow streaming stack.

However, I’ve worked in the charity sector in the past and continue to do occasional pro-bono advisory work with small charities and social enterprises. For these types of organisations, even the most basic Snowplow infrastructure is prohibitively expensive; the cost of a minimal real-time Snowplow deployment with a relational DB is in the order of hundreds of dollars a month, which immediately places it out of reach.

(Snowplow Mini goes part of the way but serves the distinct use case of experimentation for new users, rather than production cost-saving.)

In contrast, a Lambda-based deployment such as this makes it possible to process several million events per month for just a few dollars.
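To make that claim concrete, here is a back-of-envelope sketch of the Lambda bill. The per-unit rates are AWS’s published list prices at the time of writing (they may change), and the memory size, duration, and the assumption of two invocations per event (one collector, one enrich) are illustrative guesses, not measured figures; the free tier and API Gateway/Kinesis charges are excluded:

```python
# Back-of-envelope Lambda cost model for a small Snowplow deployment.
# The per-unit rates are AWS's published list prices (and may change);
# memory, duration, and invocations-per-event are illustrative guesses.
# Excludes the free tier and any API Gateway / Kinesis charges.

REQUEST_PRICE = 0.20 / 1_000_000   # USD per invocation
GB_SECOND_PRICE = 0.0000166667     # USD per GB-second of compute

def monthly_lambda_cost(events_per_month, memory_mb=128,
                        duration_ms=100, invocations_per_event=2):
    """Each event is assumed to trigger two invocations:
    one for the collector, one for enrichment."""
    invocations = events_per_month * invocations_per_event
    gb_seconds = invocations * (memory_mb / 1024) * (duration_ms / 1000)
    return invocations * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE

print(f"3M events/month = ${monthly_lambda_cost(3_000_000):.2f}")
```

Under these assumptions, 3 million events a month comes out around $2.45 of Lambda compute.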

I’m a long way from being a Serverless crusader, and Lambda certainly isn’t for everyone. Nonetheless, cost-saving aside, there are undeniably other benefits of this approach, such as:

  • One-click deployment
  • Seamless scaling
  • Reduced sysadmin overhead
  • Reduced code complexity – much of the functionality of the current Snowplow code, such as concurrency, retries, and so on, is delegated to the Lambda execution engine

This code is currently extremely experimental, implements a very basic set of Snowplow functionality, and is almost definitely not for production use. In particular, the following Snowplow features are not yet supported:

  • Custom Iglu schemas (only Iglu central events are supported)
  • Custom enrichments
  • GeoIP enrichment
  • Webhooks
  • Graceful handling of bad collector requests
  • Graceful handling of Kinesis failures
  • Snowplow monitoring
  • 3rd party cookies (network_userid)
  • Redirects
  • Any sinks other than Kinesis

Nonetheless, I’m pretty happy with it, and think this approach has huge potential, particularly when paired with other serverless AWS features. For example, by forwarding the enriched events stream to Kinesis Firehose, events could be stored in S3 and queried using Amazon Athena for a fraction of a cent per query.
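To sketch what that S3 + Athena setup might look like: everything below is hypothetical (the table name, bucket, and trimmed column list are illustrative), and because Athena maps TSV columns positionally, a real table would need to declare every field of the Snowplow enriched-event format in canonical order:

```python
# Sketch of Athena SQL for querying Snowplow enriched events that
# Firehose has landed in S3. The bucket and table/column details are
# illustrative; the enriched TSV has many more fields, and columns map
# positionally, so a real table must list every field in canonical order.

CREATE_TABLE_SQL = """
CREATE EXTERNAL TABLE IF NOT EXISTS snowplow_enriched (
  app_id              STRING,
  platform            STRING,
  etl_tstamp          STRING,
  collector_tstamp    STRING,
  dvce_created_tstamp STRING,
  event               STRING
  -- ... remaining enriched-event fields go here ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION 's3://my-enriched-bucket/enriched/';
"""

# A cheap ad-hoc aggregate over the raw files, e.g. events per type:
EVENT_COUNTS_SQL = """
SELECT event, COUNT(*) AS n
FROM snowplow_enriched
GROUP BY event
ORDER BY n DESC;
"""
```

These statements could be submitted from the AWS console or via the `boto3` Athena client’s `start_query_execution` call; since Athena bills per terabyte scanned, small ad-hoc queries over a few months of events really do cost a fraction of a cent.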

There are, no doubt, other angles on this I haven’t considered, and I’d love to get feedback and thoughts from others in the Snowplow community. This is EXTREMELY experimental at the moment (see the README for details) but I’m happy to take it forward if there is an appetite for it.

Happy Easter!

Adam


#2

This is really cool, Adam, I’m looking forward to hearing more about it!


#3

I recently had similar thoughts about a simple Lambda endpoint that would receive Snowplow events for small-scale deployments.

Very interested to see how this will develop.

Personally, I wish you had picked a different language for the implementation, but I guess Scala makes it easier to reuse the existing Snowplow codebase?


#4

This is very interesting. I think it could be useful for much more than just charities. Many companies have cost and complexity concerns as well. If you can get these other features working, it would be revolutionary!


#5

@arikfr correct - I’m personally not a big Scala fan either, but the Snowplow shared libraries are written in Scala and make heavy use of Scalaz and functional paradigms which don’t convert well to Java at all (I tried, it wasn’t pretty)


#6

This is really cool. I had a crack at refactoring the stream collector in Node.js a while ago (so it could run on Azure Functions, GCP Cloud Functions and AWS Lambda) and got most, though not all, of the way there.

I think you’ve hit the nail on the head regarding the utility of Lambda/serverless - the main things I noticed when building the cloud function (at least for the collector) were:

  • Reasonably high latencies on responses (often > 100ms)
  • Quite good for smaller-scale sites, but at higher volumes you hit concurrency limits quickly (most services cap you at around 1,000 concurrent invocations)
  • Reasonable cost at smaller volumes, but it gets expensive otherwise
  • Some security limitations: the concurrency cap on Lambda functions leaves the collector open to very simple denial-of-service attacks, and because of the way API Gateway performs throttling, high load on one API can impact the latency of other, unrelated APIs

Once serverless has dealt with a few of these growing pains, I think the collector could be well suited to eventually becoming serverless. I suspect the enricher will also go serverless, but it is likely to move towards running on something like Apache Beam/Dataflow, where having a warm cache for running a variety of enrichments will be a requirement for lower latencies.


#7

Hi Adam,

Thanks for sharing!

I’m building a serverless Snowplow stack too, but based on the CloudFront collector, S3, Firehose and Lambda, deployable with Terraform, which hopefully I’ll be able to share as well.

The Redshift stack could also be made serverless with the release of Redshift Spectrum, so I’m hoping to get the full stack serverless, in Terraform.

One thing I’m stuck on: how do you extract the pageViewId when the webPage context is enabled? I can’t find which URI parameter is used to send the pageViewId with each event.

Best,

Lionel