Snowplow Serverless

Hi Snowplowers,

I’m excited to introduce you to a project I’ve been working on recently, which I am tentatively naming Snowplow Serverless: an implementation of (a minimal subset of features of) the Snowplow Collector and Enrich components entirely as functions for AWS Lambda, using the Serverless framework.

To give a bit of background, most of my posts on here are based on my work leading the data architecture at Property Finder Group, where we are heavy users of the Snowplow streaming stack.

However, I’ve worked in the charity sector in the past and continue to do occasional pro-bono advisory work with small charities and social enterprises. For these types of organisations, even the most basic Snowplow infrastructure is prohibitively expensive; the cost of a minimal real-time Snowplow deployment with a relational DB is in the order of hundreds of dollars a month, which immediately places it out of reach.

(Snowplow Mini goes part of the way but serves the distinct use case of experimentation for new users, rather than production cost-saving.)

In contrast, a Lambda-based deployment such as this makes it possible to process several million events per month for just a few dollars.
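To make that claim concrete, here is a rough back-of-the-envelope calculation. The per-request and per-GB-second prices are AWS's published Lambda rates; the 200 ms duration, 128 MB memory, and two invocations per event (collector + enrich) are my assumptions, not measurements:

```python
# Back-of-the-envelope Lambda cost for 3M events/month.
# Prices are AWS's published Lambda rates (before the free tier is applied);
# duration, memory, and invocations-per-event are assumptions.
EVENTS_PER_MONTH = 3_000_000
INVOCATIONS_PER_EVENT = 2            # collector + enrich
AVG_DURATION_S = 0.2                 # assumed 200 ms per invocation
MEMORY_GB = 128 / 1024               # smallest Lambda memory tier

PRICE_PER_REQUEST = 0.20 / 1_000_000     # $0.20 per million requests
PRICE_PER_GB_SECOND = 0.0000166667

invocations = EVENTS_PER_MONTH * INVOCATIONS_PER_EVENT
request_cost = invocations * PRICE_PER_REQUEST
compute_cost = invocations * AVG_DURATION_S * MEMORY_GB * PRICE_PER_GB_SECOND

print(f"Requests: ${request_cost:.2f}, compute: ${compute_cost:.2f}, "
      f"total: ${request_cost + compute_cost:.2f}")
```

Under those assumptions that works out to roughly $3.70/month of Lambda charges for 6M invocations; API Gateway and Kinesis shard-hours add their own (modest) charges on top.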

I’m a long way from being a Serverless crusader, and Lambda certainly isn’t for everyone. Nonetheless, cost-saving aside, there are undeniably other benefits of this approach, such as:

  • One-click deployment
  • Seamless scaling
  • Reduced sysadmin overhead
  • Reduced code complexity – much of the functionality of the current Snowplow code, such as concurrency, retries, and so on, is delegated to the Lambda execution engine

This code is currently extremely experimental, implements a very basic set of Snowplow functionality, and is almost definitely not for production use. In particular, the following Snowplow features are not yet supported:

  • Custom Iglu schemas (only Iglu Central events are supported)
  • Custom enrichments
  • GeoIP enrichment
  • Webhooks
  • Graceful handling of bad collector requests
  • Graceful handling of Kinesis failures
  • Snowplow monitoring
  • 3rd party cookies (network_userid)
  • Redirects
  • Any sinks other than Kinesis

Nonetheless, I’m pretty happy with it, and think this approach has huge potential, particularly when paired with other serverless AWS features. For example, by forwarding the enriched events stream to Kinesis Firehose, events could be stored in S3 and queried using Amazon Athena for a fraction of a cent per query.
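The reason Athena can query the stream's output so cheaply is that enriched events are plain tab-separated lines with a fixed column order (app_id, platform, etl_tstamp, collector_tstamp, and so on – see Snowplow's canonical event model for the full list). As a rough illustration of what a consumer sees, here is a minimal parser mapping the first few columns; the sample line and its values are invented:

```python
# Minimal sketch: picking fields out of a Snowplow enriched event line.
# Enriched events are TSV with a fixed column order; only the first few
# columns are mapped here, and the sample line below is invented.
FIELDS = ["app_id", "platform", "etl_tstamp", "collector_tstamp",
          "dvce_created_tstamp", "event", "event_id"]

def parse_enriched(line: str) -> dict:
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))  # columns beyond FIELDS are dropped

sample = ("my-site\tweb\t2017-04-16 10:00:00\t2017-04-16 09:59:58\t"
          "2017-04-16 09:59:57\tpage_view\t"
          "f6b1a8ce-0000-0000-0000-000000000000")
event = parse_enriched(sample)
print(event["app_id"], event["event"])
```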

There are, no doubt, other angles on this I haven’t considered, and I’d love to get feedback and thoughts from others in the Snowplow community. This is EXTREMELY experimental at the moment (see the README for details) but I’m happy to take it forward if there is an appetite for it.

Happy Easter!

Adam


This is really cool, Adam, I’m looking forward to hearing more about it!

I recently had similar thoughts about having a simple Lambda endpoint that will receive Snowplow events for small scale deployments.

Very interested to see how this will develop.

Personally I wish you’d picked a different language for the implementation, but I guess Scala makes it easier to reuse the existing Snowplow codebase?

This is very interesting. I think it could be useful for much more than just charities. Many companies have cost and complexity concerns as well. If you can get these other features working, it would be revolutionary!


@arikfr correct – I’m personally not a big Scala fan either, but the Snowplow shared libraries are written in Scala and make heavy use of Scalaz and functional paradigms, which don’t convert well to Java at all (I tried, and it wasn’t pretty).

This is really cool. I had a crack at refactoring the stream collector in Node.js a while ago (so it could run on Azure Functions, GCP Cloud Functions, and AWS Lambda) and got most, though not all, of the way.

I think you’ve hit the nail on the head regarding the utility of Lambda/serverless - the main things I noticed when building the cloud function (at least for the collector) were:

  • Reasonably high latencies on responses (often > 100 ms)
  • Quite good for smaller-scale sites, but at higher volumes you hit concurrency limits quickly (most services cap you at around 1,000 concurrent invocations)
  • Reasonable cost at smaller volumes, but it gets expensive otherwise
  • Some security limitations – the concurrency cap on Lambda functions means the collector is open to very simple denial-of-service attacks, and because of the way API Gateway performs throttling, high load on one API can impact the latency of other, unrelated APIs

Once serverless platforms have dealt with a few of these growing pains, I think the collector could be well suited to running this way. I suspect the enricher will eventually go serverless too, but it is more likely to move towards running on something like Apache Beam/Dataflow, where a warm cache for running a variety of enrichments will be a requirement for lower latencies.


Hi Adam,

Thanks for sharing!

I’m building a serverless Snowplow stack too, but based on the CloudFront collector, S3, Firehose and Lambda, deployable with Terraform, which hopefully I’ll be able to share as well.

The Redshift part of the stack could also be made serverless with the release of Redshift Spectrum, so I’m hoping to have the full stack serverless and defined in Terraform.

One thing I’m stuck on: how do you extract the pageViewId when the webPage context is enabled? I can’t find which URI parameter is used to send the pageViewId with each event.

Best,

Lionel


Hey @Mike, is this something you can share?

Sure!

As a disclaimer, I haven’t touched this for 18 months or so and my Node has never been good, but hopefully there’s something useful in there – I’d love it if someone got some value out of it.

I did something very similar to @li0nel (as an experiment) and wrote about it here:

The cost is crazy low, considering the uptime/data availability/scalability of the system, and I haven’t had to touch it once since deploying.

Thanks @mike, I’ll have a look, and if I make any good progress I’ll share back here.
Best
Fred

Hey @jakethomas, any chance you can share the Lambda function code?

We are very interested in your progress too.

Hi @fwahlqvist, @Mike7L and all,

We followed the experiment @jakethomas did, but had to reinvent the Lambda part.

Those who are interested can check the notes at https://www.ownyourbusinessdata.net/enrich-snowplow-data-with-aws-lambda-function/
We shared the Lambda function (Python) in a Git repo linked from the post.
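For orientation (the real code is in the repo linked above), the general shape of such a function is a handler that receives S3 event notifications for newly written collector logs and then fetches and enriches them. This is a hedged sketch of that shape, not the repo’s actual code, and the enrichment step itself is deliberately omitted:

```python
import urllib.parse

def handler(event, context):
    """Skeleton of an S3-triggered Lambda: list the objects that fired it.

    A real enrichment function would fetch each object, parse the collector
    log lines, enrich them, and write the results out; that logic is omitted.
    """
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return {"objects": objects}
```

Lambda calls the handler once per notification batch, so everything else (retries, concurrency) is left to the platform, as Adam noted at the top of the thread.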


An update: we have added a Terraform script to our repository (linked in my previous post) to make deployment of the solution quick and easy.


This looks familiar 🙂 I’m glad you found the (first half of the) pipeline easy to set up and useful. Nice work!

Hi @alevashov

Did you get the Snowplow Serverless implementation into production?
Thanks in advance.

Hi Stan

We ran it for our business website for several months, and also helped set it up for one other business.

Hi @alevashov
Thanks for responding. I assume this was deployed on AWS. Is that right?

How many transactions did your serverless implementation scale up to? That’s really the bonus question. Thanks for the Q&A.

Yes, it’s on AWS.

We have open-sourced the deployment script – check our GitHub repo.

I can’t talk about exact transaction numbers, but the limits are very high; Lambda functions can process a lot.

The best approach for a specific case is to build and test.