'Serverless' Snowplow architecture


#1

Hi,

The headline’s perhaps misleading, there are servers involved, but I’m trying to make the management of the infrastructure of snowplow as light touch as possible, and with that in mind, I’m trying to write the architecture in terraform, and use SaaS (and inherently scalable) components where I can.

I have the following pipeline in mind. I’ve got as far as the S3 sink.

  1. Scala collect (CodeDeploy in an ASG)
  2. Kinesis stream
  3. Scala enrich (CodeDeploy in an ASG)
  4. Kinesis stream
  5. Kinesis S3 sink (likely a CloudWatch event starting a container every minute, with a large buffer specified, so work won’t be undertaken every minute, only when the buffer’s filled).
  6. File written and S3 event sent to AWS Lambda.
  7. AWS Lambda to start a container running storage-loader.
  8. Load to RedShift.

It mixes batch and the real-time functions, but I haven’t read anything that says this can’t be done at present.

  1. As a novice, I’d like to know if it’s appropriate for the Kinesis S3 sink to be used at this point in the chain? Or is it to be used for un-enriched data only?

  2. Can I please have your comments on any issues you might see with this architecture?

Thanks in advance,

Graham


#2

Hey @Graham-M - interesting architecture!

It mixes batch and the real-time functions, but I haven’t read anything that says this can’t be done at present.

Mixing batch and real-time is fine - after all it’s the basis for our recommended Lambda architecture:

http://discourse.snowplowanalytics.com/t/how-to-setup-a-lambda-architecture-for-snowplow/249

  1. As a novice, I’d like to know if it’s appropriate for the Kinesis S3 sink to be used at this point in the chain? Or is it to be used for un-enriched data only?

It’s not well documented yet, but yes the Kinesis S3 sink works fine with enriched data - just configure it with the gzip output.

Kinesis S3 sink (likely a CloudWatch event starting a container every minute, with a large buffer specified, so work won’t be undertaken every minute, only when the buffer’s filled).

This feels rather complex. If you have a steady stream of events, why not just leave a small Kinesis S3 sink running all the time in an ASG? If you use some kind of scheduled approach, then you have to reason about the schedule, the shards and the sink buffer settings to understand how the component will perform.

AWS Lambda to start a container running storage-loader.

You are missing Hadoop Shred, which currently only runs on EMR. This component sits between the enriched events and the StorageLoader.

Hope this helps,

Alex


#3

This is very interesting @alex. We’re also venturing into​ the real time world and wondered if we could skip the batch pipeline’s EMR and load events straight from the kinesis enriched stream to Redshift. Is that possible/recommended? It seems pointless to enrich the same data twice (stream enrich + EMR), but we would (1) lose the reduplication feature in R89, and (2) have to adapt the storage loader to handle these files correctly.

Edit: Ah, now I got the point that Hadoop Shred currently only runs on EMR. Is porting that component to real time in the product pipeline?

Edit 2: Nevermind. This is all pretty well explained in the Spark RFC (Migrating the Snowplow batch jobs from Scalding to Spark) - some pretty exciting months ahead! Looking forward to it!


#4

Hi @bernardosrulzon - just to fill you and the community in on our progress since the Spark RFC.

We are doing a lot of active R&D on exactly what you describe. In fact this is what’s driving recent releases such as Dataflow Runner v0.3.0 - because we are doing all this experimentation using Dataflow Runner, not EmrEtlRunner:

So far we have managed to drop Hadoop Enrich (eliminating the duplication of effort you mentioned), and we are just running Hadoop Shred on EMR. We’re getting about 2.7 loads per hour - the main limiting factor right now is the EMR bootstrap time, so the next step in our experiments will be using a persistent EMR cluster. We think that this should get us to being able to load Redshift every 10-15 minutes or so, but we will keep you posted.

All the ongoing work around Spark Shred and the StorageLoader port to Scala will complement this nicely - the ultimate goal being to have an RDB (Relational Database) Loader which runs on Spark Streaming and can read the enriched event stream directly. I imagine that might get us down to a load every 5 minutes.

Exciting!


#5

Hi Alex,

Many thanks for your reply, I think I’ll take your advice on the S3 sink, it should be fairly close in ASG to the enrich process, so a bit of copy and paste should do the trick.

Which regards to the Hadoop Shred, that feels like something that might be done periodically too. Can you paste a link to the setup for that portion of the chain?

(edit: Reading the docs, does using stream enrich preclude the need to use the Hadoop Shred via EmrEtlRunner?)

A slightly different question, I currently have all of my enriched data being sent to the ‘enriched-bad’ stream rather than the good stream. I’m not sure how to debug this, and have tried running the enrich on the console to see if there’s a message as to why this is happening, but I can’t see anything that stands out.

I can however see this message in the stream data:

u'Data': '{"line":"CwBkAAA....xLTAtMAA=","errors":[{"level":"error","message":"Querystring is empty: no raw event to process"}],"failure_tstamp":"2017-05-31T09:04:57.546Z"}',

Do you have advice for what this message might mean? Google doesn’t seem to be my friend here!

Thanks,

Graham


#6

Hi @Graham-M - I’d suggest creating a new thread with your bad rows problem.

It’s undocumented at the moment - but it should be fairly easy to figure out. I’d recommend starting with the vanilla Lambda architecture and then going from there (removing Hadoop Enrich).

No, Stream Enrich does event validation and enrichment; Hadoop Shred does enriched event preparation and splitting for Redshift.


#7

Thanks @alex, I feel like I might be biting off more than I can chew with my architecture, it might be time for me to reconsider the batch processing.


#8

Yep, I think starting with batch is the most straightforward path…