Optimizing AWS Stack for Medium Load Site

Hi there!

This is my first time setting up the Snowplow architecture. After a few days of testing/trying/failing and a lot of patience, I finally have a running stack! The whole experience taught me a lot about Snowplow's architecture and about AWS infrastructure as well. Kudos to the developers for building such an amazing platform!

I was hoping for some advice on the optimal AWS instance types for running the platform. I am currently using the following:

  1. Scala Collector (on EB m1.small and ELB)
  2. Kinesis streams (1 shard for each stream)
  3. Kinesis-s3 Sink (on t2.medium instance)
  4. EmrEtlRunner (on t2.medium)
  5. EMR (m1.medium)
  6. StorageLoader (t2.medium)
  7. Redshift DB
All of these are set up inside a two-subnet VPC (where relevant, since I cannot control the EMR clusters' deployments).

This is working in a test capacity, and for that purpose I have the Kinesis-s3 Sink, EmrEtlRunner and StorageLoader running on the same t2.medium machine.

So I was wondering what your recommendations would be for specs for the above, for a medium-traffic site (1M to 1.5M events/month)?
Is there an issue with running all 3 services on the same server? Is it possible to set them up through EB and ASG as well?
Since the Kinesis-s3 Sink needs to be continually running (while the other two are scheduled), how do you recommend we set this up, especially considering that the Kinesis-s3 Sink is such a critical part of the system?

This setup is for the batch model only - once this is stable, I will be looking into the realtime processor as well :)

Thanks very much!

Hi Kjain,

I had no issues running the whole RT stack (collector + enrichment + Elasticsearch storage + custom storage) on a single medium instance with 1.5-2M events/day (of course not in a production environment - in that case you have to split the collectors out and make it HA/HR).

Given your number of events, I would schedule the S3 sink as well. Most likely you would be able to process 12 hours of data in a couple of hours…
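
For example, something along these lines could work as a crontab entry (purely illustrative - the jar name, config path and the six-hour window are my assumptions, not a tested setup):

```
# Illustrative crontab entry - jar name, config path and window are assumptions.
# Start the kinesis-s3 sink every 12 hours; "timeout 6h" stops it (SIGTERM)
# after at most six hours, which should be plenty to drain the backlog.
0 */12 * * * timeout 6h java -jar /opt/snowplow/snowplow-kinesis-s3.jar --config /opt/snowplow/sink.hocon >> /var/log/snowplow/sink.log 2>&1
```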

Thanks for your reply, grzegorzewald!

Yes - the 3 services run fine on a single box for dev purposes.

How have you set up your prod environment though? Do you use EB/ELB to run the Kinesis-s3 Sink task?

Considering that the other two (EmrEtlRunner and StorageLoader) can be scheduled, how do you manage that? It seems a waste to set up an entire instance just to run these scripts once or twice a day.

Thanks!

After some trials, I have settled on setting up an EC2 instance to handle the EmrEtlRunner and StorageLoader tasks. Based on my requirements, I think a t2.medium should be able to handle it.
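
Concretely, I am planning something like the following wrapper script, triggered by cron (a rough sketch - the install paths, config file names and flags are assumptions from my setup, so adjust to yours):

```bash
#!/bin/bash
# run-batch.sh - rough sketch; paths, config names and flags are assumptions.
set -euo pipefail

# Run the batch enrichment job on EMR first...
/opt/snowplow/snowplow-emr-etl-runner \
  --config /opt/snowplow/config/emretlrunner.yml \
  --resolver /opt/snowplow/config/iglu_resolver.json

# ...and only kick off the Redshift load once EMR has finished successfully
# (set -e aborts the script if EmrEtlRunner exits non-zero).
/opt/snowplow/snowplow-storage-loader \
  --config /opt/snowplow/config/storageloader.yml
```

A single crontab line such as `0 3 * * * /opt/snowplow/run-batch.sh >> /var/log/snowplow/batch.log 2>&1` then runs the whole chain once a day, and chaining the two steps in one script avoids StorageLoader firing when the EMR run fails.
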
But for maximum availability/reliability, I think it would be best to host the kinesis-lzo-s3 task on an EB environment (load-balanced, with ASG enabled). Considering that this task is rather critical (it converts the streams into actual logged data), I think it is the best approach. But I am not sure how to configure the EB environment to automate installing the prerequisites and auto-running the script. There is some documentation about this on AWS (using the .ebextensions folder), but it is not clear to me.

Any help or suggestions would be very appreciated!

Thanks!

I have successfully managed to run the kinesis-s3 module through an EB deployment (within a private VPC subnet)! I loosely followed the process outlined for deploying the collector on EB, and added a config file to do the following:

  1. install the lzo libraries
  2. download the kinesis jar from a repo
  3. copy my config file
  4. run the jar file.
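
For reference, the config file I added looks roughly like this (an illustrative sketch only - the package names, jar URL, paths and run command are placeholders rather than my exact files):

```yaml
# .ebextensions/kinesis-s3.config - illustrative sketch, not my exact file.
# EB applies packages, then files, then commands, so the order works out.
packages:
  yum:
    lzo: []                           # step 1: install the LZO libraries
    lzo-devel: []

files:
  "/opt/snowplow/kinesis-s3.hocon":   # step 3: copy my config file
    mode: "000644"
    owner: root
    group: root
    content: |
      # kinesis-s3 sink configuration goes here

commands:
  01_download_jar:                    # step 2: download the kinesis jar
    command: "curl -sfL https://example.com/snowplow-kinesis-s3.jar -o /opt/snowplow/snowplow-kinesis-s3.jar"
  02_run_jar:                         # step 4: run the jar file
    command: "nohup java -jar /opt/snowplow/snowplow-kinesis-s3.jar --config /opt/snowplow/kinesis-s3.hocon > /var/log/kinesis-s3.log 2>&1 &"
```

One wrinkle: launching a long-running process from a deployment command like this is a bit fragile (nothing restarts it if it dies), so wrapping the jar in an init script and starting it from the services: section would probably be more robust.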

What would you suggest as an optimal EC2 instance type for this module (for a medium-traffic site)?

Thanks!