Optimizing AWS Stack for Medium Load Site

kjain · January 27, 2017, 11:09pm

hi there!

This is my first time setting up the Snowplow architecture, and after a few days of testing/trying/failing and a lot of patience, I finally have a running stack! The whole experience taught me a lot about its architecture and about AWS infrastructure as well. My kudos to the developers for having built such an amazing platform!

I was hoping for some advice on optimal AWS nodes to run the platform. I am currently using the following:

Scala Collector (on EB m1.small and ELB)
Kinesis streams (1 shard for each stream)
Kinesis-s3 Sink (on t2.medium instance)
EmrEtlRunner (on t2.medium)
EMR (m1.medium)
StorageLoader (t2.medium)
Redshift DB
All are set up inside a 2-subnet VPC (where relevant, since I cannot control the EMR clusters’ deployments).

This is working in a test capacity, and for that purpose I have Kinesis-s3 Sink, EmrEtlRunner and StorageLoader running on the same t2.medium machine.

So I was wondering what your recommendations would be for specs regarding the above for a medium traffic site (1M to 1.5M events/month)?
Is there an issue with running all 3 services on the same server? Is it possible to set them up through EB and ASG as well?
Since Kinesis-s3 Sink needs to be continually running (while the other 2 are scheduled), how do you recommend we set this up? Especially considering that Kinesis-s3 Sink is a very critical part of the system…
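
For a rough sense of scale, the quoted traffic can be turned into an average event rate and compared with the per-shard write limit of Kinesis (1,000 records per second per shard; there is also a 1 MB/s limit, which this sketch ignores). The 30-day month and the 10x peak factor are assumptions for illustration, not measurements.

```python
import math

# Assumption: a 30-day month, to keep the arithmetic simple.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

# Kinesis write limit per shard, in records per second.
SHARD_WRITE_LIMIT_RPS = 1000


def average_events_per_second(events_per_month: float) -> float:
    """Average event rate if traffic were spread evenly over the month."""
    return events_per_month / SECONDS_PER_MONTH


def shards_needed(events_per_month: float, peak_factor: float = 10.0) -> int:
    """Shards required if peak traffic is peak_factor times the average.

    peak_factor is an assumed safety margin, not a measured value.
    """
    peak_rps = average_events_per_second(events_per_month) * peak_factor
    return max(1, math.ceil(peak_rps / SHARD_WRITE_LIMIT_RPS))


for monthly in (1_000_000, 1_500_000):
    rate = average_events_per_second(monthly)
    print(f"{monthly:,} events/month ~ {rate:.2f} events/s, "
          f"shards needed: {shards_needed(monthly)}")
```

Under these assumptions, even 1.5M events/month averages well under one event per second, so one shard per stream leaves a lot of headroom on record count alone.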

This setup is for the batch model only - once this is stable, I will be looking into the realtime processor as well

Thanks very much!

grzegorzewald · January 30, 2017, 11:08am

Hi Kjain,

I had no issues running the whole RT stack (collector + enrichment + Elasticsearch storage + custom storage) on a single medium instance with 1.5 - 2 M events/day (of course not in a production environment; in that case you have to separate the collectors and make them HA/HR).

Given your number of events, I would also schedule S3 storage. Most likely you would be able to process 12h of data in a couple of hours…

kjain · January 30, 2017, 5:46pm

Thanks for your reply grzegorzewald!

Yes - the 3 services run fine on a single box for dev purposes.

How have you set up your prod environment though? Do you use EB/ELB to run the Kinesis-s3 Sink task?

Considering that the other 2 (EmrEtlRunner and StorageLoader) can be scheduled, how do you manage that? It seems a waste to set up an entire instance just to run this script only once a day (or twice)?
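
One way to picture the scheduled part is a small wrapper that runs EmrEtlRunner and then StorageLoader, and skips the load if the ETL step fails. This is only a sketch: the install paths and config locations are hypothetical, and the `--config` / `--resolver` flags are assumed from the batch pipeline documentation of that period, so check them against your own version.

```python
import subprocess
from typing import Callable, List, Sequence

# Hypothetical install and config paths; adjust to your own layout.
EMR_ETL_RUNNER: List[str] = [
    "/opt/snowplow/snowplow-emr-etl-runner",
    "--config", "/opt/snowplow/config/config.yml",      # assumed flag
    "--resolver", "/opt/snowplow/config/resolver.json",  # assumed flag
]
STORAGE_LOADER: List[str] = [
    "/opt/snowplow/snowplow-storage-loader",
    "--config", "/opt/snowplow/config/config.yml",       # assumed flag
]


def run_batch(
    steps: Sequence[Sequence[str]],
    runner: Callable[[List[str]], int] = subprocess.call,
) -> int:
    """Run each command in order; stop at the first non-zero exit code.

    Returns 0 if every step succeeded, otherwise the failing exit code.
    The runner argument exists so the logic can be exercised without
    launching real processes.
    """
    for cmd in steps:
        code = runner(list(cmd))
        if code != 0:
            return code
    return 0
```

A cron entry on the instance could then call `run_batch([EMR_ETL_RUNNER, STORAGE_LOADER])` once or twice a day and use the return value as the job's exit status.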

Thanks!

kjain · January 31, 2017, 4:46pm

After some trials, I have settled on setting up an EC2 instance for handling the EmrEtlRunner and the StorageLoader tasks. I think, based on my requirements, a t2.medium should be able to handle it.
But for maximum availability/reliability, I think it would be best to host the kinesis-lzo-s3 task on an EB instance (load balanced and with ASG enabled). Considering that this is rather critical (it converts the streams into actual logged data), I think it is the best approach. But I am not sure how to configure the EB environment to automate installing the prerequisites and auto-running the script. There is some documentation about this on AWS (using the .ebextensions folder), but it is not clear to me.

Any help or suggestions would be very appreciated!

Thanks!

kjain · January 31, 2017, 10:57pm

I have successfully managed to run the kinesis-s3 module through an EB deploy (within a VPC private subnet)! I loosely followed the process outlined for deploying EB for the collector and added a config file to do the following:

install the lzo libraries
download the kinesis jar from a repo
copy my config file
run the jar file.
What would you suggest as an optimal EC2 instance type for this module (for a medium traffic site)?
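
The four steps listed above can be sketched as a small bootstrap script. This is not the .ebextensions config file itself, just the same sequence written in Python for illustration. Everything concrete here is an assumption: the jar URL, the file paths, the yum package name, and the `--config` flag passed to the jar.

```python
import shutil
import subprocess
import urllib.request
from pathlib import Path
from typing import List

# All values below are hypothetical placeholders, not real Snowplow locations.
JAR_URL = "https://example.com/repo/kinesis-s3-sink.jar"
JAR_PATH = Path("/opt/sink/kinesis-s3-sink.jar")
CONFIG_SRC = Path("sink.conf")
CONFIG_DST = Path("/opt/sink/sink.conf")
# Assumes a yum-based image and that the package is simply called "lzo".
LZO_INSTALL_CMD = ["yum", "install", "-y", "lzo"]


def sink_command(jar: Path, config: Path) -> List[str]:
    """Build the command line that starts the sink (flag name assumed)."""
    return ["java", "-jar", str(jar), "--config", str(config)]


def bootstrap() -> None:
    """Perform the four steps in order, failing fast on any error."""
    # 1. Install the lzo libraries.
    subprocess.check_call(LZO_INSTALL_CMD)
    # 2. Download the kinesis jar from a repo.
    JAR_PATH.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(JAR_URL, str(JAR_PATH))
    # 3. Copy the config file next to the jar.
    shutil.copy(str(CONFIG_SRC), str(CONFIG_DST))
    # 4. Run the jar; this call blocks for as long as the sink is running.
    subprocess.check_call(sink_command(JAR_PATH, CONFIG_DST))
```

Calling `bootstrap()` at instance start would reproduce the sequence; because step 4 blocks, a real deployment would likely hand the final command to a process supervisor rather than run it inline.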

Thanks!
