This is my first time setting up the Snowplow architecture. After a few days of testing/trying/failing and a lot of patience, I finally have a running stack! The whole experience taught me a lot about its architecture and about AWS infrastructure as well. My kudos to the developers for having built such an amazing platform!
I was hoping for some advice on optimal AWS nodes to run the platform. I am currently using the following:
- Scala Collector (on EB m1.small and ELB)
- Kinesis streams (1 shard for each stream)
- Kinesis-s3 Sink (on t2.medium instance)
- EmrEtlRunner (on t2.medium)
- EMR (m1.medium)
- StorageLoader (t2.medium)
- Redshift DB
All are set up inside a two-subnet VPC (where relevant, since I cannot control the EMR clusters’ deployments).
This is working in a test capacity, and for that purpose I have Kinesis-s3 Sink, EmrEtlRunner and StorageLoader running on the same t2.medium machine.
So I was wondering what your recommendations would be for specs regarding the above for a medium-traffic site (1M to 1.5M events/month)?
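For scale, here is a rough back-of-envelope sketch (in Python) of what that volume works out to as an average rate. It assumes a 30-day month and perfectly even traffic, which real traffic will not be, so peaks will be higher. The per-shard write limit of 1,000 records per second is the figure from the Kinesis documentation; the function and constant names are just made up for illustration.

```python
# Back-of-envelope: convert a monthly event count to an average per-second rate.
# Assumption: a 30-day month and evenly spread traffic (real peaks will be higher).
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

# Documented Kinesis per-shard write limit, in records per second.
SHARD_WRITE_RECORDS_PER_SECOND = 1000


def average_events_per_second(events_per_month: int) -> float:
    """Average event rate, assuming events are spread evenly over the month."""
    return events_per_month / SECONDS_PER_MONTH


if __name__ == "__main__":
    for monthly in (1_000_000, 1_500_000):
        rate = average_events_per_second(monthly)
        share = rate / SHARD_WRITE_RECORDS_PER_SECOND
        print(f"{monthly:,} events/month is about {rate:.2f} events/s on average "
              f"({share:.4%} of one shard's record-rate limit)")
```

Both ends of the range average well under one event per second, so my question is mostly about headroom for traffic peaks and about reliability rather than raw throughput.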
Is there an issue with running all 3 services on the same server? Is it possible to set them up through EB and ASG as well?
Since Kinesis-s3 Sink needs to be continually running (while the other 2 are scheduled), how do you recommend we set this up? Especially considering that Kinesis-s3 Sink is a very critical part of the system…
This setup is for the batch model only. Once this is stable, I will be looking into the real-time processor as well.
Thanks very much!