I have the entire Snowplow pipeline configured and operational, and I am now experimenting with optimal scheduling for the EMR process and storage loader. There aren’t really any guidelines on how often to run these processes, except for some anecdotal recommendations to try once a day and see what works. I’m doing that, but when the job runs it takes a very long time to process all the events, sometimes as long as 1 to 2 days. I have a two-part question. First, should I deviate from the default EC2 instance type (m1.medium) for the EMR job and use an EC2 type with a little more CPU power, or is the issue more related to I/O? Second, if I can speed up this process (or even if I can’t), would it make sense to run this job more frequently? I am using Jenkins, which lets me ensure that only one job runs at a time, so perhaps it would make more sense to attempt the EMR job once every hour or two? I am running the tracker on a very high-traffic site, so naturally it generates a lot of events. Thanks in advance for any advice.
Have you tried supplementing your processing power with spot instances? How many m1.medium instances are you running? You can easily change the instance types in your config file.
As for how often to run the pipeline, that’s your preference; you will likely use the same amount of compute resources either way. Your best bet, in my opinion, is to optimize your batch pipeline for daily processing and throw more instances at it.
Just to add a few more notes:
- m1.medium is almost always sufficient for the master node (which doesn’t do any heavy lifting)
- Multiples of 3 are good for the core nodes
- A couple of our favorite instance types for core for larger jobs are:
- Always prefer small numbers of beefier boxes. You never want, say, 12 or 27 of any box type, because more instances will generate many more files, and you are likely to hit either the Hadoop small files problem or even the Amazon S3 request rate limits
Remember that you get charged for number of instances * instance spec * job duration (per the Normalized Instance Hours approximation), so there is no cost saving of having a weedy box doing your job in 9 hours versus 3 beefier boxes doing your job in 3 hours!
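To make that arithmetic concrete, here is a rough sketch of the formula above. The function name, the spec weights (2 for the smaller box, 8 for the larger one) and the assumption that the job scales linearly with total capacity are all illustrative and not taken from a real bill, so treat the numbers as an example only.

```python
def normalized_instance_hours(instance_count, spec_weight, duration_hours):
    """Approximate cost units: number of instances * instance spec * job duration."""
    return instance_count * spec_weight * duration_hours


# Hypothetical cluster A: 6 small boxes (assumed spec weight 2) running for 9 hours.
small_cluster = normalized_instance_hours(6, 2, 9.0)

# Hypothetical cluster B: 3 boxes with 4x the spec (assumed weight 8).
# Total capacity doubles (24 vs 12), so under the linear-scaling assumption
# the same job takes half as long: 4.5 hours.
big_cluster = normalized_instance_hours(3, 8, 4.5)

print(small_cluster, big_cluster)
```

Under those assumptions both clusters come out at the same number of units, so the bigger boxes simply finish sooner for the same spend. Real jobs rarely scale perfectly linearly, so it is worth measuring on your own data.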
Just to add to the above: currently Snowplow does not support adding EBS volumes to EC2 instances used in EMR. That means that you can’t use the latest generation (e.g. c4 or m4 instance types) as core instance types, because they won’t have sufficient disk to store the data on the cluster.
This should be fixed in a forthcoming Snowplow release - relevant ticket here: https://github.com/snowplow/snowplow/issues/2950
In the meantime we recommend using c3 instance types in general for core nodes.