Recommended EC2 instances for EMR ETL Runner in 2018

pocin · September 3, 2018, 9:17am

I was wondering what the recommended way is, as of 2018, to set up EmrEtlRunner for enriching events from the Clojure collector. There is a thread from 2016, "Should I use different EC2 instance types for EMR besides the default?", but I believe some things might have changed since.

So what is the current recommended instance type for each of the following?

master node
core instances
task instances

What is the general workflow for figuring out the number of core and task instances, i.e. how do I recognize that my EMR cluster is over- or underpowered?

In my case, the gzipped hourly Tomcat logs on S3 are ~2 MB each on average (maybe about 15k events/hr?). I think this is quite a small amount.

tclass · September 3, 2018, 9:48am

It depends on how often you want to run the EMR job: the more data there is to crunch through, the longer a run takes. Let’s say you want to run it once a day (360k events, which is a very small amount).

master node: The master node only coordinates the cluster, so it can be a pretty small instance; m4.medium or m4.large should be sufficient.
core instances: Prefer faster instances over more instances. I mostly go for 2-3 instances, and m4.large should be enough for your load.
task instances: I wouldn’t use task instances at all. They might make sense if you have TBs of data to crunch through, but I have never used them.
underpowered/overpowered: If your cluster runs take longer than you want, consider using bigger instances; I mostly try to stay within a 1h window. If your cluster only takes 20min, consider smaller instances, but it depends on how fast you need the data. Beware, it’s not really possible to get a run under 10min, because setting up the EMR cluster itself already takes 5-10min and there’s not really a way to speed that up afaik.
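
To make the rule of thumb above concrete, here is a rough Python sketch. The function names and the exact thresholds (a 60min target and a 20min "too fast" cutoff) are just my reading of the numbers in this post, not anything EmrEtlRunner itself provides.

```python
# Rough sizing helper based on the rule of thumb in this thread.
# All names and thresholds are illustrative assumptions, not part of EmrEtlRunner.

EVENTS_PER_HOUR = 15_000  # rough estimate from the original question


def events_per_run(hours_between_runs: int,
                   events_per_hour: int = EVENTS_PER_HOUR) -> int:
    """Estimate how many events a single EMR run has to process."""
    return hours_between_runs * events_per_hour


def sizing_hint(runtime_minutes: float,
                target_minutes: float = 60,
                small_minutes: float = 20) -> str:
    """Suggest a direction for resizing given the last run's wall-clock time."""
    if runtime_minutes > target_minutes:
        return "consider bigger instances"
    if runtime_minutes <= small_minutes:
        # Cluster setup alone takes roughly 5-10 min, so very short runs
        # are mostly overhead and the cluster is probably oversized.
        return "consider smaller instances"
    return "leave as is"


if __name__ == "__main__":
    print(events_per_run(24))  # one run per day -> 360000
    print(sizing_hint(95))     # run took 95 min
```

Feed it the wall-clock time of your last few runs; if the answer keeps coming back the same way, that is the signal to resize.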

pocin · September 4, 2018, 10:15am

Thanks, that helps a lot!
