Increasing EMR Speed


#1

Hi Snowplowers,

I don’t have a lot of hits on my batch pipeline. I’m using m1.medium for my EMR job, which runs daily. It still takes a good 20 minutes.

  master_instance_type: m1.medium
  core_instance_count: 2
  core_instance_type: m1.medium

Do you suggest moving up to m1.large to increase processing speed? I can’t use t2, can I?

Thanks
Joao Correia


#2

Joao,

  1. As a rule of thumb, using the latest suitable mN generation (largest N possible) gets you the best performance/cost ratio. Why not an m4.large, for instance? Given the short run, spot instances are a good option as well.
  2. Can you try launching your cluster with Ganglia? It would show you the usual culprits (bottlenecks).
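
As a rough sketch of point 1, the instance settings from the original post might be changed along these lines. Only the three keys quoted in the question are used. Keeping the master on m4.large and the core count at 2 is an assumption for illustration, not a tested recommendation.

```yaml
# Sketch only: same keys as in the question, with m1.medium swapped for m4.large.
master_instance_type: m4.large   # assumption: master moved to the same newer generation
core_instance_count: 2           # unchanged from the original config
core_instance_type: m4.large     # suggested replacement for m1.medium
```

Whether this shortens the run noticeably depends on where the time actually goes, which is what the Ganglia check in point 2 is meant to show.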

#3

Unfortunately, bootstrapping the cluster already takes 5-10 min most of the time. In my experience you can’t really get the whole pipeline to run in under 10-15 min, even if you choose bigger machines. It seems reasonable to me that if you want real-time data you have to use the real-time pipeline, and for the rest you have to wait at least ~20 min.


#4

@tclass Yup, that’s why we have 2 pipelines set up as well: realtime and batch. Even if you want to process one event in batch, it’ll take 15-20 mins to complete all the steps in EMR. We run batch hourly, and in realtime we load to Redshift every 60 seconds.