How long is a reasonable run time for EmrEtlRunner?

cdimitroulas · May 10, 2017, 7:18pm

I have just run EmrEtlRunner locally for the first time as a test, and the full run is taking longer than 1 hour to complete all the job steps on EMR. Is this normal?

What kind of average run time can I expect from the runner?

All the best,
Christos

ihor · May 10, 2017, 9:40pm

@cdimitroulas,

The job could run from something like 20 minutes to a few hours. It all depends on:

the number of events (log files) to process
the size of the EMR cluster

What do you mean by “ran the EmrEtlRunner locally”? It’s expected to run on EC2 as it has to access various AWS services.

On rare occasions, the EMR job might get stuck at some task, in which case the cluster would have to be terminated manually. That’s really an extreme case though.

cdimitroulas · May 11, 2017, 8:40am

Sure, I will eventually run it on EC2, but I ran it locally with AWS credentials, which gave it access to the necessary AWS services.
It ran with no errors but took 1 hour 10 minutes to complete with only a couple of raw events in the “in” bucket.

I ran the EMR job using an m1.small instance.

I vaguely remember reading a note about the EmrEtlRunner mentioning that some of the files need to be in separate S3 buckets (not just separate ‘folders’), otherwise the runner will have problems. Is this true, and if so, which raw/enriched/shredded files need to be in a separate bucket?

tclass · May 11, 2017, 2:55pm

Around 1h should be fine. Setting up the machines (bootstrapping) alone sometimes takes 5 minutes, and depending on how fast the machines in your cluster are, the rest can take some time.

leon · May 12, 2017, 9:41am

Hi @cdimitroulas,

I also think an hour or so should be fine. You can speed it up by using faster instances, but this will of course increase cost. It all depends on how fast you want to process the data. If you want to run it hourly, I would recommend trying to get the EMR job under half an hour to allow for the data load.

Another thing to consider is that AWS charges by the hour. So a slower instance is cheaper per hour but if it needs e.g. one hour and ten minutes to complete you’re still paying for two hours.
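To make the rounding concrete, here is a minimal sketch of the arithmetic, assuming every started hour is billed in full as described above. The hourly rates and the 25-minute run time are made-up numbers for illustration, not real AWS prices or measurements.

```python
import math


def billed_hours(run_minutes):
    """Number of hours charged when every started hour is billed in full."""
    return math.ceil(run_minutes / 60)


def billed_cost(run_minutes, hourly_rate):
    """Total charge for one run at the given (hypothetical) hourly rate."""
    return billed_hours(run_minutes) * hourly_rate


# Hypothetical rates: a slow instance at 0.05/hour, a faster one at 0.09/hour.
slow_run = billed_cost(70, 0.05)  # 1h10m is billed as 2 hours
fast_run = billed_cost(25, 0.09)  # 25 minutes is billed as 1 hour

print(f"slow: {billed_hours(70)} billed hours, cost {slow_run:.2f}")
print(f"fast: {billed_hours(25)} billed hours, cost {fast_run:.2f}")
```

With these made-up numbers the faster instance ends up slightly cheaper per run despite the higher hourly rate, which is why it is worth checking how close a run lands to the hour boundary.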

It requires a bit of trial and error in the beginning, and it’s very difficult to accurately predict a run time from the number of events. While a larger number of events will take longer, custom events can have quite some influence on the time the Shredding step takes.

In regards to your question about the separate buckets, is this what you mean:

do not put your raw:processing inside your raw:in bucket, or your enriched:good inside your raw:processing, or you will create circular references which EmrEtlRunner cannot resolve when moving files.
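If it helps, that rule can be checked before a run with a few lines of Python. This is only a rough sketch: the bucket names and the `locations` mapping are hypothetical, and it simply compares the paths as strings rather than reading your actual EmrEtlRunner config.

```python
from itertools import permutations


def is_inside(inner, outer):
    """True if S3 location `inner` is the same as, or nested under, `outer`."""
    # Normalise to a trailing slash so "s3://b/in2" is not seen as inside "s3://b/in".
    inner = inner.rstrip("/") + "/"
    outer = outer.rstrip("/") + "/"
    return inner.startswith(outer)


def nested_pairs(locations):
    """Return (inner_name, outer_name) for every location nested in another."""
    return [
        (a, b)
        for a, b in permutations(locations, 2)
        if is_inside(locations[a], locations[b])
    ]


# Hypothetical layout: raw:processing has been placed inside raw:in by mistake.
locations = {
    "raw:in": "s3://my-bucket/in",
    "raw:processing": "s3://my-bucket/in/processing",
    "enriched:good": "s3://my-bucket/enriched/good",
}

for inner, outer in nested_pairs(locations):
    print(f"{inner} is nested inside {outer}")
```

Note that separate prefixes within one bucket pass this check; it only flags one location sitting underneath another.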
