Long running time of EmrEtlRunner for very few events


I want to understand a bit more about EmrEtlRunner. We have two Snowplow environments: one for production and one for development. We use the second one to try out new events, test upgrades, and so on.
We don't run EmrEtlRunner on the development environment every day; we run it only when we have new events to try out, which could be once a week or once a fortnight. The number of events captured is very small, no more than 100 or so.
Even so, EmrEtlRunner takes around 9-10 hours to complete. In contrast, EmrEtlRunner for the production environment takes 3-4 hours while processing millions of events per day.

My guess is that the time taken by EmrEtlRunner depends on the number of files to be processed, not only on the number of events. Is my guess correct, and what can be done to reduce the run time?



Hello @jimy2004king

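That is a known Hadoop problem. Hadoop is much better suited to lots of big files than to any number of small files, so your guess is right: the run time depends on the number of files and not only on the number of events. You can find out more in our post on the topic. It's a bit old, but I think everything said there is still valid.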
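
To illustrate the general idea of consolidating many small files into fewer large ones before processing, here is a minimal Python sketch. It is not what EmrEtlRunner does internally and not a Snowplow API. The function names `plan_batches` and `merge_small_files`, the `merged-NNNNN.log` output naming, and the 128 MB default target are all hypothetical choices for illustration. It also assumes the inputs are uncompressed, line-oriented files in a local directory, which may not match your actual raw log format.

```python
from pathlib import Path


def plan_batches(sizes, target_bytes):
    """Group (name, size) pairs, in order, into batches of about target_bytes.

    A batch is closed as soon as adding the next file would exceed the
    target. A single file larger than the target gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in sizes:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


def merge_small_files(src_dir, dst_dir, target_bytes=128 * 1024 * 1024):
    """Concatenate the files in src_dir into fewer, larger files in dst_dir.

    Assumption: inputs are plain line-oriented files, so byte-level
    concatenation (plus a newline where one is missing) keeps records intact.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in src.iterdir() if p.is_file())
    batches = plan_batches([(p, p.stat().st_size) for p in files], target_bytes)

    merged = []
    for i, batch in enumerate(batches):
        out = dst / f"merged-{i:05d}.log"  # hypothetical naming scheme
        with out.open("wb") as writer:
            for path in batch:
                data = path.read_bytes()
                writer.write(data)
                # Keep the last record of one file from running into the
                # first record of the next.
                if data and not data.endswith(b"\n"):
                    writer.write(b"\n")
        merged.append(out)
    return merged
```

Calling `merge_small_files("raw-logs", "merged-logs")` on a directory of many tiny files would leave a handful of larger files in `merged-logs`, so a downstream job has far fewer files to open. On EMR the same consolidation is usually done on S3 rather than locally, so treat this only as a picture of the technique.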