Learnings from using the new Spark EMR Jobs

@BenFradet @alex @rbolkey

We just got incredible results from splitting the log files so the job fully utilizes the cluster!

We were able to enrich 2.3 GB of compressed logs in 7 minutes using 72 similar-sized files, versus 2h11min for the original 24 files. My guess is that both splitting into more files and equalizing the file sizes (the collector generates far more logs during the day than at night, so the original files vary a lot in size) played an important part in this speedup.

Here’s the bash script that does this pre-processing, if anyone is interested in trying it out: https://gist.github.com/bernardosrulzon/f426a7290ad3ebacd6dfee11bb523874#file-snowplow-process-before-run-sh-L12-L14
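If you don’t want to dig through the gist, the core idea is roughly the sketch below (not the exact script from the gist — the directory names and chunk size are placeholders, and it assumes GNU coreutils for `split --filter`): decompress all the logs into one stream, then re-split that stream into similarly-sized gzipped chunks before kicking off the EMR run.

```bash
#!/bin/bash
set -euo pipefail

IN_DIR="collector-logs"   # hypothetical: directory holding the original gzipped collector logs
OUT_DIR="processing"      # hypothetical: where the equal-sized files go
CHUNK_SIZE="32M"          # tune so you end up with a few files per executor core

mkdir -p "$OUT_DIR"

# Decompress everything into a single stream, re-split it into chunks of at most
# CHUNK_SIZE bytes (without breaking lines), and gzip each chunk as it is written.
zcat "$IN_DIR"/*.gz \
  | split -C "$CHUNK_SIZE" -d -a 3 --filter='gzip > "$FILE.gz"' - "$OUT_DIR/part-"
```

The point is simply that the number and size of the output files, not the original collector rotation schedule, determine how evenly Spark can spread the work across executors.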

Cheers!
Bernardo
