Learnings from using the new Spark EMR Jobs

@BenFradet @alex @rbolkey

We just got incredible results from splitting the log files so the job fully utilizes the cluster!

We were able to enrich 2.3 GB of compressed logs in 7 minutes using 72 similar-sized files, versus 2h11min for the original 24 files. My guess is that both splitting into more files and equalizing the file sizes (the collector generates far more logs during the day than at night, so the original files vary a lot in size) played an important part in this speedup.

Here’s the bash script that does this pre-processing, if anyone is interested in trying it out: https://gist.github.com/bernardosrulzon/f426a7290ad3ebacd6dfee11bb523874#file-snowplow-process-before-run-sh-L12-L14
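If you don’t want to dig through the gist, the core idea is roughly the sketch below (not the exact script from the gist — the directory names and chunk size are placeholders, and it assumes GNU coreutils for `split --filter`): decompress all the logs into one stream, then re-split that stream into similarly-sized gzipped chunks before kicking off the EMR run.

```bash
#!/bin/bash
set -euo pipefail

IN_DIR="collector-logs"   # hypothetical: directory holding the original gzipped collector logs
OUT_DIR="processing"      # hypothetical: where the equal-sized files go
CHUNK_SIZE="32M"          # tune so you end up with a few files per executor core

mkdir -p "$OUT_DIR"

# Decompress everything into a single stream, re-split it into chunks of at most
# CHUNK_SIZE bytes (without breaking lines), and gzip each chunk as it is written.
zcat "$IN_DIR"/*.gz \
  | split -C "$CHUNK_SIZE" -d -a 3 --filter='gzip > "$FILE.gz"' - "$OUT_DIR/part-"
```

The point is simply that the number and size of the output files, not the original collector rotation schedule, determine how evenly Spark can spread the work across executors.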

Cheers!
Bernardo
