Here’s a good method to start tuning your cluster.
For pipelines running both enrich and shred, we'll size the cluster against the volume of the enriched archive, since the shred job will be the most resource-hungry.
Let's say that, on average, the enriched runs in the archive clock in at around 2GB of gzip-compressed files. Assuming a compression factor of around 10, that gives us 20GB of raw data in HDFS.
The default HDFS block size on EMR is 128MB, so 20GB works out to 160 blocks. Since there is a direct mapping between the number of HDFS blocks to read and the number of Spark tasks, we need a parallelism of at least 160 to read everything in parallel.
Usually, we will want between 2 and 4 tasks per core. As a result, we need between 40 and 80 cores for our cluster. Let's go with 40 cores.
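The sizing chain above can be sketched as a quick back-of-the-envelope calculation (the 2GB archive size and 10x compression factor are the example figures from this post, not universal constants):

```python
# Back-of-the-envelope cluster sizing, following the reasoning above.
GZIP_ARCHIVE_GB = 2        # enriched archive per run, gzip-compressed
COMPRESSION_FACTOR = 10    # assumed gzip compression ratio
HDFS_BLOCK_MB = 128        # EMR default HDFS block size

raw_gb = GZIP_ARCHIVE_GB * COMPRESSION_FACTOR      # 20GB of raw data
blocks = raw_gb * 1024 // HDFS_BLOCK_MB            # 160 HDFS blocks

# One Spark task per HDFS block; aim for 2 to 4 tasks per core.
min_cores = blocks // 4    # 40 cores at 4 tasks per core
max_cores = blocks // 2    # 80 cores at 2 tasks per core

print(raw_gb, blocks, min_cores, max_cores)  # 20 160 40 80
```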
At this point, we have to choose the instance type. There are two main choices: c4 (with roughly 1.9GB of memory per core) or r4 (with roughly 7.6GB per core). In our experience, c4 simply doesn't cut it memory-wise, so we'll go with r4.
We'll pick 6 r4.2xlarge instances, which gives us 48 cores. After that you can refer to the spreadsheet; however, be careful: EMR doesn't make all of a box's memory available to YARN and actually holds a fair amount back (e.g. it only makes 23GB available on an r4.xlarge box, which has 30.5GB of memory). That's why we're specifying yarn.scheduler.maximum-allocation-mb below: the amount of RAM available (61,440MB) minus 3,584MB for everything that is not Spark (the OS, the datanode, etc.).
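As a sketch, this setting can be passed to the cluster via EMR's configuration JSON under the `yarn-site` classification (61,440 - 3,584 = 57,856MB; we also set `yarn.nodemanager.resource.memory-mb` to the same value here, which is our usual practice rather than something stated above):

```json
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.scheduler.maximum-allocation-mb": "57856",
      "yarn.nodemanager.resource.memory-mb": "57856"
    }
  }
]
```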
Note that, in the spreadsheet, we further tuned the memory overhead coefficient to 0.15 (which gives better results in our experience) and the parallelism per core to 4, as discussed above.
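To illustrate what the 0.15 memory overhead coefficient does, here is a spreadsheet-style executor memory calculation. It reuses the 61,440MB minus 3,584MB figure from above; the one-executor-per-node layout is an assumption made for this sketch, not a recommendation from the spreadsheet:

```python
# Sketch of the executor memory split using the tuned
# memory overhead coefficient of 0.15 discussed above.
yarn_mb_per_node = 61440 - 3584     # RAM minus non-Spark overhead (see above)
executors_per_node = 1              # assumption for this sketch
overhead_coefficient = 0.15         # tuned value from our experience

total_per_executor = yarn_mb_per_node // executors_per_node
overhead_mb = int(total_per_executor * overhead_coefficient)
executor_memory_mb = total_per_executor - overhead_mb

# executor_memory_mb would feed spark.executor.memory,
# overhead_mb spark.yarn.executor.memoryOverhead
print(executor_memory_mb, overhead_mb)  # 49178 8678
```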
Once we have this baseline configuration running, we can further tune according to the information we gather while monitoring the job.
A final note on gzip: Spark shouldn't have to pick up gzip files directly, as the format is not splittable, i.e. a 10GB gzipped file will be processed by a single core. As a result, you need to either convert such files to LZO or uncompress them. In the upcoming R97, we have modified the s3-dist-cp step that moves the raw files collected by the Clojure collector (which are gzipped) to HDFS so that it uncompresses them on the way.
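For reference, s3-dist-cp can do this decompression itself via its `--outputCodec` option, which accepts `none`. A minimal sketch in that spirit (bucket names and paths are placeholders, and this is not the exact R97 invocation):

```
s3-dist-cp \
  --src  s3://my-raw-bucket/processing/ \
  --dest hdfs:///local/snowplow/raw-events/ \
  --outputCodec none
```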
Hope it helps.