I don't think the new Spark EMR pipeline is fully utilizing the EMR cluster. I spun up a fairly large cluster (6 core nodes and 25 task nodes), but the YARN Resource Manager, the Spark User History console on the cluster, and even EC2 monitoring all show very few of the nodes being used during a step.
Digging in a little more:
- The jobs have no EMR configuration (the JSON body is empty). I'm not sure whether this is expected or an issue on our side.
- Consequently, the maximizeResourceAllocation option is not set. It seems like it ought to be, but I'm not an expert here. Documented here: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html.
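For reference, here is a minimal sketch of what a non-empty configuration body could look like with that option turned on. The `spark` classification and the `maximizeResourceAllocation` property come from the AWS page linked above. The helper name `build_emr_configurations` is made up for illustration, and I'm assuming the pipeline would accept a standard EMR configurations list (the same shape boto3's `run_job_flow` takes in its `Configurations` parameter); I don't know how the pipeline actually passes it through.

```python
import json


def build_emr_configurations(maximize: bool = True) -> list:
    """Return an EMR configurations list for the "spark" classification.

    Hypothetical helper: the name and the boolean argument are illustrative only.
    EMR expects property values as strings, hence "true" / "false".
    """
    return [
        {
            "Classification": "spark",
            "Properties": {
                "maximizeResourceAllocation": "true" if maximize else "false",
            },
        }
    ]


# Serialize to the JSON body that is currently empty on the jobs.
configurations = build_emr_configurations()
print(json.dumps(configurations, indent=2))
```

If the pipeline only accepts raw JSON, the printed output is the part that would go in the configuration body.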
On a per-node level, I'm running r4.2xlarges (60 GB / 8 cores), but when I look at the Spark User History Console, I see:
- spark.executor.memory is set to 5120M
- spark.executor.cores is set to 4.
Both of those values are underutilizing the hardware.
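To put rough numbers on that, here is a back-of-the-envelope sketch of how many executors of this size would fit on one node. It rests on several assumptions that I have not verified on this cluster: that all 60 GB and 8 cores are available to YARN (in practice some is reserved for the OS and daemons), that the scheduler counts both memory and cores, and that each executor adds Spark's default memory overhead of 10% with a 384 MB floor. The function name is made up for illustration.

```python
def executors_per_node(node_mem_mb, node_cores, executor_mem_mb, executor_cores,
                       overhead_percent=10, min_overhead_mb=384):
    """Estimate executors per node and the share of node memory they use.

    Assumption: every MB and core on the node is schedulable, and both
    memory and cores constrain placement. Real clusters reserve some of each.
    """
    # Per-executor container size = heap + off-heap overhead (assumed defaults).
    overhead_mb = max(min_overhead_mb, executor_mem_mb * overhead_percent // 100)
    container_mb = executor_mem_mb + overhead_mb

    fit_by_memory = node_mem_mb // container_mb
    fit_by_cores = node_cores // executor_cores
    count = min(fit_by_memory, fit_by_cores)

    return {
        "container_mb": container_mb,
        "fit_by_memory": fit_by_memory,
        "fit_by_cores": fit_by_cores,
        "executors": count,
        "memory_fraction_used": count * container_mb / node_mem_mb,
    }


# Observed settings: 5120M and 4 cores per executor on a 60 GB / 8 core node.
estimate = executors_per_node(60 * 1024, 8, 5120, 4)
print(estimate)
```

Under those assumptions, cores are the limiting factor: only two 4-core executors fit on a node, so most of the node's memory would sit idle even if every node were scheduled.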
Is this a misconfiguration on my part, or a bug?