@trung, are you using Spark enrichment or Hadoop? We have switched to Spark starting from R89 release. The Spark framework differs from Hadoop in that it does most of the processing in memory while Hadoop relies on writing temp data to HDFS. Thus, Spark is more productive as it eliminates “costly” writing to the disk.
In other words, for Spark configuration, you would want to use instances like
r4 that comes with a higher volume of memory. This, however, is not enough for high volume data processing. Spark requires tuning to ensure the resources are used to their maximum (not idle unnecessarily).
It’s hard to know for sure what the best configuration would be just knowing the volume of events. No event is the same; lots depends on the volume and complexity of self-describing events. We have a rough correlation between the size of enriched files and the EMR cluster/Spark configuration. If you provide what the typical payload size of compresed/uncompressed files is we can provide a better-adjusted configuration.
For now, if we simply replace your 2x
c4.large for 1x
r4.xlarge the Spark configuration will be as below
. . .
You can read more about Spark configuration in this post: Learnings from using the new Spark EMR Jobs. In particular, the following spreadsheet could be used to find the optimal configuration: https://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/.