Processing a big file in EMR or split it up?


#1

Hey,
I have a file that is 28 GB and contains 23 million events. I’m not sure whether it’s better to import the complete file and let EMR split it up internally, or to create chunks of, say, 30k events per file and import those.

Does anyone have experience with which approach causes the fewest problems?
Does anyone know which instances can handle that workload? I was thinking of 4-5 m4.4xlarge instances.


#2

@tclass,

If you are on the latest batch releases (R97+), you should be able to get by with 1 x r4.16xlarge core instance. You might need to add ~640 GB of EBS storage and use the following Spark configuration:

    configuration:
      yarn-site:
        yarn.nodemanager.vmem-check-enabled: "false"
        yarn.nodemanager.resource.memory-mb: "494592"
        yarn.scheduler.maximum-allocation-mb: "494592"
      spark:
        maximizeResourceAllocation: "false"
      spark-defaults:
        spark.dynamicAllocation.enabled: "false"
        spark.executor.instances: "20"
        spark.yarn.executor.memoryOverhead: "3072"
        spark.executor.memory: "20G"
        spark.executor.cores: "3"
        spark.yarn.driver.memoryOverhead: "3072"
        spark.driver.memory: "20G"
        spark.driver.cores: "3"
        spark.default.parallelism: "240"
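The numbers in that configuration fit together: each executor container is its heap plus the YARN overhead, and 20 executors plus one driver container exactly fill the NodeManager memory, while the default parallelism is 4x the total executor cores. A quick sketch to sanity-check the arithmetic (values copied from the config above):

```python
# Sanity-check the Spark sizing in the config above.
executor_heap_mb = 20 * 1024              # spark.executor.memory = 20G
executor_overhead_mb = 3072               # spark.yarn.executor.memoryOverhead
executors = 20                            # spark.executor.instances
driver_container_mb = 20 * 1024 + 3072    # the driver gets a container too

container_mb = executor_heap_mb + executor_overhead_mb   # 23552 MB per executor
total_mb = executors * container_mb + driver_container_mb
print(total_mb)   # 494592 -> matches yarn.nodemanager.resource.memory-mb

cores = executors * 3                     # spark.executor.cores = 3 -> 60 cores
print(cores * 4)  # 240 -> spark.default.parallelism (4x total cores)
```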

An m4.xlarge is sufficient for the master instance.


#3

@tclass

Here’s our experience: we’re typically able to process 10 GB of data in an hour with 4 r3.8xlarge core nodes. We haven’t really tried r4s with EBS, since r3s have instance storage. I haven’t measured the performance impact of instance storage vs. EBS, but it could be negligible, and there would be advantages to using a beefier box like an r4.16xlarge.

Our “master” node is an m1.medium. It does nothing but run the resource manager, so there’s no reason to make it any larger than that.

A single 28 GB file would definitely be a BAD idea as far as efficiency. For Spark jobs, the sweet spot is between 100 MB and 1 GB per file; we produce ~70 MB files. You want enough files to make full use of the parallelism in Spark.
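If you need to chunk the big file yourself, the main thing is to split on event (line) boundaries rather than at raw byte offsets. A minimal sketch, assuming a newline-delimited events file (the file name and 512 MB target are illustrative, not from the thread):

```python
import os

def split_file(path, out_dir, target_bytes=512 * 1024 * 1024):
    """Split a newline-delimited file into ~target_bytes parts, keeping whole lines."""
    os.makedirs(out_dir, exist_ok=True)
    part, written, out = 0, 0, None
    with open(path, "rb") as src:
        for line in src:
            # Start a new part file when the current one reaches the target size.
            if out is None or written >= target_bytes:
                if out:
                    out.close()
                out = open(os.path.join(out_dir, f"part-{part:05d}"), "wb")
                part, written = part + 1, 0
            out.write(line)
            written += len(line)
    if out:
        out.close()

# e.g. split_file("events.tsv", "parts/")  # hypothetical input file
```

GNU coreutils can do the same thing from the shell (`split --line-bytes=512M events.tsv part-`), which avoids the round-trip through Python for very large files.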

I’d be curious to hear others’ data points (particularly with r4.16xlarge instances).

Thanks
Rick