Hi,
After bumping the Shredder and RDBLoader versions to 1.0.0 in our codebase, we triggered these apps to shred and load 14 million objects (totalling 15 GB of data) into Redshift. One of the runs is exceptionally large, at 3.7 GB and nearly 4.3 million objects. We used a single R5.12xlarge instance on EMR with the following configuration for the shredding job:
"configurations": [
{
"classification": "spark",
"configurations": [],
"properties": {
"maximizeResourceAllocation": "false"
}
},
{
"classification": "spark-defaults",
"configurations": [],
"properties": {
"spark.driver.maxResultSize": "0",
"spark.default.parallelism": "80",
"spark.driver.cores": "5",
"spark.driver.memory": "37G",
"spark.dynamicAllocation.enabled": "false",
"spark.executor.cores": "5",
"spark.executor.instances": "8",
"spark.executor.memory": "37G",
"spark.yarn.driver.memoryOverhead": "5G",
"spark.yarn.executor.memoryOverhead": "5G"
}
},
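For reference, here is a rough back-of-the-envelope of how this allocation maps onto the instance (a sketch only; the 48 vCPUs / 384 GiB figures for r5.12xlarge are our assumption, and YARN normally gets somewhat less than the full instance memory):

# Rough resource check for the Spark config above on a single r5.12xlarge node.
# Assumption: r5.12xlarge offers 48 vCPUs and 384 GiB of RAM.
instance_vcpus = 48
instance_mem_gib = 384

executor_mem = 37 + 5      # spark.executor.memory + spark.yarn.executor.memoryOverhead
driver_mem = 37 + 5        # spark.driver.memory + spark.yarn.driver.memoryOverhead
executors = 8

total_mem = executors * executor_mem + driver_mem   # 8 * 42 + 42 = 378 GiB
total_cores = executors * 5 + 5                     # 45 cores

print(f"Requested memory: {total_mem} GiB of ~{instance_mem_gib} GiB on the node")
print(f"Requested cores:  {total_cores} of {instance_vcpus} vCPUs")

So the requested allocation sits very close to the node's capacity, which is why we are wondering whether this configuration is sensible for the data volume described above.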
Unfortunately, the EMR job failed after 29 hours with the following error:
AM Container for appattempt_1621946407901_0002_000001 exited with exitCode: 15
Failing this attempt.Diagnostics: [2021-05-25 23:07:22.389]Exception from container-launch.
Container id: container_1621946407901_0002_01_000001
Exit code: 15
...
ERROR DFSClient: Failed to close file: /var/log/spark/apps/application_1621946407901_0002_1.inprogress with inode: 16493
java.io.IOException: All datanodes [DatanodeInfoWithStorage[11.222.232.28:9866,DS-768926aa-41b9-4e38-acf6-c67a57cf70e1,DISK]] are bad. Aborting...
Besides that, we found another issue in the application log which was a bit concerning:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 45.1 in stage 13147.0 (TID 19308180) can not write to output file: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://****/run=2021-04-30-19-31-38/output=good/vendor=com.snowplowanalytics.snowplow/name=atomic/format=tsv/model=1/part-00045-f677cd81-8444-43cb-9efe-2cd518cec43d.c000.txt.gz
Following up on this issue, we checked the RDBLoader logs and we found many errors similar to this one:
ERROR 2021-05-27 07:40:05.878: Folder [s3://****/run=2021-04-11-10-51-48/] is already loaded at 2021-05-26T12:39:00Z. Aborting the operation, acking the command
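For reference, this is roughly how we are cross-checking which run folders already contain finished shredded output (a minimal sketch; the bucket and prefix names are placeholders for our real, masked values, and we are assuming Shredder 1.0.0 writes a shredding_complete.json marker into each completed run folder, please correct us if that is wrong):

# Sketch: list run= folders under the shredded output prefix and report whether each
# one contains a shredding_complete.json marker (our assumption for Shredder 1.0.0).
import boto3

BUCKET = "my-shredded-bucket"      # placeholder
PREFIX = "shredded/good/"          # placeholder

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Collect the top-level run=... folders.
run_folders = set()
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX, Delimiter="/"):
    for cp in page.get("CommonPrefixes", []):
        if "run=" in cp["Prefix"]:
            run_folders.add(cp["Prefix"])

for folder in sorted(run_folders):
    marker = folder + "shredding_complete.json"
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=marker, MaxKeys=1)
    done = resp.get("KeyCount", 0) > 0
    print(f"{folder} -> {'complete' if done else 'NO completion marker'}")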
It seems that, for some reason, the EMR job is re-shredding files which were already shredded earlier in the same job. Now, my questions are as follows:
- Why does it seem like Spark is redoing part of the job, as if it has no record of having done it before?
- Why does it take so long to shred this amount of data? Could this be related to the previous question?
- How can we avoid the situation described above?
- Is there a place where we can find reasonable Spark configurations and EC2 instance choices for different amounts of data to be shredded?