@grzegorzewald, to process (shred) the compressed enriched files with a total size of 2.2 GB, we would recommend using 1x r4.16xlarge core instance and 1x m4.xlarge master instance. This might require up to 640 GB of EBS storage and 20 Spark executors.
To make it clearer, below is the relevant section of the configuration with the corresponding settings:
. . .
. . .
. . .
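Since the snippet itself is elided above, here is a minimal sketch of what the relevant jobflow and Spark settings in an EmrEtlRunner config.yml could look like. The instance types, EBS size, and executor count come from the recommendation above; everything else (the volume type, the choice of Spark keys beyond the executor count) is an assumption you would tune for your own cluster:

```yaml
# Illustrative sketch only - not the original snippet from this post.
emr:
  jobflow:
    master_instance_type: m4.xlarge
    core_instance_count: 1
    core_instance_type: r4.16xlarge
    core_instance_ebs:
      volume_size: 640          # GB, per the EBS estimate above
      volume_type: gp2          # assumed; pick what suits your workload
      ebs_optimized: true
  configuration:
    spark:
      maximize_resource_allocation: "false"
    spark-defaults:
      spark.dynamicAllocation.enabled: "false"
      spark.executor.instances: "20"   # the 20 executors mentioned above
```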
You can start with this configuration and adjust it depending on its performance. It should be sufficient to process the 2.2 GB of compressed files quickly.
I assume you are talking about a long data load to Redshift. It might happen due to the infamous eventual consistency of AWS S3. This basically means that the AWS S3 API reports that some files are still present while in fact they are not, preventing the data load from commencing.
When it comes to low-volume data, it makes sense to keep retrying the data load after a short while. For this reason, the latest releases have built-in functionality to do that, which can be skipped with the option --skip consistency_check. This means that if accessing files in the S3 bucket fails, the EMR job will also fail at the data load step (no retries to access the shredded files). You would then have to resume your job from the rdb_load step later on (hoping the eventual consistency issue has resolved itself by then).
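For reference, a resume invocation could look like the sketch below; the file names are placeholders, and you should verify that your EmrEtlRunner release supports the --resume-from option:

```bash
# config.yml and resolver.json are placeholder names; use your own files.
./snowplow-emr-etl-runner run \
  --config config.yml \
  --resolver resolver.json \
  --resume-from rdb_load
```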
Additionally, depending on the protocol used in your S3 references (s3a, for example), you might end up with empty files or empty directory placeholders. We recommend deleting those manually from time to time. If not cleaned up, at some point they might cause problems for the ETL process, as S3DistCp would have to scan an ever-accumulating number of files.
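As a starting point, one way to spot those zero-byte artifacts with the AWS CLI is sketched below; the bucket name and prefix are placeholders, and you should review the listing before deleting anything with aws s3 rm:

```bash
# List zero-byte objects (typical s3a directory markers) under a prefix.
# my-snowplow-bucket and enriched/good/ are placeholders.
aws s3api list-objects-v2 \
  --bucket my-snowplow-bucket \
  --prefix enriched/good/ \
  --query 'Contents[?Size==`0`].Key' \
  --output text
```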