We are trying to predict, as accurately as possible, how much disk space the EMR process will need, so that we can improve our current cluster provisioning and forecast our needs for the near future.
We created a script that, before the wave starts, counts the files to be processed and measures their sizes. In short, it sums the decompressed size of every file and saves the total for later comparison/analysis.
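The idea is roughly the following (a minimal Python sketch, not our exact script; it assumes the inputs are gzipped objects in S3, and the bucket/prefix names are placeholders). It relies on the fact that the last four bytes of a gzip member store the uncompressed length modulo 2^32, so a ranged GET is enough to estimate the decompressed size without downloading the whole file:

```python
import struct

import boto3

s3 = boto3.client("s3")


def decompressed_size(bucket, key):
    """Estimate the uncompressed size of a gzipped S3 object.

    The last 4 bytes of a gzip member hold the uncompressed length
    modulo 2**32, so a ranged GET of those bytes is enough. Note this
    under-reports files larger than 4 GB.
    """
    tail = s3.get_object(Bucket=bucket, Key=key, Range="bytes=-4")["Body"].read()
    return struct.unpack("<I", tail)[0]


def total_decompressed_size(bucket, prefix):
    """Sum the estimated decompressed sizes of every object under a prefix."""
    total = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            total += decompressed_size(bucket, obj["Key"])
    return total


# Hypothetical bucket/prefix -- replace with your collector's log location.
print(total_decompressed_size("my-collector-logs", "raw/"))
```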
However, this total was nowhere near the numbers reported by Amazon's EMR monitoring tools, so we started investigating how much space Snowplow actually uses inside HDFS. So far, we have established that Snowplow creates the raw, enriched and shredded directories, and how much each of them consumes inside Hadoop. For instance:
```
2.9 G  hdfs:///local/snowplow/enriched-events
1.1 G  hdfs:///local/snowplow/raw-events
2.9 G  hdfs:///local/snowplow/shredded-events
```
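(For reproducibility, we collect these figures with something like the sketch below, run on the master node; it assumes the `hdfs` CLI is on the PATH there.)

```python
import subprocess


def hdfs_usage_bytes(path):
    """Return the logical size of an HDFS path in bytes, as reported by
    `hdfs dfs -du -s`. This is the pre-replication size; the disk space
    actually consumed is roughly size * replication factor."""
    out = subprocess.check_output(["hdfs", "dfs", "-du", "-s", path], text=True)
    # The first column of the output is the size in bytes.
    return int(out.split()[0])


for d in ("raw-events", "enriched-events", "shredded-events"):
    path = "hdfs:///local/snowplow/" + d
    print(path, hdfs_usage_bytes(path))
```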
However, even after adding the space used by the OS and system files (including the Hadoop JARs), we still cannot reach the value shown in Amazon's EMR monitoring data.
Are there any directories we are missing? Does Snowplow create temporary files on the EMR cluster in some place we have not traced? And during the "Elasticity S3DistCp Step: Shredded HDFS -> S3" step, does Snowplow do anything file-related that consumes additional space before the transfer to S3?
Thank you in advance!