Error on EmrEtlRunner, S3 not empty

I am getting this error:

Snowplow::EmrEtlRunner::DirectoryNotEmptyError (Should not stage files for enrichment, s3://my-super-secret-bucket/enriched/good/ is not empty):

It may be caused by some EMR error; I don't know which one caused it, but I have now configured the cronjob to send mails every time, so next time I will know why the error happened.

My question is very simple: if I erase the “Good” folder, will I lose all the data tracked since that error?

Second question: is it possible to restore the EmrEtlRunner and re-process the non-empty folders?

Hi @Germanaz0,

You would typically get the “DirectoryNotEmptyError” under two circumstances:

  1. The pipeline job was kicked off while the previous run was still in progress
  2. The previous run failed, resulting in the event files not being archived

The former scenario is a legitimate condition. We don’t want any pipeline runs clashing. All that is required is to wait for the currently running job to complete; then you can safely kick off another run from the top.
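If you suspect the first case, a quick way to check whether a run is still in progress is to list active EMR clusters with the AWS CLI. This is just a sketch (it assumes the AWS CLI is installed and configured), not part of the Snowplow tooling:

```bash
# List EMR clusters that are currently starting, bootstrapping, running or waiting.
# If the Snowplow ETL cluster still shows up here, wait for it to finish before
# kicking off another EmrEtlRunner run.
aws emr list-clusters --active
```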

In the latter scenario, you would have to identify at which step the pipeline failed. There are quite a few possible break points to consider. The best approach is to check the logs to identify the failure point. However, you can also determine it by checking whether the following buckets are empty (see the quick check sketched after the list below).

Here are a few scenarios:

  1. Failure at the “staging” step, a problem spinning up the EMR cluster, or a failure while enriching the events:
  • processing is not empty
  • enriched/good is empty
  • shredded/good is empty
  2. Failure during the EMR job after the “enrichment” step (while copying files to S3) or at the “shredding” step:
  • processing is not empty
  • enriched/good is not empty
  • shredded/good is empty
  3. Failure during the EMR job after the “shredding” step (while copying files to S3) or while archiving the raw files:
  • processing is not empty
  • enriched/good is not empty
  • shredded/good is not empty
  4. Failure at the data load step:
  • processing is empty
  • enriched/good is not empty
  • shredded/good is not empty
  5. Failure at the archiving step after the data load:
  • processing is empty
  • enriched/good is either empty or not
  • shredded/good is not empty
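A quick way to perform those checks is to list each bucket with the AWS CLI. The bucket paths below are placeholders; substitute the ones from your own config.yml:

```bash
# An empty listing means that step's output was never produced or has already been archived.
# Replace the bucket/prefix paths with the ones defined in your config.yml.
aws s3 ls s3://my-snowplow-bucket/processing/    --recursive | head
aws s3 ls s3://my-snowplow-bucket/enriched/good/ --recursive | head
aws s3 ls s3://my-snowplow-bucket/shredded/good/ --recursive | head
```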

To understand the recovery steps in each scenario, please refer to the Batch Pipeline Steps wiki page.

Answering your first question:

if I erase the “Good” folder, will I lose all the data tracked since that error?

No, you won’t lose the events as long as the corresponding raw events are still in either your processing bucket or the archived raw events bucket.

Snowplow was designed to be robust and reliable; safeguarding the events is the primary objective. Again, you can refer to the wiki page mentioned above to see how this reliability is achieved. In short, the raw event/log files are moved to the processing bucket. Once the events have been enriched (dimension widening), the processing (raw) events get archived (as do the enriched and shredded events after the data load).
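If you want to confirm the raw events are safely archived before deleting anything, you can list the raw archive bucket. The path below is a placeholder for the archive bucket defined in your config.yml:

```bash
# Check that the raw events for the failed run made it into the archive.
# Replace the path with the raw archive bucket from your config.yml.
aws s3 ls s3://my-snowplow-bucket/archive/raw/ --recursive | tail
```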

Going back to your scenario, provided the previous run did fail, you need to determine whether the job failed at the EMR step or during the data load. I believe the processing bucket would be empty but shredded/good (as well as enriched/good) would not be.

If the failure occurred, say, at the EMR step, you can simply delete enriched/good (and possibly shredded/good) and rerun EmrEtlRunner with the --skip staging option. However, my guess is that the EMR job did complete (including the “archive_raw” step; hence the error refers to the enriched bucket rather than processing). In this case, the failure took place either before the data got loaded or afterwards (during the final archiving step). Therefore, you could rerun StorageLoader either without skipping any step (if it failed before the data was loaded) or with the --skip download,load option to complete the archiving. Rough sketches of both invocations are shown below.
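The binary names and config path in this sketch are illustrative, and depending on your Snowplow version additional options (such as a resolver file) may be required:

```bash
# Illustrative invocations only -- adjust binary names, paths and options to your setup.

# Failure during the EMR job: clear enriched/good (and shredded/good if populated),
# then rerun EmrEtlRunner skipping the staging step that raised the error.
./snowplow-emr-etl-runner --config config.yml --skip staging

# EMR job completed, failure happened later:
./snowplow-storage-loader --config config.yml                       # failed before the data was loaded
./snowplow-storage-loader --config config.yml --skip download,load  # data already loaded; finish archiving
```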

Do check the actual failure reason. You might need to fix the underlying problem before rerunning.

Hopefully, this helps.


Thanks a lot for the helpful reply; what you wrote is almost a guide. Appreciate it.