Processing folder not empty - but no error on the ETL script!


#1

Hi there!

This is a strange one I think.
I manually ran the ETL script, snowplow-emr-etl-runner, to process the logs and prepare them for loading into Redshift. Since I was doing this to process a large backlog and didn't know how long it would take, I decided to run it manually.

After about 8-10 hrs the process completed successfully and I can see the data in etl/processing.

I then manually ran the StorageLoader, and it successfully processed and stored this data.

But then, as I prepared to start another ETL batch, I noticed that the etl > processing bucket still contained all the files from the previous sessions (about 10,000 of them). I thought these should have been moved into the archive bucket, and since I didn’t see any errors, I don’t understand why they are still here.

I checked the data from some of these files against the data stored in Redshift and verified that this matched with the last processed data.

What could be the issue? I could manually move these files over to the archive/raw bucket but it would be great to get some insight into why this may have happened.

Thanks very much!


#2

Hi @kjain,

I wonder what version of the EmrEtlRunner you were running. If I’m not mistaken, some RC (release candidate) versions had this issue. You might need to ensure you are using an official release rather than an RC. The apps can be obtained from here (http://dl.bintray.com/snowplow/snowplow-generic/).


#3

Thanks for your quick reply ihor!
The version I am currently working with has been extracted from:
snowplow_emr_r83_bald_eagle.zip


#4

@kjain, the only thing that comes to my mind is that the EmrEtlRunner failed to archive the raw events for some reason, and this went unnoticed. Archiving is the last step and is done by the Sluice application outside of the EMR cluster (in the r83 release you are using). A failure at that step doesn’t stop you from running StorageLoader, but it does prevent you from starting EmrEtlRunner again.

See the dataflow diagram for more details: https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps

I believe it’s a one-off issue and can be disregarded. If the files from that run are still in the “processing” bucket, you could run the EmrEtlRunner with the --skip staging,emr option to see if the files get archived. Do make sure you do not clash with any subsequent run you might have started already.
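
For reference, here is a minimal Python sketch of assembling that command line. Only the `--skip staging,emr` part comes from the suggestion above; the `--config` and `--resolver` flags, the file names, and the helper name `build_archive_only_command` are assumptions about a typical setup, so check them against the invocation you normally use.

```python
import shlex


def build_archive_only_command(runner="./snowplow-emr-etl-runner",
                               config="config.yml",
                               resolver="iglu_resolver.json",
                               skip=("staging", "emr")):
    """Return the argv list for an EmrEtlRunner run that skips the given steps.

    The runner path and the config/resolver file names are hypothetical
    placeholders; replace them with the values from your own setup.
    """
    return [
        runner,
        "--config", config,        # assumed flag, as used in a typical invocation
        "--resolver", resolver,    # assumed flag, as used in a typical invocation
        "--skip", ",".join(skip),  # from the thread: skip staging and emr
    ]


if __name__ == "__main__":
    argv = build_archive_only_command()
    # Print the command for review instead of running it straight away.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the command first makes it easy to confirm that no other run is in progress before executing it, for example with `subprocess.run(argv, check=True)`.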


#5

Thanks for the advice Ihor!

I guess I’ll just manually move these files over to the archive/raw bucket (since that seems to be the step skipped) and run another ETL batch to check if it happens again.
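
A rough sketch of what that manual move could look like in Python, assuming a boto3-style S3 client is passed in (for example `boto3.client("s3")`). The bucket names, prefixes and function names are hypothetical, and the destination layout is a guess rather than what the archive step itself would have produced. It defaults to a dry run so the plan can be reviewed first.

```python
def archive_key(key, processing_prefix, archive_prefix):
    """Map a key under the processing prefix to the same relative path under the archive prefix."""
    if not key.startswith(processing_prefix):
        raise ValueError("key %r is not under %r" % (key, processing_prefix))
    return archive_prefix + key[len(processing_prefix):]


def move_processing_to_archive(s3, src_bucket, src_prefix,
                               dst_bucket, dst_prefix, dry_run=True):
    """Copy every object under src_prefix to dst_prefix, then delete the original.

    `s3` is expected to behave like a boto3 S3 client. With dry_run=True
    nothing is changed; the planned (source, destination) pairs are returned.
    """
    moved = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket, Prefix=src_prefix):
        # Pages with no matching objects have no "Contents" entry.
        for obj in page.get("Contents", []):
            src_key = obj["Key"]
            dst_key = archive_key(src_key, src_prefix, dst_prefix)
            if not dry_run:
                s3.copy_object(
                    Bucket=dst_bucket,
                    Key=dst_key,
                    CopySource={"Bucket": src_bucket, "Key": src_key},
                )
                # Delete only after the copy call has returned without raising.
                s3.delete_object(Bucket=src_bucket, Key=src_key)
            moved.append((src_key, dst_key))
    return moved
```

Usage would be along the lines of `move_processing_to_archive(boto3.client("s3"), "my-etl-bucket", "processing/", "my-archive-bucket", "raw/")` with placeholder names, followed by a second call with `dry_run=False` once the printed plan looks right.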