This is a strange one, I think.
I manually ran the ETL script (snowplow-emr-etl-runner) to process the logs and prepare them for loading into Redshift. Since I was using it to process a large backlog and didn't know how long it would take, I decided to run it manually.
After about 8-10 hours the process completed successfully, and I can see the data in etl/processing.
I then manually ran the StorageLoader, which successfully processed and stored this data.
But then, as I prepared to start another ETL batch, I noticed that the etl/processing bucket still contained all the files from the previous session (about 10,000 of them). I thought these should have been moved into the archive bucket, and since I didn't see any errors, I don't understand why they are still there.
I checked the data from some of these files against the data stored in Redshift and verified that it matched the last processed data.
What could be the issue? I could manually move these files over to the archive/raw bucket, but it would be great to get some insight into why this may have happened.
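For reference, if I do end up moving the files by hand, a minimal sketch of what I have in mind is below. The bucket paths are placeholders (substitute the actual buckets from your config.yml), and `--dryrun` previews the move without touching anything:

```shell
# Hypothetical bucket paths -- replace with the buckets from your own config.yml
PROCESSING="s3://my-snowplow-bucket/etl/processing/"
ARCHIVE="s3://my-snowplow-bucket/archive/raw/"

# Preview first: --dryrun lists what would be moved without changing anything.
# Drop --dryrun once the output looks right.
if command -v aws >/dev/null 2>&1; then
  aws s3 mv "$PROCESSING" "$ARCHIVE" --recursive --dryrun
else
  echo "aws CLI not found; would run: aws s3 mv $PROCESSING $ARCHIVE --recursive --dryrun"
fi
```

I'd still rather understand why the archive step didn't run than rely on moving things manually.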
Thanks very much!