Snowplow is not staging any logs and not running the EMR jobs


#1

I am trying to run the Snowplow EmrEtlRunner with the CloudFront collector. I can see the raw logs in the relevant S3 bucket, but when I try to run the job with:
./snowplow-emr-etl-runner --config config/c.yml --resolver config/iglu_resolver.json --debug

I just get the following output:

[2017-07-06T09:21:34.135000 #2007] DEBUG -- : Staging raw logs...

And nothing else. When I skip the staging step, EMR does not start up and I just get this output:

D, [2017-07-06T09:22:44.957000 #2325] DEBUG -- : Initializing EMR jobflow

I managed to run the process the first time without any hiccups, but it is refusing to run now. Please help.


#2

Are there any logs in the shredded good folder? Or the enriched good folder?


#3

The enriched good folder has only the files from the previous run. No files from this month are there.


#4

My last resort when it gets really stuck is to do the following. I'm assuming at this point you only have files in the "processing" and "enriched" folders.

  1. Delete all files from enriched good and shredded good.
  2. Do a full run again using skip staging, so it only processes the logs in the processing folder.

It depends on how big your log files are and whether it's worth rerunning from scratch.

You can also try just running the shred step at this point, since your enriched files are intact, as long as you know the enriched files are 100% done.
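The recovery steps above could look roughly like this. This is a sketch, not the official procedure: the bucket name `s3://my-snowplow-pipeline` and the `enriched/good` / `shredded/good` prefixes are placeholders, so substitute the paths from your own `config/c.yml`.

```shell
# Placeholder bucket -- replace with the buckets defined in config/c.yml.
BUCKET=s3://my-snowplow-pipeline

# 1. Clear enriched good and shredded good so the next run starts clean.
aws s3 rm "$BUCKET/enriched/good/" --recursive
aws s3 rm "$BUCKET/shredded/good/" --recursive

# 2. Re-run, skipping staging so only the files already in the
#    processing folder are picked up.
./snowplow-emr-etl-runner --config config/c.yml \
  --resolver config/iglu_resolver.json \
  --skip staging --debug
```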


#5

Hi @masterlittle.

This DAG illustrates the different steps of the Snowplow batch pipeline and the recovery process for each step.

If the pipeline fails during the EMR job, you should determine at which step it happened and take the right actions.

As @mjensen already said: if you want to run the pipeline from the top (without --skip staging), all buckets should be empty (processing, enriched good and shredded good). If you want to skip staging (the files are already present in processing), the enriched and shredded good buckets should be empty for the run.
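A quick pre-flight check along those lines could be scripted as below. The bucket name and prefixes are assumptions taken from the discussion, not from your actual config, so adjust them before use.

```shell
# Placeholder bucket -- use the buckets from your config/c.yml.
BUCKET=s3://my-snowplow-pipeline

# Warn if enriched good or shredded good still contain objects,
# which would block a --skip staging run.
for prefix in enriched/good shredded/good; do
  count=$(aws s3 ls "$BUCKET/$prefix/" --recursive | wc -l)
  if [ "$count" -ne 0 ]; then
    echo "WARNING: $BUCKET/$prefix/ is not empty ($count objects)"
  fi
done
```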

Hope this helps,
Egor


#6

Hi. Thanks for the help. I emptied the buckets and it works fine, with one caveat.

The job runs fine from top to bottom, except at the last stage, when the files need to be moved to the enriched and shredded archive buckets. For some reason this step never takes place. I have to manually copy the files to the archive so that I can run the next job. What could be the issue?

I am not using any additional enrichments or a shredder storage target. I just want my parsed files in an S3 bucket.