My EMR job failed during the enrich step. I tried to resume it, but I had to empty the enrich/good folder first. The process finished, but there are only a few events (e.g. 100 visits instead of 50k) for some dates. I suppose I did something wrong when resuming the enrich process. Is there a way to replay all events starting from some date? I mean, take the raw events and pass them through the whole pipeline again?
@sphinks, here’s how to resume correctly, depending on the dataflow step in which the failure took place and the mode the pipeline runs in: https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps.
It is possible to reprocess, depending on your pipeline architecture. However, it is not clear to me what exactly happened here and what state your batch pipeline is in. Do you run the pipeline in Stream Enrich mode or pure batch? Typically, the pipeline keeps your data for each intermediate stage in dedicated S3 locations (raw, enriched, and shredded), which allows you to resume/reprocess.
@ihor I resumed the failed pipeline job and it finished, but too few rows were loaded into the database. Now EMR jobs are running as expected. I’m using batch mode and want to reprocess all events for a particular date once again, starting from the very beginning of the pipeline (from the raw folder on S3). How can I do it?
@ihor or anyone?
@sphinks, you would need to move the files from the archive:raw bucket to the processing bucket and run EmrEtlRunner with the --skip staging option.
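As a rough illustration of the move step, here is a minimal Python sketch. It is not an official Snowplow tool, and several things in it are assumptions you should check against your own setup: that archived raw files sit under run folders named like `archive/raw/run=YYYY-MM-DD-hh-mm-ss/`, that the processing location expects the files directly under its prefix, and that both locations live in the same bucket. The function names `plan_moves` and `move_objects`, the prefixes, and the example date are all hypothetical. Also note that the `run=` timestamp is when the pipeline ran, not necessarily the date of the events inside. `boto3` is a third-party library and is only imported when you actually perform the move.

```python
from typing import Iterable, List, Tuple


def plan_moves(
    archive_keys: Iterable[str],
    archive_prefix: str,
    processing_prefix: str,
    run_date: str,
) -> List[Tuple[str, str]]:
    """Return (source_key, dest_key) pairs for raw files archived on run_date.

    Assumes (hypothetically) keys like
    '<archive_prefix>run=YYYY-MM-DD-hh-mm-ss/<file>'.
    """
    run_marker = f"{archive_prefix}run={run_date}"
    moves: List[Tuple[str, str]] = []
    seen_destinations = set()
    for key in archive_keys:
        # Skip other runs and "folder" placeholder keys.
        if not key.startswith(run_marker) or key.endswith("/"):
            continue
        filename = key.rsplit("/", 1)[-1]
        dest = processing_prefix + filename
        if dest in seen_destinations:
            # Two runs contain a file with the same name; do not overwrite.
            raise ValueError(f"duplicate destination key: {dest}")
        seen_destinations.add(dest)
        moves.append((key, dest))
    return moves


def move_objects(bucket: str, moves: List[Tuple[str, str]]) -> None:
    """Copy each object to its destination, then delete the source.

    Assumes source and destination are in the same bucket.
    """
    import boto3  # third-party; imported lazily so planning works without it

    s3 = boto3.client("s3")
    for src, dst in moves:
        s3.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": src},
            Key=dst,
        )
        s3.delete_object(Bucket=bucket, Key=src)
```

You would list the keys under your archive prefix, review the output of `plan_moves` for the date you care about, and only then call `move_objects`. After that, run EmrEtlRunner with `--skip staging` as described above, so that it picks up what is already in the processing location.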