Processing logs for a specific time period

Hi guys,

I’m using EmrEtlRunner (Beanstalk/Clojure + S3 + EMR + Redshift) and I had an issue where EMR was running for 2 days (normally takes <20 mins) so I had to terminate it.

Once it was terminated, I ran “snowplow-runner-and-loader.sh” again, but first I had to move files out of the S3 buckets (processing, shredded, enriched) because it throws an error if the folders aren’t empty, which is fine.

Anyway, when it all ran successfully again, I found I was missing a couple of days of data. How would I go about getting that back? I have all the files from processing, shredded and enriched (I didn’t delete anything, just moved it).

Also, I run it 6 times per day - would running it for part of that day cause duplicates in the atomic.events table?

Thanks!

Cheers,
Tim

It should be pretty easy to recover the missing 2 days of data:

  • Wait till the latest run has fully completed
  • Pause the regular schedule of processing
  • Move the raw files that you had to move out of the S3 bucket back into processing
  • Run the pipeline with --skip staging
  • Confirm the pipeline runs through and loads the missing data into Redshift
  • Un-pause the regular schedule of processing
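
For the "move the raw files back into processing" step, it can help to write down exactly which objects go where before touching the bucket. Below is a minimal sketch of that planning step only. The prefixes `backup/processing/` and `etl/processing/` and the file names are hypothetical placeholders, not paths from this thread, and the function does not talk to S3 at all; it just returns (source, destination) pairs that you could feed to whatever copy tool you normally use.

```python
from typing import List, Tuple


def plan_restore(
    keys: List[str], holding_prefix: str, processing_prefix: str
) -> List[Tuple[str, str]]:
    """Map each key under holding_prefix to the same relative path under processing_prefix."""
    moves = []
    for key in keys:
        if not key.startswith(holding_prefix):
            continue  # ignore anything outside the holding area
        relative = key[len(holding_prefix):]
        if not relative or relative.endswith("/"):
            continue  # skip "folder" placeholder keys
        moves.append((key, processing_prefix + relative))
    return moves


if __name__ == "__main__":
    # Hypothetical listing of the location the files were moved to.
    held = [
        "backup/processing/",
        "backup/processing/raw-log-1.gz",
        "backup/processing/raw-log-2.gz",
        "backup/enriched/part-00000",
    ]
    for src, dst in plan_restore(held, "backup/processing/", "etl/processing/"):
        print(f"{src} -> {dst}")
```

Once the files are back in processing, the run itself is the `--skip staging` step from the list above.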

As for running it 6 times per day: I’m not sure I understand the question?

Thank you @alex I’ll give it a go.

Regarding the 6 times per day bit, I meant: if I copied the raw files back into processing for a time period I had already imported, would it cause duplicate rows in Redshift for that time period?

Thank you.

Actually, I just realised those dates are missing from the raw folder as well. They wouldn’t still be on the collector or anywhere else, would they (it’s from a few days ago)? And if so, could you advise how I would pull them?

Hi Tim -

On the duplicates question: currently yes, re-processing raw files for a period you have already imported would load duplicates into Redshift. This will change in the future when we have cross-batch dedupe for Redshift, but this is still a way off.
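
To illustrate what loading the same batch twice looks like, here is a minimal sketch that flags repeated events in a set of rows. It assumes each event carries a unique `event_id` and that rows are available as plain dictionaries; the sample IDs are made up, and this is only a way to spot the problem after the fact, not the cross-batch dedupe mentioned above.

```python
from collections import Counter
from typing import Dict, Iterable, List


def duplicated_event_ids(rows: Iterable[Dict[str, str]]) -> List[str]:
    """Return the event_id values that appear more than once, sorted."""
    counts = Counter(row["event_id"] for row in rows)
    return sorted(event_id for event_id, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical rows: one batch loaded twice, plus one later event.
    first_load = [{"event_id": "e1"}, {"event_id": "e2"}]
    reload_of_same_batch = [dict(row) for row in first_load]
    later_event = [{"event_id": "e3"}]

    table = first_load + reload_of_same_batch + later_event
    print(duplicated_event_ids(table))  # ['e1', 'e2']
```

Running a check like this against only the time period you are about to re-import keeps it cheap.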

On recovering the raw files: afraid not - if staging ran and moved the files from your collectors’ S3 bucket to staging, and you then deleted those files from staging, those events are irretrievably gone.

Thanks @alex, that explains it.