Rerunning logs (new to Snowplow)

Hi - I’m very new to Snowplow. Two weeks ago I didn’t know it existed, but yesterday I successfully completed an AWS setup. There was a lot of trial and error along the way, including several failed EMR runs. (The documentation is fantastic, btw.)

I now need to reprocess all of the Elastic Beanstalk collector logs. I took over the final setup and configuration of Snowplow last week, but the tracker and collector have been running since last June. We have 6+ months of logs that I want to get into our Redshift warehouse. My problem is that I don’t know how to force Snowplow to reprocess all of the Elastic Beanstalk logs going back to last June.

I have tried to make this happen by deleting all of the files in every S3 directory listed in config.yml, other than Raw:In, but despite this EmrEtlRunner continues to process the files from the last successful run. I’ve confirmed that the raw logs from June are still present in the Raw:In S3 bucket for the Elastic Beanstalk logs. I’m not sure what I’m missing, and I haven’t been able to find an answer to this question online.

Can someone please point me in the right direction?

Thanks,

  • Tony

@datawise, I assume you have set up the Clojure collector. I'm not sure which logs you are referring to. The following post might clarify things for you: How does EmrEtlRunner determine what the latest logs are in the raw "in" bucket?

Thanks ihor - That’s exactly what I was looking for. I should have mentioned that we are using the Clojure collector. Much appreciated.