Reprocessing Events from Clojure collector


#1

We are attempting to reprocess events (not bad events) from the Clojure collector, and are running into trouble.

A new S3 bucket has been created and set as the ‘in’ bucket in the EmrEtlRunner config, so that we can reprocess a few months' worth of raw Clojure events from the archive. Now that the Clojure files are being renamed per the R91 release (https://snowplowanalytics.com/blog/2017/08/17/snowplow-r91-stonehenge-released-with-important-bug-fix/), we keep receiving the following error:

ERROR FileFormatWriter: Aborting job null. java.io.IOException: Not a file:

The bucket structure after the staging has been completed looks like the following:

run=2018-08-28-12-00-00
i-1/
  var_log_tomcat8_rotated_localhost_access_log.txt1502463662.gz
i-2/
  var_log_tomcat8_rotated_localhost_access_log.txt1502467262.gz

The ‘in’ bucket setting in the config looks like:

s3n://bucket-name/

Any archive folders that were created prior to the R91 release do not experience this issue. Is this the expected behavior?


#2

@digitaltouch, the problem seems to be due to the extra folder run=2018-08-28-12-00-00.

By the way, you could place your files into the “processing” bucket directly and run EmrEtlRunner with the --skip staging option.


#3

@ihor,
Thanks for the quick reply. I was able to come to that conclusion. It does make it rather difficult to reprocess more than one batch, since this is now the folder structure produced by the archive step. Each batch has to be opened and copied manually into the processing folder.
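
For anyone else hitting this, the manual copy could probably be scripted. Below is a rough Python sketch, not something from the Snowplow tooling: it strips the leading run=.../ folder from each archive key and copies the object to a destination bucket. The function names (strip_run_folder, plan_copies, copy_batches) are made up for illustration, the s3_client argument is assumed to be a boto3 S3 client, and it assumes the run= folders sit at the root of the archive bucket and that the processing location should contain the i-*/ folders without the run= level.

```python
import re

# Matches a leading "run=YYYY-MM-DD-hh-mm-ss/" folder, as in the listing above.
RUN_PREFIX = re.compile(r"^run=\d{4}(?:-\d{2}){5}/")


def strip_run_folder(key):
    """Return the key without its leading run=.../ folder (unchanged if absent)."""
    return RUN_PREFIX.sub("", key, count=1)


def plan_copies(keys, dest_prefix=""):
    """Map each archive key to a destination key, skipping folder placeholders."""
    plan = {}
    for key in keys:
        if key.endswith("/"):
            # Zero-byte "folder" objects carry no event data.
            continue
        plan[key] = dest_prefix + strip_run_folder(key)
    return plan


def copy_batches(s3_client, src_bucket, dest_bucket, dest_prefix=""):
    """Copy every object under run=* folders to the destination bucket.

    s3_client is assumed to be a boto3 S3 client (boto3.client("s3")).
    Bucket names and dest_prefix are placeholders supplied by the caller.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket, Prefix="run="):
        keys = [obj["Key"] for obj in page.get("Contents", [])]
        for src, dest in plan_copies(keys, dest_prefix).items():
            s3_client.copy_object(
                Bucket=dest_bucket,
                Key=dest,
                CopySource={"Bucket": src_bucket, "Key": src},
            )
```

Note that two batches containing the same instance folder and file name would overwrite each other at the destination. The file names in the listing above include a timestamp, so this seems unlikely, but it is worth checking the output of plan_copies before copying anything.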