Reprocessing Events from Clojure collector

digitaltouch · August 30, 2018, 3:11pm

We are attempting to reprocess events (not bad events) from the Clojure collector, and are running into trouble.

We created a new S3 bucket and set it as the ‘in’ bucket in the EmrEtlRunner config, so that we can reprocess a few months' worth of raw Clojure events from the archive. Since the Clojure files are now renamed per the R91 release (https://snowplowanalytics.com/blog/2017/08/17/snowplow-r91-stonehenge-released-with-important-bug-fix/), we keep receiving the following error:

ERROR FileFormatWriter: Aborting job null. java.io.IOException: Not a file:

After staging has completed, the bucket structure looks like this:

run=2018-08-28-12-00-00
i-1/
  var_log_tomcat8_rotated_localhost_access_log.txt1502463662.gz
i-2/
  var_log_tomcat8_rotated_localhost_access_log.txt1502467262.gz

The ‘in’ bucket setting in the config looks like:

s3n://bucket-name/

Archive folders created prior to the R91 release do not trigger this issue. Is this the expected behavior?

ihor · August 30, 2018, 7:26pm

@digitaltouch, the problem seems to be due to the extra folder run=2018-08-28-12-00-00.

By the way, you could place your files into the “processing” bucket directly and run EmrEtlRunner with the --skip staging option.

digitaltouch · August 30, 2018, 7:43pm

@ihor,
Thanks for the quick reply. I had come to the same conclusion. It does make it rather difficult to reprocess more than one batch, since this is now the folder structure produced by the archive step. The files in each batch have to be copied out of their run= folder into the processing folder manually.
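
For anyone facing the same manual copy, here is a minimal sketch of how the key mapping could be scripted. It only plans the copies: it takes archive keys shaped like the listing above and strips the leading run=... folder, on the assumption (based on the reply above) that this extra folder is the only thing in the way. The function name plan_copies and the processing/ destination prefix are hypothetical, and the actual S3 copy is left to whatever tool you already use.

```python
import re
from typing import Dict, Iterable

# Matches a leading archive folder such as "run=2018-08-28-12-00-00/".
RUN_PREFIX = re.compile(r"^run=\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}/")


def plan_copies(archive_keys: Iterable[str],
                dest_prefix: str = "processing/") -> Dict[str, str]:
    """Map each archived key to a destination key without the run= folder.

    dest_prefix is a hypothetical location; adjust it to your own
    processing bucket layout.
    """
    plan: Dict[str, str] = {}
    used: Dict[str, str] = {}
    for key in archive_keys:
        if key.endswith("/"):
            # Skip folder placeholder keys; only real files are copied.
            continue
        dest = dest_prefix + RUN_PREFIX.sub("", key, count=1)
        if dest in used:
            # Two batches would overwrite each other; stop rather than lose data.
            raise ValueError(f"{key} and {used[dest]} both map to {dest}")
        used[dest] = key
        plan[key] = dest
    return plan


if __name__ == "__main__":
    keys = [
        "run=2018-08-28-12-00-00/i-1/var_log_tomcat8_rotated_localhost_access_log.txt1502463662.gz",
        "run=2018-08-28-12-00-00/i-2/var_log_tomcat8_rotated_localhost_access_log.txt1502467262.gz",
    ]
    for src, dst in plan_copies(keys).items():
        print(src, "->", dst)
```

The collision check is there because several batches can contain the same instance folder; if two files ever resolved to the same destination, the script fails instead of silently overwriting one of them.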
