It took nearly 2 days to figure this out, so I thought I'd better share it in case it helps others.
When we were running the Enrich process, all of the events/logs were ending up in the bad bucket with the following error:
“Access log TSV line contained 33 fields, expected 12, 15, 18, 19, 23, 24 or 26”
We managed to figure out what was causing this.
In our config.yml file we were using:
collectors:
  format: tsv/com.amazon.aws.cloudfront/wd_access_log # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
Changing “tsv/com.amazon.aws.cloudfront/wd_access_log” to instead just be “cloudfront” fixed the issue. Now nearly all of our events end up in the /archive bucket.
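For anyone who wants to compare against their own file, this is roughly what the relevant part of our config.yml looks like after the change. It is only a sketch of the collectors block: it assumes the rest of the file stays as it was, and the comment is mine, not from config.yml.sample.

```yaml
collectors:
  # Previously: tsv/com.amazon.aws.cloudfront/wd_access_log
  # That value sent every line to the bad bucket in our setup.
  format: cloudfront
```

This worked for our setup (versions below); I have not tested it with other collector types.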
We are using JS 2.12 and EmrEtlRunner R119.
The config.yml.sample comments should be updated to reflect this.