There have been a few occasions when enriched events were not copied to S3, due to this error:
```
Exception in thread "main" java.io.IOException: Error opening job jar: /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar
	at org.apache.hadoop.util.RunJar.run(RunJar.java:160)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.util.zip.ZipException: zip file is empty
	at java.util.zip.ZipFile.open(Native Method)
```
Re-running the etl-runner with staging skipped fixes the issue. I found troubleshooting tips here:
I realise issues like this are hard to eliminate entirely, but most users presumably run their Snowplow pipeline from a scheduler such as Jenkins, so having to re-run manually is not ideal: the pipeline is not self-healing. Has anyone else hit this? Any ideas?
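As an interim workaround for scheduled runs, the manual recovery step can be automated: try the normal run once, and on failure re-run with staging skipped. This is only a sketch; the runner path and config file name are placeholders, and it assumes EmrEtlRunner's `--skip staging` option, which is what the manual fix above amounts to.

```shell
#!/bin/sh
# retry_skip_staging: run the given command once; if it fails,
# run it again with --skip staging appended (assumed workaround
# for the empty s3-dist-cp.jar failure described above).
retry_skip_staging() {
  "$@" && return 0
  echo "first attempt failed; re-running with --skip staging" >&2
  "$@" --skip staging
}

# Intended usage from a Jenkins job step (paths hypothetical):
# retry_skip_staging ./snowplow-emr-etl-runner --config config.yml
```

This keeps the scheduler in control: the job only goes red if the skip-staging retry also fails, rather than on every transient jar error.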
Here’s my config:
```yaml
emr:
  ami_version: 4.5.0
  region: eu-central-1
  jobflow_role: EMR_EC2_DefaultRole
  service_role: EMR_DefaultRole
  placement:
  ec2_subnet_id: subnet-[...]
  ec2_key_name: my_key
  bootstrap: []
  software:
    hbase:
    lingual:
  jobflow:
    master_instance_type: m4.large
    core_instance_count: 3
    core_instance_type: c3.4xlarge
    task_instance_count: 0
    task_instance_type: c4.large
    task_instance_bid:
  bootstrap_failure_tries: 3
  additional_info:
collectors:
  format: clj-tomcat
enrich:
  job_name: snowplow ETL
  versions:
    hadoop_enrich: 1.8.0
    hadoop_shred: 0.10.0
    hadoop_elasticsearch: 0.1.0
  continue_on_unexpected_error: false
  output_compression: GZIP
```