I’ve gathered the logs from both regions into an intermediate bucket in the same region as the EMR cluster.
During that process I had to rename the files (see below) so they wouldn’t get overwritten, because I had log files with identical names from both the ap-southeast-1 and eu-west-1 regions.
resources/environments/logs/publish/e-smgk4gppuv/i-1a5c92f8e77605a3d _var_log_tomcat8_rotated_localhost_access_log.txt1506423661 -> _var_log_tomcat8_rotated_localhost_access_log.2017-09-26-12.eu-west-1.i-1a5c92f8e77605a3d.txt.gz
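For reference, the renaming I applied can be sketched as a small helper that splits off the rotation epoch and builds a region- and instance-qualified name. This is a hypothetical reconstruction of the scheme shown above (the hour here is derived in UTC, so it may differ from a locally derived hour; gzipping the file is a separate step not shown):

```python
import re
from datetime import datetime, timezone

def region_qualified_name(filename: str, region: str, instance_id: str) -> str:
    """Rename a rotated Tomcat access log so copies of the same file
    collected from different regions cannot collide in one bucket.

    e.g. ..._access_log.txt1506423661
      -> ..._access_log.<YYYY-MM-DD-HH>.<region>.<instance>.txt.gz
    """
    # Rotated logs end in ".txt" followed by the rotation epoch timestamp.
    m = re.match(r"^(.*)\.txt(\d+)$", filename)
    if not m:
        raise ValueError(f"unexpected log name: {filename}")
    base, epoch = m.group(1), int(m.group(2))
    # Bucket the timestamp by hour (UTC) to keep names readable and sortable.
    hour = datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d-%H")
    return f"{base}.{hour}.{region}.{instance_id}.txt.gz"
```
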
Then I set the `in` bucket to that intermediate location and started EmrEtlRunner (which appended the region and bucket folder to the file names):
MOVE tracking-snapshots/events/snowplow-raw/_var_log_tomcat8_rotated_localhost_access_log.2017-09-26-12.eu-west-1.i-1a5c92f8e77605a3d.txt.gz -> snowplow-bucket-data/processing/_var_log_tomcat8_rotated_localhost_access_log.2017-09-26-12.eu-west-1.i-1a5c92f8e77605a3d.txt.eu-west-1.snowplow-raw.gz
Which probably caused this error:
D, [2017-09-26T17:32:18.692000 #9405] DEBUG -- : EMR jobflow j-3QMDJ1Q1LCRTR started, waiting for jobflow to complete...
F, [2017-09-26T17:42:20.277000 #9405] FATAL -- :
Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-3QMDJ1Q1LCRTR failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow ETL: TERMINATING [STEP_FAILURE] ~ elapsed time n/a [2017-09-26 17:38:51 UTC - ]
- 1. Elasticity Scalding Step: Enrich Raw Events: COMPLETED ~ 00:01:57 [2017-09-26 17:38:56 UTC - 2017-09-26 17:40:53 UTC]
- 2. Elasticity S3DistCp Step: Enriched HDFS -> S3: FAILED ~ 00:00:14 [2017-09-26 17:40:53 UTC - 2017-09-26 17:41:08 UTC]
- 3. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
- 4. Elasticity Scalding Step: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]
- 5. Elasticity S3DistCp Step: Enriched HDFS _SUCCESS -> S3: CANCELLED ~ elapsed time n/a [ - ]):
Edit: the EMR logs provide more detail:
Input path does not exist: hdfs://ip-172-31-28-180.eu-west-1.compute.internal:8020/tmp/5a58078b-a2b2-4f4d-a0b8-8b90bf70e3bc/files
Any suggestions for getting around this issue?
How can files from different regions with the same timestamps coexist?