Snowplow EMR jobflow error

My EMR jobflow is failing with the following error:
Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-3VIIWNRZQ5WCR failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.

It fails before the data enrichment step.

A snapshot of my config.yml:

emr:
  ami_version: 3.6.0
enrich:
  job_name: Snowplow canvas ETL
  versions:
    hadoop_enrich: 1.5.1
    hadoop_shred: 0.7.0
    hadoop_elasticsearch: 0.1.0

I have been stuck on this problem for a few days and it is blocking other business flows. Any help here would be appreciated.

Thanks in advance,
Malathi

The EMR AMI 3.6.0 has been deprecated; I would suggest moving to 3.11.0.
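
For reference, that would be a one-line change in the emr section of your config.yml (the surrounding keys below are taken from your snippet above; everything else stays as you have it):

emr:
  ami_version: 3.11.0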

Also, please consider upgrading your Snowplow installation as this one is quite dated.


Thanks!

Could you help me with how to upgrade Snowplow?

Also, I need help moving the past few days' data from the archive to the raw_logs folder.
If the file name format were the same, I could move it myself, but the files in archive and raw_logs have different file name formats. Could you help with this?

Thanks in advance,
Malathi

Could you help me with how to upgrade Snowplow?

There is an upgrade guide in the wiki. However, in your case it might be easier to start from scratch with the latest version.

Also, I need help moving the past few days' data from the archive to the raw_logs folder.
If the file name format were the same, I could move it myself, but the files in archive and raw_logs have different file name formats. Could you help with this?

Could you describe your problem in more detail?

The current location of the logs (after the job failed for the past few days): s3://my-bucket/archive/2017-12-29/
Format of the file names: ..raw_logs.gz

Format required in raw_logs: .gz

I wanted to know if just moving the logs from archive to raw_logs will do, or is anything else required?

No, your best bet is to move the archived files to the processing location and then run the pipeline skipping staging. All the filename changes happen in the staging phase, and that renaming has already been done for these files, so staging can be skipped.
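
As a rough sketch, assuming your archive and processing locations look like the path you mentioned above (the processing bucket path here is an assumption; use the in/processing location defined in your config.yml, and note that newer runner versions also require a --resolver argument):

  # move the already-renamed files from the archive back into the processing location (path is an assumption)
  aws s3 mv s3://my-bucket/archive/2017-12-29/ s3://my-bucket/processing/ --recursive

  # re-run the pipeline, skipping the staging (renaming) step
  ./snowplow-emr-etl-runner --config config/config.yml --skip staging

You would repeat the move for each archived run folder you want to reprocess before kicking off the runner.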

Thanks for the valuable suggestion. By upgrading Snowplow, we can now get the latest data into Redshift.
But our job failed for the last 15 days, and all of that data is in the archive folder, not in Redshift.
It would be great if you could suggest how to get the last 15 days of data (from when the job failed) into Redshift.

We tried --skip staging, but to no effect. Any help here would be appreciated.

Thanks again,
Malathi