EmrEtlRunner running for days at Step "Shred Enriched Events"

Hello,

I am running Snowplow EMR ETL Runner R91 on a cluster of 10 m4.4xlarge nodes. The job runs up until the “Elasticity Spark Step: Shred Enriched Events” step, at which point it will run for days and never finish. This doesn’t happen every run, and there appears to be little to no pattern as to when it will run for days. While it runs for days it creates a backlog of files to process, which is a big hassle.
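For reference, the cluster sizing comes from the emr section of my config.yml, roughly like this (the master instance type is just an example value; everything else is omitted):

emr:
  jobflow:
    master_instance_type: m4.large    # example value, not significant here
    core_instance_count: 10
    core_instance_type: m4.4xlarge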

When I click through the EMR dashboard to the stderr logs, I just see days’ worth of:

18/04/08 18:05:45 INFO Client: Application report for application_1522951903812_0005 (state: RUNNING)
18/04/08 18:05:46 INFO Client: Application report for application_1522951903812_0005 (state: RUNNING)
18/04/08 18:05:47 INFO Client: Application report for application_1522951903812_0005 (state: RUNNING)
18/04/08 18:05:48 INFO Client: Application report for application_1522951903812_0005 (state: RUNNING)

Comparing the stderr logs, there appears to be no difference between runs, except that a run which doesn’t drag on for days eventually reaches a finished state.

I don’t believe that my nodes are running out of disk space: [screenshot of node disk usage]

If anyone has run into this issue before and successfully solved it, or has any ideas about how to approach it, I would greatly appreciate the knowledge!

Hey @frankcash, I think you can find the cause of the failure in the YARN container logs. Specifically, somewhere in containers/application_1522951903812_0005/.
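The layout on S3 looks roughly like this (the container IDs below are examples; the first container is usually the Spark driver / ApplicationMaster):

containers/application_1522951903812_0005/
  container_1522951903812_0005_01_000001/    # usually the Spark driver (ApplicationMaster)
    stderr.gz
    stdout.gz
  container_1522951903812_0005_01_000002/    # an executor
    stderr.gz
    stdout.gz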

Anton, sorry for the delayed response. Where in AWS would I go to find the logs created by a specific “container”? Thanks!

Historical logs end up on S3 (the bucket will depend on your emr-etl-runner configuration), but they can also be browsed in the AWS console: on the cluster’s ‘Summary’ tab, under ‘Configuration Details’, click the folder icon next to ‘Log URI’. From there you can navigate to the containers directory, which contains the logs for each application.
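If you prefer the CLI, you can list and pull the same logs down with something like the following (the bucket, prefix and cluster ID are placeholders; substitute the values from your own Log URI):

# list the container logs for the application
aws s3 ls s3://your-log-bucket/your-log-prefix/j-XXXXXXXXXXXXX/containers/application_1522951903812_0005/ --recursive

# copy them locally for inspection
aws s3 cp s3://your-log-bucket/your-log-prefix/j-XXXXXXXXXXXXX/containers/application_1522951903812_0005/ . --recursive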

If you’re looking for logs while a run is in progress, there’s typically a latency of a few minutes between logs being generated by a given application on the cluster and the equivalent logs showing up in S3. In that case you can SSH into the EMR master node and run yarn logs -applicationId application_1522951903812_0005, which will print the logs to stdout.
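Roughly like this (the key path and master node hostname are placeholders for your own cluster):

# SSH to the EMR master node (key file and hostname are placeholders)
ssh -i ~/your-emr-key.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# print the aggregated YARN logs for the application to stdout,
# or redirect them to a file for easier searching
yarn logs -applicationId application_1522951903812_0005 > application_0005_logs.txt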