Monitoring for failed ETL jobs (batch pipeline)


#1

Hey Snowplowers, I expect that once in a while our EmrEtlRunner/StorageLoader applications will fail (for instance, it happened a few weeks ago when our Elasticsearch cluster was down), and I want to make sure I am notified automatically when that happens.

Is there a way to get an email alert when the EMR job fails, using CloudWatch (or something else)?


#2

If you’re doing this without modifying the EMR job itself, there are a few possible options:

  1. Write a script that runs regularly, queries your data sink (Elasticsearch) for the maximum date, and sends an email (either directly or via SNS) if that date is older than a certain threshold.
  2. Run a different script (using cron or an alternative) to look for EMR jobs that have failed recently. To list clusters that were created in the last 6 hours and have been marked as failed (note that the filter applies to creation time, not failure time), you could use the aws-cli to do something similar to:
    aws emr list-clusters --created-after $(date --date "6 hours ago" +%Y-%m-%dT%H:%M:%S) --failed
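
As a rough sketch of how option 2 could be turned into an alert, the functions below wrap the command above and publish to SNS when any failed cluster is found. This assumes GNU date, a configured aws-cli, and an SNS topic with an email subscription that you have already created; the function names and the topic ARN you pass in are illustrative, not part of Snowplow.

```shell
#!/usr/bin/env bash
# Sketch only: alert on EMR clusters created in the last N hours that FAILED.
# Assumptions: GNU date, a configured aws-cli, and an existing SNS topic
# (its ARN is supplied by the caller).

# Print the IDs of failed clusters created within the last $1 hours, one per line.
failed_clusters() {
  local hours="$1" since
  since="$(date --utc --date "${hours} hours ago" +%Y-%m-%dT%H:%M:%SZ)"
  aws emr list-clusters --created-after "$since" --failed \
    --query 'Clusters[].Id' --output text \
    | tr '\t' '\n' | sed -e '/^$/d' -e '/^None$/d'
}

# Publish one SNS message listing the failed clusters, if there are any.
# Returns 1 when an alert was sent, 0 when nothing had failed.
check_and_alert() {
  local hours="$1" topic_arn="$2" ids
  ids="$(failed_clusters "$hours")"
  if [ -z "$ids" ]; then
    return 0
  fi
  aws sns publish --topic-arn "$topic_arn" \
    --subject "EMR job failure" \
    --message "Failed EMR clusters in the last ${hours}h: $(echo "$ids" | tr '\n' ' ')"
  return 1
}
```

From cron you would source this file and call something like `check_and_alert 6 "$YOUR_TOPIC_ARN"` at an interval no longer than the look-back window, accepting that a long-lived failure may be reported more than once.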

#3

If you want to catch failures in EmrEtlRunner/StorageLoader when they occur, the safest approach is to wrap the execution of both apps in a monitoring script.

For example, if you are running it in cron, then cronic is a pretty good monitoring wrapper. At the other end of the scale, if you are using something enterprise-y like Chronos on Mesos, that will have failure notification built-in.

Because a lot of the jobflow of EmrEtlRunner/StorageLoader doesn’t (currently) take place in EMR, it’s really important to capture the full stdout/stderr from a failed run, so you know precisely where to restart the run from. Without that output, you often have to do some detective work to figure out where to resume from (“I can see data in Redshift but some data still in shredded/good, so presumably the archive of shredded events failed partway through?”).
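
To make the wrapper idea concrete, here is a minimal sketch of a hand-rolled monitoring function that keeps the full stdout/stderr of a run and only notifies on a non-zero exit. It is not cronic itself, just the same general pattern. `run_monitored`, `notify_stderr` and the `NOTIFY_CMD` hook are hypothetical names, and the commands you pass in would be your own EmrEtlRunner and StorageLoader invocations.

```shell
#!/usr/bin/env bash
# Sketch only: run a command, keep its combined stdout/stderr in a log file,
# and call a notifier when the command exits non-zero.
# Assumption: NOTIFY_CMD is a hypothetical hook naming a command or function
# that accepts "<subject> <logfile>"; it defaults to notify_stderr below.

run_monitored() {
  local log="$1"; shift
  local status=0
  # Capture stdout and stderr together so the failing step appears in order.
  "$@" >"$log" 2>&1 || status=$?
  if [ "$status" -ne 0 ]; then
    "${NOTIFY_CMD:-notify_stderr}" "FAILED (exit $status): $*" "$log"
  fi
  return "$status"
}

# Default notifier: write the subject and the captured log to stderr.
# Under cron with MAILTO set, any output like this is emailed to you.
notify_stderr() {
  { echo "$1"; cat "$2"; } >&2
}
```

Chaining two calls with `&&` (the first wrapping EmrEtlRunner, the second wrapping StorageLoader, each with its own log file) means the loader only runs if the ETL step succeeded, and the log of whichever step failed is what gets sent.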


#4

Great suggestions, guys, thanks.


#5

Ah yes, thanks for that, I’ve edited that now.