AWS EMR Bootstrapping Incident Review

Last night some of our batch pipeline users experienced an issue during EMR cluster bootstrap.
The issue manifested itself as an EmrEtlRunner failure with the following logs:

W, [2020-01-15T16:25:50.618136 #16932]  WARN -- : Job failed. 2 tries left...
W, [2020-01-15T16:25:50.624329 #16932]  WARN -- : Bootstrap failure detected, retrying in 81 seconds...

The cause of this outage is that Maven Central, a registry for Java assets, turned off access to hosted assets over HTTP: https://blog.sonatype.com/central-repository-moving-to-https
Unfortunately, this change went unnoticed by us and we couldn’t address it before it took effect globally. EmrEtlRunner uses Maven Central to download the Apache Commons Codec library, which replaces a legacy one bundled with the EMR AMI. After Maven Central closed HTTP access, bootstrap scripts started to fail, preventing all clusters from starting.
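
As a rough illustration of the kind of change the bootstrap scripts needed, the sketch below rewrites a plain-HTTP Maven Central URL to HTTPS. It is not the actual hotfix: the function name `to_https`, the host list, and the artifact version in the usage line are assumptions made for illustration only.

```python
from urllib.parse import urlsplit

# Hosts assumed to serve Maven Central; adjust to whatever your scripts use.
MAVEN_CENTRAL_HOSTS = {"repo1.maven.org", "repo.maven.apache.org"}


def to_https(url: str) -> str:
    """Switch an http:// Maven Central URL to https://, leaving other URLs alone."""
    parts = urlsplit(url)
    if parts.scheme == "http" and parts.hostname in MAVEN_CENTRAL_HOSTS:
        # SplitResult is a namedtuple, so _replace returns a modified copy.
        return parts._replace(scheme="https").geturl()
    return url


# Hypothetical artifact path, used only to show the rewrite.
print(to_https("http://repo1.maven.org/maven2/commons-codec/commons-codec/1.10/commons-codec-1.10.jar"))
```

URLs on other hosts, and URLs that already use HTTPS, are returned unchanged.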

This issue affected users who use transient (non-persistent) AWS EMR clusters with EmrEtlRunner.
It did not affect GCP pipelines or real-time AWS pipelines loading data to Snowflake.

It’s worth mentioning that this incident impacted legacy pipelines the most. RT pipelines, which are currently the recommended setup, were not affected at all. Releases older than R102 received the fix with a longer delay because we have no observability over older pipelines. We encourage all our OSS users to use the latest versions of the Snowplow pipeline.

Timeline

  • 4:00 PM UTC our Support Engineers noticed several failures across batch pipelines
  • 5:00 PM UTC we identified the issue and started to prepare a hotfix
  • 6:00 PM UTC we prepared and rolled out the snowplow-ami5-bootstrap-0.1.0.sh hotfix for all regions except ap-southeast-2. This script is used by all Snowplow R102+ releases
  • 6:50 PM UTC we noticed ap-southeast-2 was missed and fixed it as well. At this point all our managed pipelines were fixed and recovered
  • 9:07 PM UTC we received the first report from OSS users telling us that their pipeline was still failing. This was due to an old EmrEtlRunner, which is no longer in use inside Snowplow
  • 11:00 PM UTC we rolled out the snowplow-ami4-bootstrap-0.1.0.sh hotfix, which unfortunately fixed only pre-R82 EmrEtlRunners
  • 11:00 AM UTC the following day we rolled out the snowplow-ami4-bootstrap-0.2.0.sh hotfix, which fixed the remaining EmrEtlRunners

What’s next

We’re planning an exhaustive audit of our components to find even the slightest dependencies on third-party data or service providers and eliminate as many of them as possible.
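
One small piece of such an audit could be sketched as a scan for plain-HTTP URLs in script text. This is only an illustration under stated assumptions, not our actual audit tooling: the helper name `find_plain_http_urls` and the sample script contents are hypothetical.

```python
import re
from typing import List

# Matches plain-HTTP URLs up to the next whitespace, quote, or bracket character.
HTTP_URL = re.compile(r"http://[^\s\"'<>)]+")


def find_plain_http_urls(text: str) -> List[str]:
    """Return the distinct http:// URLs found in text, in order of first appearance."""
    seen: List[str] = []
    for url in HTTP_URL.findall(text):
        if url not in seen:
            seen.append(url)
    return seen


# Hypothetical script contents, used only to exercise the helper.
sample_script = 'wget "http://repo1.maven.org/maven2/a.jar"\ncurl https://example.com/b\n'
print(find_plain_http_urls(sample_script))
```

A scan like this only catches URLs written literally in the text; hosts assembled from variables at runtime would need a closer review.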


Thanks anton.

What changes (if any) do we need to make to our configurations to deploy the fix?

@iain, no changes are needed in your configuration file. The bootstrap script that runs depends on the version of EmrEtlRunner you use, and we have fixed those scripts to access the Maven repository in line with the announced changes.


Thanks for the details, the transparency around this is appreciated by everyone.

I’m looking to move to the streaming architecture soon since batch is now deprecated. In the meantime, what is the current set of versions to use for this? I have been searching and am not quite sure what combination is recommended for these:

  • emr ami version?
  • spark_enrich version?

For example with emr r117 biskupin (I know r118 is on the releases list but it is not yet available via bintray so I am ignoring it)

This sample config.yml specifies:

emr:
  ami_version: 5.9.0

The latest EMR release is 5.29.0, and all EMR versions from 5.27.0 onwards support Spark 2.4.4.

The r117 release blog shows an example where the spark_enrich version is 1.19.0

enrich:
  version:
    spark_enrich: 1.19.0

But then directly below that it links to the 1.18.0 jar:

"or directly make use of the new Spark Enrich available at:

s3://snowplow-hosted-assets/3-enrich/spark-enrich/snowplow-spark-enrich-1.18.0.jar."

I am assuming it should be 1.19.0 as per the GitHub release notes.

Hi @davehowell,

It seems it was an oversight in the blog post. The latest Spark Enrich version is indeed 1.19.0 and the latest AMI is 5.9.0. We’ll fix this in the blog post.

However, since you mentioned you’re planning to move to the streaming architecture, I just wanted to point out that Spark Enrich is part of the batch architecture and is already deprecated in R118. In the streaming architecture on AWS, Stream Enrich is responsible for enrichment, and RDB Shredder is the last component that requires EMR. You can have a look at this post, which highlights the differences and provides a step-by-step migration guide.