Iglu Central is down & associated S3 issues


#1

There is a major Amazon S3 outage, which started at approximately 17:43 UTC today and is still ongoing. The AWS status page claims that this only affects us-east-1; however, we believe that it is affecting all regions.

You can track the issue here:

https://status.aws.amazon.com/

There are two major impacts on Snowplow:

  1. Iglu Central is served by CloudFront and backed by an S3 bucket in us-east-1. Iglu Central is not currently available, which means that all Snowplow events will fail validation
  2. Snowplow AWS batch pipelines are failing as they attempt to read from or write to Amazon S3; Snowplow AWS real-time pipelines are failing to sink data from Kinesis to Amazon S3

We will update this thread as we learn more.
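
For anyone who wants to check impact 1 from their own environment, here is a minimal sketch of a reachability probe. It assumes that Iglu Central serves each schema as a static file under http://iglucentral.com/schemas/ at a path mirroring the iglu: URI; the function names and the injectable `opener` parameter are illustrative only and not part of any Snowplow tooling.

```python
from urllib.request import urlopen

# Assumption: Iglu Central exposes schemas as static files under /schemas/.
IGLU_CENTRAL = "http://iglucentral.com"


def schema_url(iglu_uri, base=IGLU_CENTRAL):
    """Map an iglu: URI to the URL where the schema is assumed to be hosted."""
    prefix = "iglu:"
    if not iglu_uri.startswith(prefix):
        raise ValueError("not an Iglu URI: %r" % iglu_uri)
    return "%s/schemas/%s" % (base.rstrip("/"), iglu_uri[len(prefix):])


def is_available(iglu_uri, opener=urlopen, timeout=5):
    """Return True if the schema is fetched with HTTP 200, False on any error.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(schema_url(iglu_uri), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are both subclasses of OSError.
        return False
```

Calling `is_available("iglu:com.snowplowanalytics.snowplow/link_click/jsonschema/1-0-1")` during the outage would be expected to return False, but a True result only shows that one schema is reachable from one location.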


#2

What’s the downstream impact of 1? Will data be lost?


#3

We are on the batch pipeline with the Clojure Collector… it will be interesting to see whether Elastic Beanstalk log rotations are retried or our raw logs are lost for good. Does anyone know?


#4

Hi @dean - the impact of 1 is that events will fail validation and be stored in the bad bucket.

We have paused all of our batch pipelines for Managed Service customers to prevent this from occurring. We recommend open source users of Snowplow also pause their batch pipelines until the underlying issue is fixed by Amazon.


#5

Hi @travisdevitt - if the hourly rotation by the Clojure Collector fails, it will try again on the next hour. As long as you provisioned your Clojure Collector instances with sufficient hard disk headroom (so you don’t max out the local disks), you should see the events finally being rotated once the underlying issue is fixed by Amazon.
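
To get a rough feel for how much headroom an instance has, a back-of-the-envelope sketch like the following may help. The hourly log volume is something you would have to measure yourself, and both function names are hypothetical; this is not how the Clojure Collector itself tracks disk usage.

```python
import shutil


def hours_of_headroom(free_bytes, log_bytes_per_hour):
    """Estimate how many hours of un-rotated logs fit in the free space."""
    if log_bytes_per_hour <= 0:
        raise ValueError("log_bytes_per_hour must be positive")
    return free_bytes / log_bytes_per_hour


def local_headroom(path, log_bytes_per_hour):
    """Same estimate, using the free space of the disk that holds `path`."""
    return hours_of_headroom(shutil.disk_usage(path).free, log_bytes_per_hour)
```

For example, with 10 GiB free and roughly 512 MiB of collector logs per hour, `hours_of_headroom(10 * 1024**3, 512 * 1024**2)` gives 20.0 hours before the local disk fills up.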


#6

It looks like some other services have been severely impacted in us-east-1 as well including EFS, EC2, autoscaling and RDS.

If you have a login to the Amazon console, you can see this information here, with updates for each affected service.


#7

From the AWS Twitter account

https://twitter.com/awscloud

For S3, we believe we understand root cause and are working hard at repairing. Future updates across all services will be on dashboard.


#8

As of 20:42 UTC, Iglu Central is back online (although the AWS service dashboard continues to report S3 and CloudFront issues in us-east-1). We are continuing to investigate what service outages are ongoing.


#9

We believe that there are still issues writing files to S3.

When this has recovered, you will want to resume any Snowplow pipelines which failed partway through. We strongly recommend deleting the enriched/shredded data belonging to any such partial pipeline run and resuming that run from the start of the EMR stage; this is to recover any events which incorrectly failed validation during that partial run, due to the Iglu Central outage.
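
As a hedged illustration of the clean-up step, the sketch below only selects the object keys that belong to one partial run; it does not delete anything or talk to S3. It assumes your enriched and shredded output is stored under keys containing a `run=<timestamp>` folder beneath `enriched/` and `shredded/` prefixes. Check this against your own bucket layout before relying on it. The function name and the example keys are made up for illustration.

```python
def keys_for_partial_run(keys, run_id, prefixes=("enriched/", "shredded/")):
    """Return the enriched/shredded keys that belong to the given run folder.

    Assumes each run writes under '<prefix>.../run=<run_id>/'.
    """
    marker = "/run=%s/" % run_id
    wanted = tuple(prefixes)
    return [key for key in keys if key.startswith(wanted) and marker in key]


# Hypothetical listing of a bucket after a run that failed partway through.
listing = [
    "enriched/good/run=2017-02-28-18-00-00/part-00000",
    "enriched/bad/run=2017-02-28-18-00-00/part-00000",
    "shredded/good/run=2017-02-28-18-00-00/atomic-events/part-00000",
    "enriched/good/run=2017-02-28-12-00-00/part-00000",
    "raw/processing/2017-02-28-18.log",
]
to_delete = keys_for_partial_run(listing, "2017-02-28-18-00-00")
```

Here `to_delete` holds the first three keys: the earlier, complete run and the raw logs still in processing are left alone, so the run can be restarted from the EMR stage with its input intact.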


#11

Apparently back up now.

02:11 PM PST As of 1:49 PM PST, we are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally.


#12

One note for CloudFront Collector users.

It seems the log files that were missing yesterday are now appearing in the associated S3 buckets. We noticed a delay of between 8 and 12 hours. Hopefully no data will be lost.


#13

We are looking at a 24-hour delay here: the Snowplow "in" folder contains files with a 2017-02-28 timestamp.


#14

Yes, the delay is much bigger, and there were far more very small log files than normal, but in the end it seems nothing was lost…


#15

Thanks @NirSivan and @ecoron for the additional information for CloudFront Collector users!


#16

AWS has published a detailed post-mortem on the outage here:

https://aws.amazon.com/message/41926/