[IMPORTANT ALERT] Iglu Central bug impacting Android event collection over the weekend

Summary

If you use the Android Tracker with the geolocation context explicitly switched on, all of your Android events over the weekend will have erroneously failed validation. You will need to run Hadoop Event Recovery to retrieve them.

Background

On Fri Jun 17 at 12:34pm UTC we pushed R58 of Iglu Central, which contained amongst other things a patch of the schema:

com.snowplowanalytics.snowplow/geolocation_context/jsonschema/1-0-0

Unfortunately the post-patch version of the schema had a showstopper bug in it, causing 100% of self-describing JSONs referencing this schema to fail validation. We fixed this bug as soon as we detected it, by releasing R59 of Iglu Central this morning, Mon Jun 20 at 08:39 UTC.

We are extremely sorry for the error.

Who is affected

The only tracker using this schema is the Android Tracker; all versions of the Android Tracker use this version of the schema, 1-0-0, rather than the newer version, 1-1-0, which was unaffected. The JavaScript and Objective-C Trackers have used the unaffected version 1-1-0 from the start; the ActionScript 3 Tracker does not set the geolocation context.

The Android Tracker does not have the geolocation context on by default. You have to explicitly enable it per the user guide.

If you have enabled the geolocation context for the Android Tracker, then the affected schema is attached to all events. As a result, all Android events processed by the batch or stream pipeline between these dates will have failed validation:

2016-06-17 12:34 UTC to 2016-06-20 08:39 UTC

This will likely have translated to a material drop in your event volumes over the weekend.
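
As a rough aid for checking whether a given record falls inside the affected window, here is a minimal Python sketch. The boundaries are the two times quoted in this alert; the function name is hypothetical, and it assumes you have a timezone-aware timestamp for when the pipeline processed the event (not when the device created it).

```python
from datetime import datetime, timezone

# Window boundaries taken from this alert (both in UTC).
WINDOW_START = datetime(2016, 6, 17, 12, 34, tzinfo=timezone.utc)
WINDOW_END = datetime(2016, 6, 20, 8, 39, tzinfo=timezone.utc)


def in_affected_window(processed_at: datetime) -> bool:
    """Return True if a timezone-aware processing time falls inside the window.

    Hypothetical helper: pass the time the pipeline validated the event,
    since that is what determines whether the broken schema was in use.
    """
    return WINDOW_START <= processed_at <= WINDOW_END
```

Passing a naive datetime (one without tzinfo) raises a TypeError on comparison, so convert timestamps to UTC first.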

If you are loading your bad rows into Elasticsearch, you can verify the problem by looking in Kibana for error messages containing the following string:

 "numeric instance is greater than the required maximum (maximum: -90"
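
The "(maximum: -90" fragment indicates that a numeric field was being checked against an upper bound of -90. The sketch below is only an illustration of why such a bound rejects almost every value: it mimics the inclusive minimum/maximum semantics of JSON Schema in plain Python. Which field carried the bad bound is an assumption here (latitude is used as the example), and the ", found: ..." suffix of the message is likewise assumed rather than quoted from this alert.

```python
def check_bounds(value, minimum=None, maximum=None):
    """Return a list of error strings for an inclusive minimum/maximum check.

    Toy stand-in for a JSON Schema validator; the message wording is
    modelled on the string quoted in this alert.
    """
    errors = []
    if minimum is not None and value < minimum:
        errors.append(
            f"numeric instance is lower than the required minimum "
            f"(minimum: {minimum}, found: {value})"
        )
    if maximum is not None and value > maximum:
        errors.append(
            f"numeric instance is greater than the required maximum "
            f"(maximum: {maximum}, found: {value})"
        )
    return errors


# Hypothetical latitude value checked against a sane range and a broken one.
sane = check_bounds(51.5, minimum=-90, maximum=90)     # no errors
broken = check_bounds(51.5, minimum=-90, maximum=-90)  # one "maximum" error
```

With an upper bound of -90, any value above -90 fails, which is consistent with every event carrying the context being rejected.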

How to fix the issue

Because we have patched the affected schema in place in Iglu Central, the problem is resolved going forward. However, you will still need to recover the Android events that erroneously failed validation over the weekend.

You will need to use Hadoop Event Recovery to do this. We are working through this process now and will update this section when we have further guidance.

How we stop this happening again

Operating Iglu Central is a major responsibility, being a critical dependency of Snowplow as well as other software systems. We take this responsibility extremely seriously, but it is clear that our processes for releasing updates to Iglu Central have some serious shortcomings.

It is no surprise that this issue was caused by a schema patch: in-place patches, or mutations, of existing known-to-work schemas are problematic at the best of times; when that schema is available in Iglu Central as a core building block of Snowplow, the potential impact of a bad update is even greater.

The solution to this problem is still under discussion, but it’s highly likely that it will involve:

  • A staging version of Iglu Central, always available
  • The ability to run our largest automated test suites (e.g. Snowplow's own) against the staging version of Iglu Central
  • Only allowing pushes to (main) Iglu Central if all such tests pass
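
To make the idea in the list above concrete, here is a minimal sketch of such a gate in Python. It is not our actual deployment tooling: the function names, the schema keys and the toy validator (which checks only minimum/maximum on top-level properties) are all hypothetical. The point is simply that known-good sample payloads are validated against the staged schemas, and promotion is refused if any of them fail.

```python
from typing import Callable, Dict, List, Tuple

Validator = Callable[[dict, dict], List[str]]


def validate_bounds(schema: dict, instance: dict) -> List[str]:
    """Toy validator: inclusive minimum/maximum on top-level properties only."""
    errors = []
    for name, rules in schema.get("properties", {}).items():
        if name not in instance:
            continue
        value = instance[name]
        if "minimum" in rules and value < rules["minimum"]:
            errors.append(f"{name}: {value} is below minimum {rules['minimum']}")
        if "maximum" in rules and value > rules["maximum"]:
            errors.append(f"{name}: {value} is above maximum {rules['maximum']}")
    return errors


def can_promote(
    staged: Dict[str, dict],
    known_good: Dict[str, List[dict]],
    validate: Validator = validate_bounds,
) -> Tuple[bool, Dict[str, List[str]]]:
    """Validate every known-good sample against its staged schema.

    Returns (ok, failures), where failures maps schema key to error strings.
    """
    failures: Dict[str, List[str]] = {}
    for key, schema in staged.items():
        for sample in known_good.get(key, []):
            errs = validate(schema, sample)
            if errs:
                failures.setdefault(key, []).extend(errs)
    return (not failures, failures)


# Hypothetical staged schemas and one known-good sample payload.
samples = {"geo/1-0-0": [{"latitude": 51.5, "longitude": -0.1}]}
good = {"geo/1-0-0": {"properties": {"latitude": {"minimum": -90, "maximum": 90}}}}
bad = {"geo/1-0-0": {"properties": {"latitude": {"minimum": -90, "maximum": -90}}}}

ok_good, _ = can_promote(good, samples)
ok_bad, failures_bad = can_promote(bad, samples)
```

In a real gate the toy validator would be replaced by a full JSON Schema validator, and the samples would come from the test suites mentioned above.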

These ideas are being discussed further in ticket #358, Make deployment dependent on vigorous test suite.

Thanks for the heads up! Will reprocess the weekend events.

Cheers,
Bernardo