Enrich schema resolver did not restart


#1

We had a glitch on AWS for around 15 or 20 seconds that caused DNS lookups to stop working. Enrich was unable to write to Kinesis during that time, and after a short while the Kinesis client restarted; however, it seems that the Enrich schema resolver never recovered from the glitch. So for the next little while we saw good data going to the bad stream with messages like:

"errors": [
{
"level": "error",
"message": "error: Could not find schema with key iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0 in any repository, tried:\n    level: \"error\"\n    repositories: [\"TGAMIgluRepo [HTTP]\",\"Iglu Client Embedded [embedded]\",\"IgluCentral [HTTP]\"]\n"
},{
"level": "error",
"message": "error: Unknown host issue fetching iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0 in HTTP Iglu repository IgluCentral: iglucentral.com\n    level: \"error\"\n"
}
]

Has anyone experienced this before? How can we prevent it from happening again?
If we had not been monitoring the bad stream on a constant basis, we could have had hours of good data go missing.


#2

This may have been caused by the enricher caching those schemas as ‘bad’ when they didn’t resolve during the brief DNS outage. After DNS lookups had recovered these Iglu lookups still would have been hitting the bad cache. The easiest way to evict this cache is to restart the enrichment process.

If you haven’t already, I’d also recommend setting CloudWatch alarms on your bad Kinesis stream (both for excessive traffic and for no traffic at all), which would help alert you to this issue in the future.
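As a rough sketch, an alarm like this can be created with the AWS CLI against the bad stream's `IncomingRecords` metric. The stream name, SNS topic ARN, and thresholds below are placeholders — adjust them for your pipeline:

```shell
# Hypothetical example: fire an alarm whenever the bad stream receives
# any records in a 5-minute window (threshold 0, GreaterThanThreshold).
# A second alarm on the *good* stream with LessThanOrEqualToThreshold
# would cover the "no traffic" case.
aws cloudwatch put-metric-alarm \
  --alarm-name enriched-bad-stream-traffic \
  --namespace AWS/Kinesis \
  --metric-name IncomingRecords \
  --dimensions Name=StreamName,Value=enriched-bad \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:pipeline-alerts
```

With `--treat-missing-data notBreaching`, quiet periods on the bad stream (no datapoints at all) do not trip the alarm.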


#4

Hey @mbondarenko,

Another way to avoid sending good data to the bad stream due to network issues is to use a cache TTL, so that failed lookups expire and are retried instead of being cached forever. It has been available in the RT pipeline since R93.
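For reference, the TTL is set via the `cacheTtl` field (in seconds) of the Iglu resolver configuration, available from resolver-config schema 1-0-2 onwards. A minimal sketch — the repository list here is a placeholder and should match your own setup:

```json
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-2",
  "data": {
    "cacheSize": 500,
    "cacheTtl": 600,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": { "uri": "http://iglucentral.com" }
        }
      }
    ]
  }
}
```

With `cacheTtl` set to 600, a schema cached as "bad" during a transient outage would be re-fetched after at most ten minutes rather than requiring a restart.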


#5

Thanks for confirming! @mike this is pretty much the setup we have (CloudWatch alarms on the Kinesis streams), which is why we caught it early, but we did not restart for a good 90 minutes because we thought it would recover automatically. That was wishful thinking, and we had to reprocess those 90 minutes of data as a result.

@anton Thank you for the tip! We do need to upgrade to take advantage of that feature, and we will be upgrading soon as a result of this glitch.