Enrich schema resolver did not restart


#1

We had a glitch on AWS for around 15 to 20 seconds that caused DNS lookups to stop working. Enrich was unable to write to Kinesis during that time, and after a short while the Kinesis client restarted; however, it seems that the Enrich schema resolver never recovered from the glitch. So for the next little while we saw good data going to the bad stream with messages like:

"errors": [
{
"level": "error",
"message": "error: Could not find schema with key iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0 in any repository, tried:\n    level: \"error\"\n    repositories: [\"TGAMIgluRepo [HTTP]\",\"Iglu Client Embedded [embedded]\",\"IgluCentral [HTTP]\"]\n"
},{
"level": "error",
"message": "error: Unknown host issue fetching iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0 in HTTP Iglu repository IgluCentral: iglucentral.com\n    level: \"error\"\n"
}
]

Has anyone experienced this before? How can we prevent it from happening again?
If we were not monitoring the bad stream constantly, we could have had hours of good data go missing.


#2

This may have been caused by the enricher caching those schemas as ‘bad’ when they didn’t resolve during the brief DNS outage. Once DNS lookups had recovered, these Iglu lookups would still have been hitting the cached failures rather than retrying the repositories. The easiest way to evict this cache is to restart the enrichment process.

If you haven’t already, I’d also recommend setting CloudWatch alarms on your bad Kinesis stream (both for excessive traffic and for no traffic), which would help alert you to this issue in the future.
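
As a rough illustration of those two alarms, here is a minimal Python sketch. The helper names (`bad_stream_alarms`, `create_alarms`), the stream name, the threshold and the 5-minute period are all assumptions for the example, not values from this thread. It relies on the standard `AWS/Kinesis` `IncomingRecords` metric and on boto3's `put_metric_alarm`; you would still need to attach an alarm action (for example an SNS topic) to actually get notified.

```python
def bad_stream_alarms(stream_name, high_threshold, period_seconds=300):
    """Build CloudWatch alarm definitions for a bad stream (hypothetical helper).

    Returns two parameter dicts: one that fires when the stream receives more
    records than expected, one that fires when it receives none at all.
    """
    common = {
        "Namespace": "AWS/Kinesis",
        "MetricName": "IncomingRecords",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
    }
    return [
        {
            **common,
            "AlarmName": f"{stream_name}-excessive-traffic",
            "Threshold": float(high_threshold),
            "ComparisonOperator": "GreaterThanThreshold",
            # Missing datapoints mean no records, which is fine for this alarm.
            "TreatMissingData": "notBreaching",
        },
        {
            **common,
            "AlarmName": f"{stream_name}-no-traffic",
            "Threshold": 0.0,
            "ComparisonOperator": "LessThanOrEqualToThreshold",
            # Missing datapoints are exactly what this alarm should catch.
            "TreatMissingData": "breaching",
        },
    ]


def create_alarms(alarms):
    """Send the definitions to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch")
    for alarm in alarms:
        cloudwatch.put_metric_alarm(**alarm)
```

For example, `create_alarms(bad_stream_alarms("enriched-bad", high_threshold=1000))` would register both alarms for a stream called `enriched-bad`; pick the threshold from your normal bad-row volume.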


#4

Hey @mbondarenko,

Another way to avoid sending data to bad due to network issues is to use a cache TTL. It has been available in the real-time pipeline since R93.
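
To show why a TTL helps here, below is a toy Python sketch of a lookup cache. It is not the Iglu client's actual implementation; the class `TtlSchemaCache`, its parameters and the injectable clock are all made up for illustration. The point is only that a failed lookup stored without expiry is served forever, while one stored with a TTL is retried once the TTL has passed.

```python
import time


class TtlSchemaCache:
    """Toy schema cache: every entry, including a failed lookup, is kept
    for ttl_seconds. A ttl_seconds of None means entries never expire."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # callable: key -> schema, raises OSError on failure
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the behaviour can be tested
        self._entries = {}         # key -> (stored_at, schema or None)

    def lookup(self, key):
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None:
            stored_at, value = cached
            if self._ttl is None or now - stored_at < self._ttl:
                return value       # still fresh, even if it is a cached failure
        try:
            value = self._fetch(key)
        except OSError:            # covers DNS errors such as socket.gaierror
            value = None           # remember the failure as a "not found"
        self._entries[key] = (now, value)
        return value
```

With `ttl_seconds=None`, a schema that failed to resolve during the outage stays "not found" until the process restarts, which matches the behaviour described in this thread. With a TTL of, say, 60 seconds, the same schema is fetched again after a minute and good events stop landing in the bad stream.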