Understanding schema validation and caching in Snowplow

Recently, some Snowplow users and customers have reported unexpected behaviors around schema validation in Snowplow. This thread is a brief explanation of how schema validation and caching work, to help explain those behaviors.

Schema validation in Snowplow

The components which perform schema validation are:

  1. Hadoop Enrich - which validates unstructured events and custom contexts. Derived contexts which are added to the event by Hadoop Enrich itself (such as with the new API Request Enrichment) are not currently validated by Hadoop Enrich
  2. Hadoop Shred - which validates unstructured events, custom contexts and derived contexts prior to loading into Redshift. Very little fails validation here - typically only derived contexts added by Hadoop Enrich, or events in the very rare situation where a schema is changed between the Hadoop Enrich and Hadoop Shred steps running
  3. Stream Enrich - which validates unstructured events and custom contexts. Like Hadoop Enrich, derived contexts which are added to the event by Stream Enrich itself are not currently validated by Stream Enrich
  4. Snowplow Mini - as Snowplow Mini uses Stream Enrich under the hood, the schema validation behavior is the same

The exact specifics of schema validation in Snowplow are out of scope of this guide; we’ll post a separate guide on this in the future.

Schema caching in Snowplow

All four of the components above cache the schemas that they retrieve from Iglu registries.

Remember that a Snowplow event stream can consist of many millions of entities (unstructured events and custom contexts) which must all be validated; without schema caching Snowplow would effectively be launching a denial of service attack against the specified Iglu registries.

Schema caching in Snowplow uses in-memory LRU (Least Recently Used) caches, which evict the least recently used schemas in favor of schemas that are being more actively referenced. This prevents the cache from growing without bound.
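
As an aside, the sketch below shows the basic LRU idea in Scala (the language Snowplow's enrichment components are written in). It is purely illustrative rather than the actual Iglu client code - the com.acme schema keys in the demo are made up - but the bound it enforces corresponds to the cacheSize field you will see in the resolver configuration later in this thread.

import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// A minimal LRU cache sketch - illustrative only, not the Iglu client code.
// accessOrder = true means get() marks an entry as recently used;
// removeEldestEntry evicts the least recently used entry once maxSize is exceeded.
class LruCache[K, V](maxSize: Int) {
  private val underlying = new JLinkedHashMap[K, V](16, 0.75f, true) {
    override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
      size() > maxSize
  }

  def get(key: K): Option[V] = Option(underlying.get(key))
  def put(key: K, value: V): Unit = { underlying.put(key, value); () }
}

object LruCacheDemo extends App {
  val cache = new LruCache[String, String](maxSize = 2)
  cache.put("iglu:com.acme/event_a/jsonschema/1-0-0", "{...}")
  cache.put("iglu:com.acme/event_b/jsonschema/1-0-0", "{...}")
  cache.get("iglu:com.acme/event_a/jsonschema/1-0-0")            // event_a is now the most recently used
  cache.put("iglu:com.acme/event_c/jsonschema/1-0-0", "{...}")   // evicts event_b, the least recently used
  assert(cache.get("iglu:com.acme/event_b/jsonschema/1-0-0").isEmpty)
}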

Understanding cache scope and lifetime

It’s important to understand the scope and lifetime of the schema caches. These vary by Snowplow component:

Hadoop Enrich & Hadoop Shred

  • There is a cache for each Hadoop worker node - not one cache shared between nodes
  • Although they both run on the same EMR cluster, Hadoop Enrich and Hadoop Shred have independent caches
  • The caches will live for as long as that EMR jobflow step is running - e.g. when the Hadoop Enrich jobflow step completes, the cache is lost

Stream Enrich

  • There is a cache for each instance of the Stream Enrich app - and we recommend running one app per server, so there will effectively be one cache per server running Stream Enrich
  • The cache will live as long as that Stream Enrich app instance is not terminated and restarted (e.g. by a server reboot) - the LRU algorithm means that the cache can happily go on adding and evicting values for many months or years

Snowplow Mini

  • Under the hood a Snowplow Mini instance has a single Stream Enrich app running, so the same rules apply

Where cached schemas can cause problems

In theory Iglu schemas should be immutable, but there are two relatively common scenarios where caching schemas can cause problems (a sketch after this list illustrates the mechanism):

  1. Late-added schemas: if events referencing a schema arrive before the schema has been uploaded to the Iglu registry, then Snowplow will cache that schema as unavailable for the lifetime of that cache
  2. Patched schemas: sometimes a schema already uploaded to Iglu is found to be incorrect and is therefore patched. This breaks the immutability guarantee around schemas in Iglu, and any Snowplow schema cache will continue to hold the old version of the schema for the lifetime of that cache
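
To make the mechanism behind both scenarios concrete, here is a rough sketch of a caching resolver. The names and types are made up for illustration and this is not the actual Iglu client code; the point is simply that the result of a lookup is memoized whether it succeeded or failed, so anything that changes in the registry afterwards is invisible until the entry is evicted or the process restarts.

import scala.collection.mutable

// Illustrative types only - not the real Iglu client API
final case class SchemaKey(vendor: String, name: String, format: String, version: String)

class CachingResolver(lookup: SchemaKey => Either[String, String]) {
  // Left = "could not find schema" error, Right = raw schema JSON.
  // Both outcomes are memoized, which is exactly what bites in the two scenarios above:
  //   1. a lookup that failed before the schema was uploaded stays failed
  //   2. a schema fetched before it was patched stays at the old version
  private val cache = mutable.Map.empty[SchemaKey, Either[String, String]]

  def resolve(key: SchemaKey): Either[String, String] =
    cache.getOrElseUpdate(key, lookup(key))
}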

Resolving problems with cached schemas

With Snowplow Mini and Stream Enrich, you will need to restart the relevant servers to clear the caches.

With the Snowplow batch pipeline, because the caches are short-lived, things are more straightforward: the next batch pipeline run will re-build the caches from scratch, picking up the latest schemas.

Recovering events which failed validation before the schema caching problem was resolved is out of scope of this guide; we’ll post a separate guide on this in the future.


Hi Alex,

We upgraded our Stream Enrich to the latest version, and it stopped working because of schema validation failures. The error message says the schema could not be found in any Iglu repository.

We have published our schemas here: http://b-iglu.liadm.com/ but the message is:

Could not find schema with key iglu:com.retentiongrid/content_details/jsonschema/1-0-0 in any repository

But it’s clearly available.

This is our resolver conf:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [
          "com.snowplowanalytics"
        ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },{
        "name": "Iglu LiveIntent",
        "priority": 5,
        "vendorPrefixes": [
          "com.retentiongrid",
          "com.liveintent"
        ],
        "connection": {
          "http": {
            "uri": "http://b-iglu.liadm.com"
          }
        }
      }
    ]
  }
}

Cheers, Chris

You are right, the file is available: http://b-iglu.liadm.com/schemas/com.retentiongrid/content_details/jsonschema/1-0-0
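
As an aside, that URL is derived mechanically from the schema key and the registry root URI in your resolver config - roughly as in this sketch, which is illustrative only and not the actual Iglu client code:

object IgluPaths {
  // Rough sketch of how a schema key maps onto a registry URL - illustrative only.
  def schemaUrl(registryRoot: String, vendor: String, name: String, format: String, version: String): String =
    s"$registryRoot/schemas/$vendor/$name/$format/$version"

  def main(args: Array[String]): Unit = {
    // prints http://b-iglu.liadm.com/schemas/com.retentiongrid/content_details/jsonschema/1-0-0
    println(schemaUrl("http://b-iglu.liadm.com", "com.retentiongrid", "content_details", "jsonschema", "1-0-0"))
  }
}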

Have you tried:

  • Bouncing the Stream Enrich boxes
  • Confirming the URI is accessible from the Stream Enrich boxes

Yes, the schemas are publicly available and I can fetch them from the enricher boxes. What do you mean by:

Bouncing the Stream Enrich boxes

I mean restarting the box (in case your Stream Enrich cached the schema as not existing before you uploaded it)?

I restarted the service, which did not seem to have any effect. However, after fully stopping and starting the enrichment process, the cache was emptied and I saw those error messages pop up. Unsuccessful lookups should maybe not be cached 🙂

If we don’t cache unsuccessful lookups, then a single missing schema will slow enrichment to a crawl and launch a DDoS on every Iglu registry in your resolver (because every event will have to make HTTP requests to every registry looking for the schema)…

Killing the instance did the trick, thx.

Ah great! Thanks for letting us know… We are thinking about putting a TTL on cache entries so that a missing schema is re-checked in the registries every hour or so…
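
For the curious, here is a rough sketch of what that could look like - purely illustrative, not committed Iglu client code: successful lookups stay cached as they are today, while a "not found" result carries a timestamp and is retried against the registries once its TTL has elapsed.

import scala.collection.mutable

// Illustrative only - the real Iglu client types and behavior may differ
final case class SchemaKey(vendor: String, name: String, format: String, version: String)

class TtlMissResolver(
  lookup: SchemaKey => Either[String, String],   // the expensive HTTP lookups across registries
  missTtlMillis: Long = 60 * 60 * 1000L          // re-check missing schemas every hour or so
) {
  private val found  = mutable.Map.empty[SchemaKey, String]   // schema JSON, cached indefinitely
  private val missed = mutable.Map.empty[SchemaKey, Long]     // timestamp of the failed lookup

  def resolve(key: SchemaKey): Either[String, String] = {
    val now = System.currentTimeMillis()
    found.get(key) match {
      case Some(schema) => Right(schema)
      case None =>
        missed.get(key) match {
          case Some(at) if now - at < missTtlMillis =>
            Left(s"Could not find schema with key $key in any repository (cached miss)")
          case _ =>
            lookup(key) match {
              case Right(schema) =>
                found.update(key, schema)
                missed.remove(key)
                Right(schema)
              case miss =>
                missed.update(key, now)
                miss
            }
        }
    }
  }
}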