Composition of iglu event properties


#1

tl;dr: I am trying to use “$ref” and “allOf” in my schema (it passes igluctl lint), but I am not seeing the jsonpaths or SQL output I expect when running igluctl static generate.

My use case is that I have 40 events, and growing, for a system that we are migrating to Snowplow. Most of these events have over 30 properties, and most of those are required and common across subsets of events.

Given all that, I am interested in cleaning up my event specs, using the notion of composition. I was looking at https://github.com/snowplow/iglu/blob/master/2-repositories/scala-repo-server/src/main/resources/valid-schema.json as a starting point, and it makes use of allOf and $ref to assemble the schema.

I feel like I am missing something or doing something wrong. Below is an example schema file that highlights what I am doing and the issue I am seeing.

$ igluctl --version
igluctl 0.2.0
$ cat schemas/co.ga/test_event/jsonschema/1-0-0
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "self": {"vendor": "co.ga", "name": "test_event", "format": "jsonschema", "version": "1-0-0"},
  "type": "object",

  "allOf": [
    {"$ref": "#/definitions/global_properties"},
    {"$ref": "#/definitions/item_properties"},
    {
      "properties": {
        "foobar": {
            "type": "boolean"
        }
      }
    }
  ],

  "definitions": {

    "sha1sum": {"type": "string", "minLength": 40, "maxLength": 40, "pattern": "^[0-9a-f]+$"}, 

    "global_properties": {
      "properties": {
        "widget": {
          "type": ["string", "null"],
          "maxLength": 1024
        },
        "widget_hash": {
          "oneOf": [
            {"type": "null"},
            {"$ref": "#/definitions/sha1sum"}
          ]
        }
      }
    },

    "item_properties": {
      "properties": {
        "item_id": {
          "type": "integer"
        }
      }
    }

  },

  "additionalProperties": true
}
$ igluctl lint schemas/co.ga/test_event/jsonschema/1-0-0
SUCCESS: Schema [/Users/jon/Projects/github/generalassembly/snowplow-events/schemas/co.ga/test_event/jsonschema/1-0-0] is successfully validated
TOTAL: 1 Schemas were successfully validated
TOTAL: 0 invalid Schemas were encountered
TOTAL: 0 errors were encountered
$ igluctl static generate --with-json-paths schemas/co.ga/test_event/jsonschema/1-0-0 --force
File [/Users/jon/Projects/github/generalassembly/snowplow-events/./sql/co.ga/test_event_1.sql] was overridden successfully (no change)!
File [/Users/jon/Projects/github/generalassembly/snowplow-events/./jsonpaths/co.ga/test_event_1.json] was overridden successfully (no change)!
$ cat ./jsonpaths/co.ga/test_event_1.json
{
    "jsonpaths": [
        "$.schema.vendor",
        "$.schema.name",
        "$.schema.format",
        "$.schema.version",
        "$.hierarchy.rootId",
        "$.hierarchy.rootTstamp",
        "$.hierarchy.refRoot",
        "$.hierarchy.refTree",
        "$.hierarchy.refParent"
    ]
}

#2

@jon-ga,

I don’t think your JSON schema is in the right format, even though it passes validation. Check out this wiki page, which explains this further and links to other documents, in particular the explanation of self-describing JSON.

In summary, the self-describing schema would normally follow the pattern:

{
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "description": "Schema description goes here",
    "self": {
        "vendor": "your_com.acme_company",
        "name": "event_name_goes_here",
        "format": "jsonschema",
        "version": "1-0-0"
    },

    "type": "object",
    "properties": {
        "property1": {
            "type": "value_type_goes_here",
            {{ other descriptors }}
        },
        "property2": {
            "type": "value_type_goes_here",
            {{ other descriptors }}
        },
        . . . 
    },
    "minProperties":1,
    "required": ["property1"],
    "additionalProperties": {{ false / true }}
}

The properties listed under the top-level "properties" key will be added to the jsonpaths file and the DDL as data parameters (placeholders for the actual event/context properties you send to Snowplow).
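
To illustrate why the composed schema above yields only the envelope paths, here is a rough Python sketch. It is not igluctl's actual code: the function name, the "$.data." prefix, and the alphabetical ordering are assumptions made for illustration. The only thing taken from this thread is the list of envelope paths.

```python
# Envelope paths copied from the jsonpaths output shown earlier in the thread.
ENVELOPE_PATHS = [
    "$.schema.vendor",
    "$.schema.name",
    "$.schema.format",
    "$.schema.version",
    "$.hierarchy.rootId",
    "$.hierarchy.rootTstamp",
    "$.hierarchy.refRoot",
    "$.hierarchy.refTree",
    "$.hierarchy.refParent",
]


def sketch_jsonpaths(schema):
    """Hypothetical generator: one path per TOP-LEVEL property only.

    The "$.data." prefix and the sorted order are assumptions.
    """
    props = schema.get("properties", {})
    return ENVELOPE_PATHS + ["$.data." + name for name in sorted(props)]


# A flat schema: properties are listed directly at the top level.
flat = {
    "type": "object",
    "properties": {"item_id": {"type": "integer"}, "foobar": {"type": "boolean"}},
}

# A composed schema: the same kind of property is hidden inside allOf.
composed = {
    "type": "object",
    "allOf": [{"properties": {"foobar": {"type": "boolean"}}}],
}

print(len(sketch_jsonpaths(flat)))      # 11: envelope plus two data paths
print(len(sketch_jsonpaths(composed)))  # 9: envelope only
```

A generator that looks only at the top-level "properties" key never sees anything nested under allOf, which matches the nine-line output above.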

You can find lots of examples of JSON schemas on Iglu Central.

For comparison, here’s the JSON schema that describes self-describing schemas in Snowplow:

{
	"$schema" : "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
	"description": "Meta-schema for self-describing JSON schema",
	"self": {
		"vendor": "com.snowplowanalytics.self-desc",
		"name": "schema",
		"format": "jsonschema",
		"version": "1-0-0"
	},

	"allOf": [
		{
			"properties": {
				"self": {
					"type": "object",
					"properties": {
						"vendor": {
							"type": "string",
							"pattern": "^[a-zA-Z0-9-_.]+$"
						},
						"name": {
							"type": "string",
							"pattern": "^[a-zA-Z0-9-_]+$"
						},
						"format": {
							"type": "string",
							"pattern": "^[a-zA-Z0-9-_]+$"
						},
						"version": {
							"type": "string",
							"pattern": "^[0-9]+-[0-9]+-[0-9]+$"
						}
					},
					"required": ["vendor", "name", "format", "version"],			
					"additionalProperties": false
				}
			},
			"required": ["self"]
		},

		{
			"$ref": "http://json-schema.org/draft-04/schema#"
		}
	]

}

Note that the above schema is not meant to be used to generate statics (jsonpaths and DDLs).


#3

Hi @ihor, thank you for your reply.

I understand that is how a basic schema is created, but what I am interested in is breaking up the document into more manageable pieces and assembling the event via composition.

What I’m wondering is whether, rather than listing ~40 properties, I can list common “groups” of properties. This is superior in my mind because, despite still having some copied code, it reduces the edit distance between schema files. This means that it is easier to diff the ~40 schemas quickly and see only the meaningful differences.


#6

Hi @jon-ga - we don’t currently support $ref for resolving references to property bundles and reducing copy-paste. It’s a nice idea though.
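
In the meantime, one possible workaround is to keep the composed files as the source of truth and flatten them into plain schemas before running igluctl. The Python sketch below is only an illustration under stated assumptions: the helper names (resolve_pointer, inline_refs, flatten) are made up for this example and are not part of igluctl. It handles only local "#/..." references, ignores JSON Pointer escaping, and would loop forever on cyclic references. Whether igluctl can then produce a sensible column for a "oneOf" property such as widget_hash is not something this sketch checks.

```python
def resolve_pointer(root, ref):
    """Resolve a local reference such as '#/definitions/sha1sum'."""
    if not ref.startswith("#/"):
        raise ValueError("only local references are handled: " + ref)
    node = root
    for part in ref[2:].split("/"):
        node = node[part]
    return node


def inline_refs(node, root):
    """Return a copy of node with every local $ref replaced by its target."""
    if isinstance(node, dict):
        if "$ref" in node:
            return inline_refs(resolve_pointer(root, node["$ref"]), root)
        return {key: inline_refs(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [inline_refs(item, root) for item in node]
    return node


def flatten(schema):
    """Merge 'properties' and 'required' from each allOf branch into the top level."""
    out = {k: v for k, v in schema.items() if k not in ("allOf", "definitions")}
    properties = dict(inline_refs(out.get("properties", {}), schema))
    required = list(out.get("required", []))
    for branch in schema.get("allOf", []):
        branch = inline_refs(branch, schema)
        properties.update(branch.get("properties", {}))
        required += [r for r in branch.get("required", []) if r not in required]
    out["properties"] = properties
    if required:
        out["required"] = required
    return out


# The body of the test_event schema from the first post ("self" etc. omitted).
schema = {
    "type": "object",
    "allOf": [
        {"$ref": "#/definitions/global_properties"},
        {"$ref": "#/definitions/item_properties"},
        {"properties": {"foobar": {"type": "boolean"}}},
    ],
    "definitions": {
        "sha1sum": {"type": "string", "minLength": 40, "maxLength": 40,
                    "pattern": "^[0-9a-f]+$"},
        "global_properties": {"properties": {
            "widget": {"type": ["string", "null"], "maxLength": 1024},
            "widget_hash": {"oneOf": [{"type": "null"},
                                      {"$ref": "#/definitions/sha1sum"}]},
        }},
        "item_properties": {"properties": {"item_id": {"type": "integer"}}},
    },
    "additionalProperties": True,
}

flat = flatten(schema)
print(sorted(flat["properties"]))  # ['foobar', 'item_id', 'widget', 'widget_hash']
```

The composed files stay easy to diff, and the flattened output has the top-level "properties" layout shown earlier in the thread.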

You wrote: “My use case is that I have 40 events, and growing, for a system that we are migrating to use Snowplow. Most of these events have over 30 properties, and most of those are required and common across subsets of events.”

Are you sure that those shared properties are intrinsic properties of your individual events? Maybe they would be better treated as independent entities and attached to the events as “custom contexts” - which is all supported natively by Snowplow.
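
As a rough sketch of what that could look like, the Python snippet below builds an event and a context as separate self-describing JSONs (a "schema" URI plus a "data" object). The co.ga/global_properties schema is hypothetical: it assumes the shared property group has been published as its own schema. The sample values are invented, and how the payload is actually sent depends on the tracker you use, which is not shown here.

```python
import json


def self_describing(schema_uri, data):
    """Wrap data in the self-describing JSON envelope: a schema URI plus the data."""
    return {"schema": schema_uri, "data": data}


# The event schema from this thread, plus a HYPOTHETICAL schema for the shared group.
TEST_EVENT = "iglu:co.ga/test_event/jsonschema/1-0-0"
GLOBAL_PROPERTIES = "iglu:co.ga/global_properties/jsonschema/1-0-0"

# The event carries only the properties that are specific to it.
event = self_describing(TEST_EVENT, {"foobar": True, "item_id": 42})

# The shared properties travel alongside the event as a separate entity.
contexts = [
    self_describing(GLOBAL_PROPERTIES, {"widget": "sidebar", "widget_hash": None}),
]

print(json.dumps({"event": event, "contexts": contexts}, indent=2))
```

With this arrangement each of the ~40 event schemas lists only its own properties, and the shared group is defined once in its own schema.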