Introductory guide to creating your own self-describing events and contexts [tutorial]



Snowplow supports a large number of events “out of the box” (first class citizens), many of which are common in a web or mobile analytics context. If you wish to track an event that Snowplow does not recognise as a first class citizen, you can track it using custom structured or self-describing events. In this guide, we will talk you through how to track your own self-describing events.

Before you can send your own event and context types into Snowplow (using the track unstructured events or track self-describing events and custom contexts features of Snowplow), you need to:

  1. Identify what and when you will track
  2. Add code for tracking
  3. Define a JSON Schema for each of the events and context types
  4. Upload those schemas to your Iglu schema registry
  5. Define and upload a corresponding JSON Paths file
  6. Create a corresponding Redshift table definition and create this table in your Redshift cluster

Let’s get started.

1. Identify what and when you will track

First, you need to identify which events you want to track and when they are fired. In this guide we will review a simple example: we have several images on our web page, and each image is resized when the onmouseover and onmouseout actions occur. Below is the corresponding HTML code:

....
<p><img id="First"  onmouseover="smallImg(this)" onmouseout="normalImg(this)" border="1" height = "100px" width = "100px" src="pic1.jpg"></p>
<p><img id="Second" onmouseover="smallImg(this)" onmouseout="normalImg(this)" border="1" height = "100px" width = "100px" src="pic2.jpg"></p>
	
<script type="text/javascript">
	function smallImg(x) {
		x.style.height = "50px";
	}

	function normalImg(x) {
		x.style.height = "100px";
	}
</script>
....

As you can see, we have two different images and we want to track both actions on each image. Now we need to modify our code and add the necessary tracking calls.

2. Add code for tracking

For this, we will use the trackSelfDescribingEvent method (the new name for the deprecated trackUnstructEvent method). You can find the technical documentation here. Below is the updated code. The definition of the normalImg function is skipped in all the following excerpts, as it has the same functionality:

...	
<script type="text/javascript">
	function smallImg(x) {
		x.style.height = "50px";
		// track our event
		var data = {}
		// the vendor must match the "vendor" in the schema's "self" section (com.example-company)
		window.snowplow('trackSelfDescribingEvent', {'schema': 'iglu:com.example-company/onmouse_img/jsonschema/1-0-0', 'data': data});
	}
    ...
</script>
...

Now let’s flesh out the fields which will be sent when the events are fired:

...	
<script type="text/javascript">
	function smallImg(x) {
		x.style.height = "50px";
		// track our event
		var data = {
			"imgId": x.id,
			"imgSrc": x.src,
			'imgEvent': "smallImg"
		}
		// the vendor must match the "vendor" in the schema's "self" section (com.example-company)
		window.snowplow('trackSelfDescribingEvent', {'schema': 'iglu:com.example-company/onmouse_img/jsonschema/1-0-0', 'data': data});
	}
    ...
</script>
...
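Since normalImg is skipped in the excerpts above, here is a minimal sketch of how both handlers could share one tracking helper. The helper names buildImgEvent and trackImgEvent are hypothetical (they are not part of the Snowplow tracker), and the schema URI assumes the vendor com.example-company used in the schema's self section later in this guide.

```javascript
// Schema URI, assumed to match the "self" section of the schema defined in step 3.
var IMG_EVENT_SCHEMA = 'iglu:com.example-company/onmouse_img/jsonschema/1-0-0';

// Hypothetical helper: builds the self-describing event for an image element.
function buildImgEvent(img, eventName) {
	return {
		schema: IMG_EVENT_SCHEMA,
		data: {
			imgId: img.id,
			imgSrc: img.src,
			imgEvent: eventName
		}
	};
}

// Hypothetical helper: sends the event through the Snowplow JavaScript tracker
// when it is loaded on the page, and returns the payload either way.
function trackImgEvent(img, eventName) {
	var event = buildImgEvent(img, eventName);
	if (typeof window !== 'undefined' && typeof window.snowplow === 'function') {
		window.snowplow('trackSelfDescribingEvent', event);
	}
	return event;
}

function smallImg(x) {
	x.style.height = "50px";
	trackImgEvent(x, "smallImg");
}

function normalImg(x) {
	x.style.height = "100px";
	trackImgEvent(x, "normalImg");
}
```

Keeping the payload construction in one place means the two handlers cannot drift apart in the fields they send.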

Our tracking code is ready. But in order to start sending the new events into Snowplow, we first need to define a new JSON Schema for the events. The schema will be used for data validation during the enrichment step of the EMR process.

3. Define a JSON schema for each of the events/context types

Below you can see the schema for our example:

{
	"$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
	"description": "onmouse_img example",
	"self": {
		"vendor": "com.example-company",
		"name": "onmouse_img",
		"format": "jsonschema",
		"version": "1-0-0"
	},
	"type": "object",
	"properties": {
		"imgId": {
			"type": "string"
		},
		"imgSrc": {
			"type": "string"
		},
		"imgEvent": {
			"enum": ["smallImg", "normalImg"]
		}
	},
	"required": ["imgId", "imgSrc", "imgEvent"],
	"additionalProperties": false
}

The parameters in the self section will make up the URI of the JSON Schema within the Iglu schema registry in the form: iglu:vendor/name/format/version.

iglu:com.example-company/onmouse_img/jsonschema/1-0-0
---- ------------------- ----------- ---------- -----
 |            |               |          |        |- schema version (model-revision-addition)
 |            |               |          |- schema format
 |            |               |- event name
 |            |- vendor of the event
 |- schema methodology
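To make the URI structure concrete, here is a small sketch that splits an Iglu URI into the parts described above. The function name parseIgluUri is our own and is not part of any Snowplow library; it only assumes the iglu:vendor/name/format/version form with a model-revision-addition version.

```javascript
// Splits an Iglu schema URI of the form iglu:vendor/name/format/version
// into its parts. Returns null when the URI does not have that shape.
function parseIgluUri(uri) {
	var match = /^iglu:([^\/]+)\/([^\/]+)\/([^\/]+)\/(\d+)-(\d+)-(\d+)$/.exec(uri);
	if (match === null) {
		return null;
	}
	return {
		vendor: match[1],
		name: match[2],
		format: match[3],
		version: {
			model: Number(match[4]),
			revision: Number(match[5]),
			addition: Number(match[6])
		}
	};
}

var parts = parseIgluUri('iglu:com.example-company/onmouse_img/jsonschema/1-0-0');
console.log(parts.vendor, parts.name, parts.format, parts.version.model);
// prints: com.example-company onmouse_img jsonschema 1
```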

The parameters in the properties section define the possible data fields and their types. You can also define a maximum length for strings and minimum/maximum values for numbers.
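As an illustration of such constraints, the sketch below shows a properties fragment written as a JavaScript object. The keywords maxLength, minimum and maximum are standard JSON Schema; the imgWidth field and the specific limits are hypothetical and are not part of the onmouse_img schema in this guide.

```javascript
// Hypothetical properties fragment, shown only to illustrate constraint keywords.
var constrainedProperties = {
	imgId: { type: "string", maxLength: 64 },                   // at most 64 characters
	imgWidth: { type: "integer", minimum: 0, maximum: 10000 }   // hypothetical numeric field
};

// Print the fragment as it would appear inside the schema's "properties" section.
console.log(JSON.stringify(constrainedProperties, null, 2));
```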

The parameters in the required section define the fields which must be present in your events. If one of the required fields is missing, the event will fail validation and end up in the enriched/bad folder.

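additionalProperties is a boolean value that controls whether events may contain fields not listed in properties. In our schema it is set to false, so an event carrying any extra field will fail validation. If it were set to true, additional fields could be sent with the events, but these fields would not be loaded in the next steps.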
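To show how these three sections interact, here is a simplified, hand-rolled check covering only the keywords used in our schema (type string, enum, required, additionalProperties). It is a sketch for illustration, not the validator that Snowplow runs during enrichment, and the function name checkEventData is our own.

```javascript
// The relevant parts of the onmouse_img schema from this guide.
var schema = {
	properties: {
		imgId: { type: "string" },
		imgSrc: { type: "string" },
		imgEvent: { enum: ["smallImg", "normalImg"] }
	},
	required: ["imgId", "imgSrc", "imgEvent"],
	additionalProperties: false
};

// Hypothetical, simplified check. Returns a list of error messages;
// an empty list means the data passes.
function checkEventData(schema, data) {
	var errors = [];
	var has = Object.prototype.hasOwnProperty;
	// every field named in "required" must be present
	schema.required.forEach(function (field) {
		if (!has.call(data, field)) {
			errors.push("missing required field: " + field);
		}
	});
	Object.keys(data).forEach(function (field) {
		if (!has.call(schema.properties, field)) {
			// unknown fields are rejected only when additionalProperties is false
			if (schema.additionalProperties === false) {
				errors.push("unexpected field: " + field);
			}
			return;
		}
		var rule = schema.properties[field];
		if (rule.type === "string" && typeof data[field] !== "string") {
			errors.push(field + " must be a string");
		}
		if (rule.enum && rule.enum.indexOf(data[field]) === -1) {
			errors.push(field + " must be one of: " + rule.enum.join(", "));
		}
	});
	return errors;
}

// A valid event, then one with a missing field, a bad enum value and an extra field.
console.log(checkEventData(schema, { imgId: "First", imgSrc: "pic1.jpg", imgEvent: "smallImg" }));
console.log(checkEventData(schema, { imgId: "First", imgEvent: "hover", extra: 1 }));
```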

Once the schema is ready, we need to make sure that it has a valid format. You can validate it using igluctl, which facilitates most of the common schema-related tasks. More information on how to download and use igluctl can be found here. The lint command for the example:

$ ./igluctl lint ~/schemas/com.example-company/onmouse_img/jsonschema/1-0-0
SUCCESS: Schema [/home/username/schemas/com.example-company/onmouse_img/jsonschema/1-0-0] is successfully validated
TOTAL: 1 Schemas were successfully validated
TOTAL: 0 invalid Schemas were encountered
TOTAL: 0 errors were encountered

4. Upload those schemas to your Iglu schema registry

The schema must be available during the enrichment process. For this, you should upload the schema to your own Iglu schema registry.

We will use a static registry hosted on Amazon S3, as it’s the simplest and fastest approach. First, we created a new bucket for this purpose and set it up to host a static website as per these instructions - you can do it through the web interface or the AWS CLI.

Next, we uploaded our schema into the created bucket - you can do it through the web interface, the AWS CLI or igluctl’s own s3cp command.

Once your schemas are uploaded you need to modify your resolver.json configuration file by adding your own Iglu schema registry so that the EmrEtlRunner knows where to find the JSON Schemas for your custom events/contexts. Below you can find the resolver.json file for our example:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },
     {
       "name": "Your own Iglu",
       "priority": 1,
       "vendorPrefixes": [ "com.example-company" ],
       "connection": {
         "http": {
           "uri": "http://com-example-company-iglu.s3-website-us-west-2.amazonaws.com/schemas"
         }
       }
     }
    ]
  }
}
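The sketch below illustrates the idea behind priority and vendorPrefixes: a repository whose prefix matches the schema's vendor is tried first, and among the rest a lower priority number comes first. This is our simplified reading of the two settings, not the resolver's actual code, and the function names are hypothetical.

```javascript
// Repositories as configured in the resolver.json above (connection details omitted).
var repositories = [
	{ name: "Iglu Central", priority: 0, vendorPrefixes: ["com.snowplowanalytics"] },
	{ name: "Your own Iglu", priority: 1, vendorPrefixes: ["com.example-company"] }
];

// Hypothetical helper: true when the vendor starts with one of the repository's prefixes.
function matchesVendor(repo, vendor) {
	return repo.vendorPrefixes.some(function (prefix) {
		return vendor.indexOf(prefix) === 0;
	});
}

// Simplified lookup order (an assumption, not the resolver's implementation):
// prefix-matching repositories first, then ascending priority number.
function lookupOrder(repositories, vendor) {
	return repositories.slice().sort(function (a, b) {
		var aMatch = matchesVendor(a, vendor) ? 0 : 1;
		var bMatch = matchesVendor(b, vendor) ? 0 : 1;
		return (aMatch - bMatch) || (a.priority - b.priority);
	}).map(function (repo) {
		return repo.name;
	});
}

console.log(lookupOrder(repositories, "com.example-company"));
console.log(lookupOrder(repositories, "com.snowplowanalytics.snowplow"));
```

Under this reading, our onmouse_img schema is looked up in "Your own Iglu" first, while Snowplow's own schemas are still resolved from Iglu Central.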

Now, the enrichment part is configured. You are able to track and validate your custom events/contexts.

If you are running the Snowplow batch flow with Amazon Redshift, then you should create and upload the corresponding JSON Paths file and SQL table definition file to be able to load these events into Redshift tables.

5. Define and upload a corresponding JSONPaths file

This can be done programmatically with igluctl. Below is the command for our example:

$ ./igluctl static generate ~/schemas/com.example-company/onmouse_img/jsonschema/1-0-0 --with-json-paths --output ~/
File [/home/username/sql/com.example-company/onmouse_img_1.sql] was written successfully!
File [/home/username/jsonpaths/com.example-company/onmouse_img_1.json] was written successfully!

A corresponding JSON Paths file and SQL table definition file will be generated in the appropriate folders. We used the --output option to create the folders in the home directory; by default they are created in the current directory. Here is the generated JSON Paths file for our example:

$ cat ~/jsonpaths/com.example-company/onmouse_img_1.json 
{
    "jsonpaths": [
        "$.schema.vendor",
        "$.schema.name",
        "$.schema.format",
        "$.schema.version",
        "$.hierarchy.rootId",
        "$.hierarchy.rootTstamp",
        "$.hierarchy.refRoot",
        "$.hierarchy.refTree",
        "$.hierarchy.refParent",
        "$.data.imgEvent",
        "$.data.imgId",
        "$.data.imgSrc"
    ]
}
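A JSON Paths file maps values to table columns by position: the first path fills the first column, the second path fills the second, and so on. The sketch below mimics that for simple dotted paths. The record is a hypothetical example of what one event could look like at load time (its values are placeholders, not real pipeline output), and resolvePath and toRow are our own helper names.

```javascript
// Hypothetical record for one onmouse_img event; all values are placeholders.
var record = {
	schema: { vendor: "com.example-company", name: "onmouse_img", format: "jsonschema", version: "1-0-0" },
	hierarchy: { rootId: "placeholder-event-id", rootTstamp: "placeholder-timestamp" },
	data: { imgEvent: "smallImg", imgId: "First", imgSrc: "pic1.jpg" }
};

// Resolves a simple dotted path such as "$.data.imgId"; returns null when absent.
function resolvePath(record, path) {
	return path.split(".").slice(1).reduce(function (node, key) {
		return (node !== null && typeof node === "object" && key in node) ? node[key] : null;
	}, record);
}

// Produces the row values in the same order as the paths are listed.
function toRow(record, jsonpaths) {
	return jsonpaths.map(function (path) {
		return resolvePath(record, path);
	});
}

console.log(toRow(record, ["$.schema.vendor", "$.data.imgEvent", "$.data.imgId", "$.data.imgSrc"]));
```

This is why the order of the paths in the file has to match the order of the columns in the table definition.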

Next, upload this JSONPaths file to a separate jsonpaths directory in your Iglu repository and modify your config.yml file by adding jsonpath_assets:

...
  s3:
    region: "eu-west-1"
    buckets:
      assets: "s3://snowplow-hosted-assets"
      jsonpath_assets: "s3://com-example-company-iglu/jsonpaths"
...

6. Create a corresponding Redshift table definition and create this table in your Redshift cluster

The corresponding table definition was created in the previous step. Below you can see the file in our case:

$ cat ~/sql/com.example-company/onmouse_img_1.sql 
-- AUTO-GENERATED BY igluctl DO NOT EDIT
-- Generator: igluctl 0.2.0
-- Generated: 2017-08-22 20:38

CREATE SCHEMA IF NOT EXISTS atomic;

CREATE TABLE IF NOT EXISTS atomic.com_example_company_onmouse_img_1 (
    "schema_vendor"  VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "schema_name"    VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "schema_format"  VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "schema_version" VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "root_id"        CHAR(36)      ENCODE RAW       NOT NULL,
    "root_tstamp"    TIMESTAMP     ENCODE LZO       NOT NULL,
    "ref_root"       VARCHAR(255)  ENCODE RUNLENGTH NOT NULL,
    "ref_tree"       VARCHAR(1500) ENCODE RUNLENGTH NOT NULL,
    "ref_parent"     VARCHAR(255)  ENCODE RUNLENGTH NOT NULL,
    "img_event"      VARCHAR(9)    ENCODE LZO       NOT NULL,
    "img_id"         VARCHAR(4096) ENCODE LZO       NOT NULL,
    "img_src"        VARCHAR(4096) ENCODE LZO       NOT NULL,
    FOREIGN KEY (root_id) REFERENCES atomic.events(event_id)
)
DISTSTYLE KEY
DISTKEY (root_id)
SORTKEY (root_tstamp);

COMMENT ON TABLE atomic.com_example_company_onmouse_img_1 IS 'iglu:com.example-company/onmouse_img/jsonschema/1-0-0';
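Comparing the schema with the generated files shows a naming pattern: camelCase fields become snake_case columns (imgEvent becomes img_event), and the table name combines the vendor, the event name and the model number of the version. The sketch below reproduces that pattern as we infer it from the output above; it is not igluctl's actual implementation, and the function names are our own.

```javascript
// camelCase property name -> snake_case column name (imgEvent -> img_event).
function toColumnName(property) {
	return property.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toLowerCase();
}

// vendor + name + model number -> table name, with "." and "-" replaced by "_".
function toTableName(vendor, name, version) {
	var model = version.split("-")[0];
	return (vendor + "_" + name).replace(/[.\-]/g, "_").toLowerCase() + "_" + model;
}

console.log(toColumnName("imgEvent"));
// prints: img_event
console.log("atomic." + toTableName("com.example-company", "onmouse_img", "1-0-0"));
// prints: atomic.com_example_company_onmouse_img_1
```

Because only the model number appears in the table name, versions 1-0-0 and 1-0-1 of a schema would map to the same table under this pattern.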

It is essential that any new tables you create are owned by the user that StorageLoader uses to load the data. Once you have created your new tables, you should assign ownership of them to your_storage_loader_user. The statement for our example:

ALTER TABLE atomic.com_example_company_onmouse_img_1 OWNER TO storageloader;

And that’s it! We are ready for our next Snowplow pipeline run.

7. Summary

Once you have gone through the above process, you can start sending data that conforms to the schema(s) you’ve created into Snowplow as self-describing (formerly unstructured) events/custom contexts.

As a next step, why not try coming up with a specific self-describing event or context that makes sense for your business, and try setting that up for Snowplow? We always recommend testing new schemas with a non-production Snowplow Mini instance first.

Let us know any questions or comments in the thread below!