Should I use trackSelfDescribingEvent or Custom contexts or neither?


#1

Hello,

I’m using Snowplow with Google Tag Manager. I have a tag triggered by user clicks, and it has the following information: ‘{{Click Classes}}’, ‘{{Click ID}}’, ‘{{Click Target}}’, ‘{{Click Text}}’, ‘{{Click URL}}’, {{Click Element}}.tagName

I don’t think I can put all this data into a struct event, which has 5 fields, of which category should probably be hard-coded to ‘UI’ and action to ‘click’. So I am considering either using trackSelfDescribingEvent or sticking with trackStructEvent and adding a custom context. I have a lot of questions about either approach.

If I use trackSelfDescribingEvent, can I still have those 5 fields from struct events? Will the values, after all the processing, end up in the same place as the 5 fields from trackStructEvent?

If I use trackStructEvent with a custom context, will the context be sticky and attached to future trackStructEvent calls? If not, why is it called a context? Would “additional event values” be a better name?

For either approach, I have to set up an Iglu repo with the JSON schema. Without it, my pipeline fails at the enrich step. How is the JSON schema used by the enrich step? It does not seem to create an additional column for each field, just one column with JSON that contains everything. Here is an example I saw; notice the JSON part, which looks as if the schema was not used for anything:

web 2018-08-08 22:35:42.306 2018-08-08 22:05:55.271 2018-08-08 22:05:55.060 unstruct 0413e9c3-eaf5-4e3f-9232-ccdb2ffca65e cf js-2.9.0 ssc-0.13.0-kinesis stream-enrich-0.18.0-common-0.34.0 63.64.97.98 969628012 c922b1db-4bf5-444e-8c7a-fdf0ec758650 1 0b1748db-6617-4349-8bca-19596fc44b6d https://v.snackv.com/SpcFAx3lh6 https v.snackv.com 443 /SpcFAx3lh6 {"schema":"iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0","data":[{"schema":"iglu:com.google.analytics/cookies/jsonschema/1-0-0","data":{"_ga":"GA1.2.1253457568.1522905155"}},{"schema":"iglu:com.snowplowanalytics.snowplow/geolocation_context/jsonschema/1-1-0","data":{"latitude":47.623260099999996,"longitude":-122.33038560000001,"latitudeLongitudeAccuracy":26,"altitude":null,"altitudeAccuracy":null,"bearing":null,"speed":null,"timestamp":1533765818860}},{"schema":"iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-0-0","data":{"id":"7745e613-5150-46bb-bb22-8279c0d3c406"}},{"schema":"iglu:org.w3/PerformanceTiming/jsonschema/1-0-0","data":{"navigationStart":1533765807237,"unloadEventStart":0,"unloadEventEnd":0,"redirectStart":0,"redirectEnd":0,"fetchStart":1533765807494,"domainLookupStart":1533765807497,"domainLookupEnd":1533765807497,"connectStart":1533765807497,"connectEnd":1533765807796,"secureConnectionStart":1533765807568,"requestStart":1533765807796,"responseStart":1533765808357,"responseEnd":1533765808372,"domLoading":1533765808373,"domInteractive":1533765809003,"domContentLoadedEventStart":1533765809004,"domContentLoadedEventEnd":1533765809030,"domComplete":1533765811998,"loadEventStart":1533765812002,"loadEventEnd":1533765812015,"chromeFirstPaint":1533765808877}}]} {"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.snowplowanalytics.snowplow/link_click/jsonschema/1-0-1","data":{"targetUrl":"https://itunes.apple.com/us/app/safie/id1320646141","elementId":"","elementClasses":[],"elementTarget":"_blank"}}} Mozilla/5.0 (Macintosh; Intel 
Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36 en-US 1 0 0 0 0 0 0 0 0 1 24 931 960 America/Los_Angeles 1920 1080 UTF-8 916 1913 2018-08-08 22:05:55.194 bbfb5db5-74af-47c6-bd91-18cfceefffdd 2018-08-08 22:05:55.137 com.snowplowanalytics.snowplow link_click jsonschema 1-0-1

I used to use Amplitude, which lets you simply dump a dictionary of event properties that can contain anything. What is the benefit of Snowplow requiring a JSON schema?

I know I asked a lot of things. Thank you in advance.


#2

@sunshineo, before answering questions about structured or self-describing events, could I ask why you want to go that route in the first place?

You seem to be using the Snowplow-authored link click event, which is implemented as a self-describing event under the hood. It should take care of everything for you. You do not need any custom events here. If you want more control over which links are tracked and how, the method gives you plenty of options to accommodate that.
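
For reference, a minimal sketch of narrowing link click tracking with the JavaScript tracker's `enableLinkClickTracking` method. The helper name `configureLinkClicks` and the class name `tracked` are made up for illustration, and the argument order (criterion, pseudoClicks, trackContent) is as I recall it from the v2 tracker documentation, so check it against the tracker version you run.

```javascript
// Hypothetical helper: takes the tracker function (normally window.snowplow)
// as a parameter so the configuration can be exercised without a browser.
function configureLinkClicks(snowplow) {
  snowplow(
    'enableLinkClickTracking',
    { whitelist: ['tracked'] }, // criterion: only links with class "tracked" (assumed class name)
    true,                       // pseudoClicks: also try to detect middle clicks
    true                        // trackContent: include the link's inner HTML in the event
  );
}

// In the page or a GTM Custom HTML tag this would be called as:
// configureLinkClicks(window.snowplow);
```

Note that this only covers links, so it does not by itself answer the "track all clicks" case.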


#3

Thank you, ihor. Yes, the unstruct event example I pasted is a link click event, but we want to track all clicks, for which Google Tag Manager provides the following information: ‘{{Click Classes}}’, ‘{{Click ID}}’, ‘{{Click Target}}’, ‘{{Click Text}}’, ‘{{Click URL}}’, {{Click Element}}.tagName. The title of the post asks how we should track this.

Even for the link click events that Snowplow tracks automatically, which as you pointed out are “implemented as a self-describing event under the hood”, I do not understand how they can be used when all the data is in one column as JSON. Compare that to a struct event, where category, action, label, property and value each have their own column.

Writing this, I realized I may be missing some steps. I’m only looking at the data in our S3 bucket. In my past experience I would start writing MapReduce jobs on it, which makes me want everything in its own column. But will this JSON be broken up and put into columns when the data goes into Redshift or PostgreSQL?


#4

@sunshineo, indeed the self-describing events and contexts are in JSON format in S3. To load such data into Redshift we use the COPY FROM JSON command, unlike the atomic data in TSV format, for which plain COPY is used. The result is that atomic and custom data land in their own tables. Here’s the link to the DDL for the “link_click” table.
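
To make the "own tables" idea concrete, here is a simplified illustration of how one self-describing JSON could map to a dedicated table and row. It is a sketch of the concept only, not the actual loader code: the function name `shred` is made up, the real Redshift tables carry more bookkeeping columns than shown, and the event ID and collector timestamp are read off the sample row in the first post.

```javascript
// Simplified illustration of shredding: turn one self-describing JSON
// into a { table, row } pair. Not the actual Snowplow loader.
function shred(selfDescribingJson, eventId, collectorTstamp) {
  // Iglu URI layout: iglu:<vendor>/<name>/<format>/<model>-<revision>-<addition>
  const match = /^iglu:([^/]+)\/([^/]+)\/([^/]+)\/(\d+)-\d+-\d+$/.exec(selfDescribingJson.schema);
  if (!match) throw new Error('Not an Iglu schema URI: ' + selfDescribingJson.schema);
  const [, vendor, name, , model] = match;

  // camelCase -> snake_case, e.g. targetUrl -> target_url
  const snake = (s) => s.replace(/([a-z0-9])([A-Z])/g, '$1_$2').toLowerCase();

  // The table is named after vendor, schema name and major version (model).
  const table = `${vendor.replace(/[.-]/g, '_')}_${snake(name)}_${model}`.toLowerCase();

  // root_id / root_tstamp point back at the parent row in the events table.
  const row = { root_id: eventId, root_tstamp: collectorTstamp };
  for (const [key, value] of Object.entries(selfDescribingJson.data)) {
    row[snake(key)] = value;
  }
  return { table, row };
}

// The inner link_click JSON from the sample row in the first post.
const linkClick = {
  schema: 'iglu:com.snowplowanalytics.snowplow/link_click/jsonschema/1-0-1',
  data: {
    targetUrl: 'https://itunes.apple.com/us/app/safie/id1320646141',
    elementId: '',
    elementClasses: [],
    elementTarget: '_blank',
  },
};

const shredded = shred(linkClick, '0413e9c3-eaf5-4e3f-9232-ccdb2ffca65e', '2018-08-08 22:05:55.271');
console.log(shredded.table, shredded.row);
```

This is also where the schema earns its keep: the table and its columns are derived from it, which a free-form dictionary of properties would not allow.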

The last diagram on this wiki page explains the process.

I think I can say a little about the difference between structured and self-describing events in that respect. By its nature, the structured event has a predefined format (5 properties at most) and as such will be loaded into the events table. A self-describing event, on the other hand, will be loaded into a dedicated table, as there’s no way to know its size in advance to accommodate it in the events table.
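
A rough sketch of the two shapes for the click data from the original question. The click values and the schema URI `iglu:com.example/ui_click/jsonschema/1-0-0` are made up for illustration; a real schema would have to exist in your own Iglu repository. The functions only build the argument lists, so nothing here depends on a browser.

```javascript
// Values a GTM click trigger would supply; these literals are invented.
const click = {
  classes: 'btn btn-primary',
  id: 'signup',
  target: '_blank',
  text: 'Sign up',
  url: 'https://example.com/signup',
  tagName: 'A',
};

// Hypothetical schema URI; it must be defined in your own Iglu repository.
const CLICK_SCHEMA = 'iglu:com.example/ui_click/jsonschema/1-0-0';

// Structured event: category, action, label, property (value left unset).
// Only a couple of the click fields fit into the fixed slots.
function asStructEventArgs(c) {
  return ['trackStructEvent', 'UI', 'click', c.id, c.text];
}

// Self-describing event: every field travels under one schema.
function asSelfDescribingEventArgs(c) {
  return ['trackSelfDescribingEvent', { schema: CLICK_SCHEMA, data: { ...c } }];
}

// In the browser these would be passed to the tracker, e.g.:
// window.snowplow(...asSelfDescribingEventArgs(click));
console.log(asStructEventArgs(click));
console.log(JSON.stringify(asSelfDescribingEventArgs(click)[1]));
```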


#5

For Redshift, yes (this processing is referred to as shredding). Currently, Postgres data is not shredded in the same way.


#6

So what about logging a struct event with a custom context? Where does the custom context go?


#7

Custom contexts are not much different from unstruct events in that respect. Both are self-describing entities and as such will be loaded into dedicated tables (in Redshift). You would have to join the events table and the dedicated table on event_id = root_id and collector_tstamp = root_tstamp during the data modeling process.
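
The join can be sketched with small in-memory stand-ins for the two tables. The context table `com_example_ui_click_1`, its columns `click_text` and `tag_name`, and all row values are invented for illustration; only the join keys come from the description above.

```javascript
// Tiny stand-ins for atomic.events and a shredded context table.
const events = [
  { event_id: 'e1', collector_tstamp: '2018-08-08 22:05:55.271', se_category: 'UI', se_action: 'click' },
  { event_id: 'e2', collector_tstamp: '2018-08-08 22:06:10.002', se_category: 'UI', se_action: 'click' },
];
const uiClickContexts = [
  { root_id: 'e1', root_tstamp: '2018-08-08 22:05:55.271', click_text: 'Sign up', tag_name: 'A' },
];

// Inner join on event_id = root_id AND collector_tstamp = root_tstamp.
// Roughly the same as this SQL (table name is hypothetical):
//   SELECT e.*, c.click_text, c.tag_name
//   FROM atomic.events e
//   JOIN atomic.com_example_ui_click_1 c
//     ON c.root_id = e.event_id AND c.root_tstamp = e.collector_tstamp;
function joinContexts(eventRows, contextRows) {
  const out = [];
  for (const e of eventRows) {
    for (const c of contextRows) {
      if (c.root_id === e.event_id && c.root_tstamp === e.collector_tstamp) {
        out.push({ ...e, ...c });
      }
    }
  }
  return out;
}

const joined = joinContexts(events, uiClickContexts);
console.log(joined);
```

Because this is an inner join, events without a matching context row (like `e2` here) drop out; a left join would keep them if that is what the model needs.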