Understanding JS tracker fields

Hi,

There are a couple of things I am struggling to develop an understanding of:

  1. Understanding enriched fields:
    I am trying to understand the various fields from the JS tracker. My aim is to stitch together page visits for a user, and to stitch together the actions within each page visit.

At a high level, I understand these fields are relevant:
event, event_id, user_id (this always comes through as null), domain_userid, domain_sessionidx (always comes through as 1, no matter whether I reopen the browser or open a private window), and for sequencing probably dvce_created_tstamp.

  2. Many enriched fields are coming through as null for me. These are user_id, user_fingerprint, all geo_ fields (I am using the IP lookups enrichment with the free MaxMind GeoLite2 City database), all ip_ fields, all refr_ fields, all mkt_ fields, all se_, tr_ and ti_ fields, br_name, br_family, br_type, br_renderengine, event_fingerprint, and true_tstamp.

My JS tracker is as follows:

<script type="text/javascript" async=1>
;(function(p,l,o,w,i,n,g){if(!p[i]){p.GlobalSnowplowNamespace=p.GlobalSnowplowNamespace||[];p.GlobalSnowplowNamespace.push(i);p[i]=function(){(p[i].q=p[i].q||[]).push(arguments)};p[i].q=p[i].q||[];n=l.createElement(o);g=l.getElementsByTagName(o)[0];n.async=1;n.src=w;g.parentNode.insertBefore(n,g)}}(window,document,"script","http://<sp.js serving url>","snowplow"));
snowplow('newTracker', 'sp', '<collector end point>', {
  appId: 'my-app-id',
  contexts: {
    webPage: true
  }
});
snowplow('enableActivityTracking', 30, 10);
snowplow('trackPageView');
</script>

I have gone through the user docs but am unable to follow them clearly. Can someone help me out?

Hi there,

With regards to the behaviour you’re seeing in point 1:
If you are running this locally, i.e. on localhost over http, then the cookies will not be stored, as they default to secure. You’ll need to set cookieSecure: false and cookieSameSite: "Lax" in your tracker configuration if that’s the case.
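
For example, a minimal sketch of that configuration (the collector endpoint and app ID are the placeholders from your snippet):

snowplow('newTracker', 'sp', '<collector end point>', {
  appId: 'my-app-id',
  cookieSecure: false,    // allow cookies over plain http (local development only)
  cookieSameSite: 'Lax',  // explicit SameSite value to pair with non-secure cookies
  contexts: {
    webPage: true
  }
});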

user_id is a field for you to set your own identifier. Use the setUserId method to populate it.
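
For example (the identifier here is a hypothetical value from your own auth system):

snowplow('setUserId', 'user-12345'); // 'user-12345' is a placeholder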

The Snowplow Tracker Protocol and Snowplow Canonical Event Model should help you understand those fields some more.

There is also a wealth of information in the JavaScript Trackers documentation.

For point 2 regarding the enriched fields, you’ll need to configure the enrichments for the Enrich application so it can populate those fields.
The enrichments are here: https://docs.snowplowanalytics.com/docs/enriching-your-data/available-enrichments/

Some of the popular ones are:

  - IP Lookups (populates the geo_ and ip_ fields)
  - Referer Parser (populates refr_medium, refr_source and refr_term)
  - Campaign Attribution (populates the mkt_ fields)
  - Event Fingerprint (populates event_fingerprint)

How to tell enrich about your enrichments is documented here.
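
For illustration, an IP Lookups enrichment configuration looks roughly like this (a sketch — double-check the schema version and the database uri against the enrichment docs for your setup):

{
  "schema": "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/2-0-0",
  "data": {
    "name": "ip_lookups",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": {
      "geo": {
        "database": "GeoLite2-City.mmdb",
        "uri": "http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind"
      }
    }
  }
}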

Thanks.

In a thread regarding the web data model, I learnt about the table com_snowplowanalytics_snowplow_web_page_1, which gets automatically generated if the web_page context is enabled.

I have that enabled in my JS tracker, but strangely I do not see this table. Can you help me here?

JS tracker:

;(function(p,l,o,w,i,n,g){if(!p[i])    {p.GlobalSnowplowNamespace=p.GlobalSnowplowNamespace||[]; p.GlobalSnowplowNamespace.push(i);p[i]=function(){(p[i].q=p[i].q||[]).push(arguments) };p[i].q=p[i].q||[];n=l.createElement(o);g=l.getElementsByTagName(o)[0];n.async=1; n.src=w;g.parentNode.insertBefore(n,g)}}(window,document,"script","path to sp/sp.js","snowplow"));

snowplow('newTracker', 'sp', 'collector endpoint', {
   appId: 'my-app-id',
   cookieDomain: null,
   sessionCookieTimeout: 1800,
   discoverRootDomain: true,
   cookieName: "_sp_",
   cookieSameSite: "Lax",
   cookieSecure: false,
   encodeBase64: true,
   respectDoNotTrack: false,
   pageUnloadTimer: 500,
   forceSecureTracker: false,
   eventMethod: "post",
   bufferSize: 1,
   maxPostBytes: 40000,
   cookieLifetime: 63072000,
   stateStorageStrategy: "cookieAndLocalStorage",
   maxLocalStorageQueueSize: 1000,
   resetActivityTrackingOnPageView: true,
   connectionTimeout: 5000, // Available from 2.15.0
   skippedBrowserFeatures: [], // Available from 2.15.0
   anonymousTracking: false, // Available from 2.15.0
   contexts: {
      webPage: true,
      performanceTiming: true,
      geolocation: true,
      clientHints: true, // Available from 2.15.0
   },
});
snowplow('enableActivityTracking', 30, 10);
snowplow('trackPageView');
snowplow('setUserIdFromCookie', 'sp');

With the setup you shared, you will be getting the extra context information, but how it is loaded depends on your setup. The com_snowplowanalytics_snowplow_web_page_1 example is what it will look like in Redshift; it will look slightly different in Snowflake or BigQuery (where it’ll be a column rather than a table).

If you’re using Redshift and not seeing that, then this is likely a misconfiguration of how you’re loading your data into Redshift, or perhaps an old version of RDB Loader. Walk through this guide and make sure you have EmrEtlRunner configured correctly for shredding and loading. Also, automigrations are only enabled from RDB Loader R32 onwards, so check your version of RDB Loader; if it’s pre-R32 you’ll have to create/migrate the tables yourself (the walkthrough above mentions this).


@PaulBoocock So I am using my own table in MySQL. In that case, I am dumping the context data as a binary dump in MySQL. Is there any guidance on how I can use the Analytics SDK to parse the context field natively?

The analytics SDK will parse the whole event into JSON, so one option is to do that before you load, or do that and just pick out the context you need.

If you want to just handle the contexts, then there will be an internal function in the SDK’s code that does the heavy lifting (for example this part of the python SDK code) - so you might be able to use just that function, or pull it out and amend it to your purposes.

Do bear in mind though that we only support consistency in the output of the SDK, so some future release might make breaking changes for the latter approach. If you don’t have too many dependencies and it’s workable, I’d probably aim for transforming the whole event and loading from JSON. :slight_smile:

Yes, I am using the SDK to get the event transformed into JSON. But the context fields are coming up as blobs (the fields correspond to a binary type in my POJO). I wanted to know if there is any support in the Scala SDK to transform a context field into its own JSON, and output that JSON only. That way I can either output the contexts into a separate table or leave them as-is, depending on the use case.

Ah ok, I follow now. The Scala SDK has two relevant methods: parse and toJson (docs with examples).

It sounds like what you describe is what I’d expect to come from the parse method - so if you were to stringify the field you’d get a self-describing JSON string. Perhaps using toJson will give you the Json object you need?

Disclaimer - I’m working off a hunch here. If this isn’t leading in the right direction, perhaps you could provide a snippet of how you’re transforming the event?

Here is my snippet for deserializing the PerformanceTiming context, for example:

import cats.data.Validated;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.snowplowanalytics.snowplow.analytics.scalasdk.Event;
import com.snowplowanalytics.snowplow.analytics.scalasdk.ParsingError;

public PerformanceTiming map(String s) throws Exception {
    // Parse the enriched TSV event line with the Scala Analytics SDK
    Validated<ParsingError, Event> validatedEvent = Event.parse(s);
    String jsonString = null;
    if (validatedEvent.isValid()) {
        Event ev = validatedEvent.toOption().get();
        jsonString = ev.toJson(false).noSpaces();
    }
    // NB: jsonString stays null if parsing failed; handle that case in real code

    ObjectMapper mapper = new ObjectMapper();
    JsonNode event = mapper.readTree(jsonString);
    // Pull the data payload out of the second context attached to the event
    JsonNode parseString = event.get("contexts").get("data").get(1).get("data");

    return mapper.readValue(parseString.toString(), PerformanceTiming.class);
}

I do not know of an alternative to this hack:

JsonNode parseString = event.get("contexts").get("data").get(1).get("data");

The above gives me the output (which seems right):

[navigationStart=1605911064822,redirectStart=0,redirectEnd=0,fetchStart=1605911064823,domainLookupStart=1605911064823,domainLookupEnd=1605911064825,connectStart=1605911064825,secureConnectionStart=0,connectEnd=1605911065086,requestStart=1605911065086,responseStart=1605911067715,responseEnd=1605911067717,unloadEventStart=1605911067717,unloadEventEnd=1605911067717,domLoading=1605911067717,domInteractive=1605911067763,domContentLoadedEventStart=1605911067775,domContentLoadedEventEnd=1605911067785,domComplete=1605911067896,loadEventStart=1605911067896,loadEventEnd=1605911067896,msFirstPaint=,chromeFirstPaint=,requestEnd=,proxyStart=,proxyEnd=]

Ok, so I am guessing this is what I need to do:

  1. Convert the entire SDK output to a JSON tree
  2. Read and recompose the contexts into specific POJOs (PerformanceTiming, for example)
  3. Add in foreign keys to the context POJOs (I think for web_page it's true timestamp and event id)
  4. Write to context tables/columns, or drop, as per the use case

Does the above seem like the right/optimal direction to you?

Also, will the foreign keys be the same irrespective of the context? For example, for PerformanceTiming will the keys still be true timestamp and event id (which, as per my understanding, will be used to join with the main atomic table to generate the web model)?

A longer-term question: how do I go about generalizing the above to dynamic contexts? As of now I have preset the contexts I want, but every time my JS tracker code changes I have to make code changes with the above approach. Is making use of the Event inventory part of the solution? (Thanks for that awesome link.)

Your plan seems like a good approach to me. However, the thought has just struck me that what you’re describing is actually the job that RDB shredder already does (GitHub and docs).

The shredder flattens out custom events and contexts into their own federated tables for a relational DB like Redshift, which can’t deal very well with JSON.

There’s also the more recent Postgres Loader (docs) which, unlike the above, works on a stream rather than in batch. I’m not very familiar with it, so I’m not sure if it shreds the data in the same way, but there’s a shred module so I’m guessing that’s a good start.

So actually, forking one of those projects, or borrowing from their approach, might be worthwhile here.

Also, will the foreign keys be the same irrespective of the context? For example, for PerformanceTiming will the keys still be true timestamp and event id (which, as per my understanding, will be used to join with the main atomic table to generate the web model)?

Actually it’s event_id and collector_tstamp. The shredder will rename these to root_id and root_tstamp in the child tables (atomic.events keeps event_id and collector_tstamp). true_tstamp isn’t always set, whereas collector_tstamp always is (it’s also a reliable source of truth for actual time in the world, since it’s set by the collector and unaffected by any client clock).
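
To make that concrete, here is a minimal sketch of attaching those keys while recomposing contexts (assuming Jackson and the event JSON produced by the SDK; attachRootKeys is a hypothetical helper, and the root_id/root_tstamp names just mirror the shredder's convention):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

// Copy event_id and collector_tstamp onto each context's data payload,
// so each child row can later be joined back to atomic.events.
void attachRootKeys(String eventJson) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    JsonNode event = mapper.readTree(eventJson);
    for (JsonNode context : event.get("contexts").get("data")) {
        ObjectNode child = (ObjectNode) context.get("data");
        child.put("root_id", event.get("event_id").asText());
        child.put("root_tstamp", event.get("collector_tstamp").asText());
        // child is now ready to be written to its own table/column and
        // joined to atomic.events on root_id = event_id
    }
}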

A longer-term question: how do I go about generalizing the above to dynamic contexts? As of now I have preset the contexts I want, but every time my JS tracker code changes I have to make code changes with the above approach. Is making use of the Event inventory part of the solution? (Thanks for that awesome link.)

The low-tech approach, which we used to follow, is to enforce that for any new schema, the table must be created in advance of the tracking going live. This approach also required JSONpath files - not sure if that’d be relevant for you.

The shiny new approach is to implement logic in the loaders, which calls Iglu when it finds a new event or context, and creates the relevant table or column. If you do some digging into the two loaders I’ve linked you to already, I’m sure the logic is in there also!

Thanks for the excellent guidance, @Colm. I can now safely indulge in my JSON parsing adventure to materialize contexts.

I feel adopting the RDB shredder will involve more work at this point (but I could be proven wrong soon :slight_smile: )

Also, thanks for clarifying the keys (event_id, collector_tstamp). The web_page context gives an output similar to the below:
{schema=iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-0-0, data={id=e8b5c86f-af7d-4d87-aa47-c1c03bb28ea6}}
So if I understand right, I will need to create additional properties for the context JSONs (copy event_id and collector_tstamp into the child JSONs) and then join with the atomic table in the modelling phase.

So it seems the id present in the web_page context plays no part in the join step (I got a similar understanding from the thread Purpose of the web page context). My understanding is that the web_page context just links all actions (video watches, scrolls etc.) within the same page view event. The moment I reload the page, I get a different id (that is the behaviour I see while testing as well).

Can you let me know if the above understanding is right?

Yes, correct. event_id is the join key at event level; the web page ID is a value used to aggregate events per page view (and to join at page view level if necessary).