Working with a partial fork of Snowplow


#1

I have set up custom events and uploaded their schemas to an S3 bucket as per the documentation. Here is my resolver file:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },
      {
        "name": "Rocketmiles",
        "priority": 0,
        "vendorPrefixes": [ "noonu" ],
        "connection": {
          "http": {
            "uri": "http://snowplow-rocketmiles-iglu-schemas.s3-website-us-east-1.amazonaws.com"
          }
        }
      }
    ]
  }
}

However I get this error in the “enriched/bad” folder:

"errors":[{"level":"error","message":"Payload with vendor noonu and version tp2 not supported by this version of Scala Common Enrich"}]

Shouldn’t the vendorPrefixes in my resolver and the vendor on my schemas take care of this?
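For reference, my understanding is that vendorPrefixes only controls which repository the resolver queries for a given schema URI, roughly like this (a sketch; igluUriToUrl is a hypothetical helper, assuming the standard Iglu repository layout):

```javascript
// Sketch of how a resolver maps an iglu: schema URI onto a repository URL.
// The base URI comes from the resolver config; the rest follows the standard
// Iglu layout: /schemas/<vendor>/<name>/<format>/<version>.
function igluUriToUrl(repositoryUri, igluUri) {
  // e.g. "iglu:noonu/loginFailure/jsonschema/1-0-0"
  const path = igluUri.replace(/^iglu:/, "");
  return repositoryUri + "/schemas/" + path;
}
```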


#2

Hi @dyerw,

Could you provide an example of the data (custom event) as you send it to the collector, please?

Regards,
Ihor


#3

Sure. Here’s the Request payload captured from the site. Is this what you meant?

{
   "schema":"iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-3",
   "data":[
      {
         "e":"ue",
         "ue_pr":"{\"schema\":\"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0\",\"data\":{\"group\":\"rm.accounts\",\"name\":\"loginFailure\",\"message\":\"unset user\",\"source\":\"loginModal\"}}",
         "ue_n":"loginFailure",
         "ue_g":"rm.accounts",
         "tv":"js-2.5.3-rm-custom",
         "tna":"cf",
         "aid":"www",
         "p":"web",
         "tz":"America/Chicago",
         "lang":"en-US",
         "cs":"UTF-8",
         "f_pdf":"1",
         "f_qt":"0",
         "f_realp":"0",
         "f_wma":"0",
         "f_dir":"0",
         "f_fla":"1",
         "f_java":"0",
         "f_gears":"0",
         "f_ag":"0",
         "res":"1280x800",
         "cd":"24",
         "cookie":"1",
         "eid":"4e6e27fb-c8d1-4f60-ab4b-4ab1f7f7476e",
         "dtm":"1463498985501",
         "co":"{\"schema\":\"iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1\",\"data\":[]}",
         "vp":"577x705",
         "ds":"577x4267",
         "vid":"19",
         "sid":"d56c96e7-3684-46ba-9792-f5fdac0e586e",
         "duid":"51a948373611c24d",
         "fp":"2150806351",
         "rmid":"37A046CE842A0BD8BC8CA24A17CD552B-n1",
         "refr":"https://www.rocketmiles.com/",
         "url":"https://www.rocketmiles.com/?language=en"
      }
   ]
}

Also, I’m trying to analyze analytics that someone else set up the tracker/collector for, so if something obvious is out of place, let me know.


#4

Hi @dyerw - this is interesting:

"tv":"js-2.5.3-rm-custom"

I read this as someone at RocketMiles having set up a custom tracker, based on the Snowplow JavaScript Tracker 2.5.3 but customized. I can see three main problems with the implementation:

  1. Someone has decided to push events into Snowplow using the collector path /noonu/tp2, rather than /com.snowplowanalytics.snowplow/tp2 (read this as “Snowplow Analytics Ltd, Tracker Protocol v2”). Obviously noonu isn’t a payload vendor that a vanilla Snowplow pipeline recognizes
  2. That someone has added new payload parameters like ue_n and ue_g to the JSON, which will cause Snowplow’s validation against this schema to fail
  3. The event (in the ue_pr field) is being passed in without a self-describing wrapper. This will make it impossible for Snowplow to validate the event and shred it through into Redshift or Elasticsearch
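For comparison, a valid ue_pr payload wraps a self-describing event inside the unstruct_event envelope, something along these lines (the noonu schema URI here is hypothetical; a matching schema would need to exist in the Rocketmiles Iglu repository):

```javascript
// Sketch of a valid ue_pr value: the unstruct_event envelope wrapping a
// self-describing custom event. The inner noonu schema URI is hypothetical.
const uePr = JSON.stringify({
  schema: "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0",
  data: {
    schema: "iglu:noonu/login_failure/jsonschema/1-0-0", // hypothetical schema
    data: {
      group: "rm.accounts",
      name: "loginFailure",
      message: "unset user",
      source: "loginModal"
    }
  }
});
```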

I’m sorry the diagnosis is not great - it would be worth going back to whoever implemented the customized tracker and finding out what their plan was…


#5

Unfortunately, a lot of the work was done by someone who is out of the office for quite a while =/. I suspect they didn’t intend to use the built-in EmrEtlRunner and StorageLoader provided by Snowplow, but obviously I’d like to get it working. It’d be one thing to change our front-end code to behave better with the rest of the Snowplow ecosystem, but we have a ton of analytics we need access to.

I think the ue_n and ue_g params are “unstruct event name” and “unstruct event group”

What’s the best way to start trying to either fix this data or get the snowplow tools to accept it?


#6

I think you are right - I expect that someone was planning to write and maintain a fork of Scala Common Enrich and either Scala Hadoop Enrich or Stream Enrich, depending on whether you are using batch or real-time; maybe also StorageLoader or Kinesis Elasticsearch Sink, depending on which storage target you are planning on using.

What’s the best way to start trying to either fix this data

The timing is quite good as we are working on Snowplow R81, which reboots our Bad Rows Hadoop job into a more general-purpose Event Recovery job, which can be fed an arbitrary JavaScript function to fix bad incoming events.

We are testing this with a customer currently and hopefully can share a tutorial for using it on Discourse prior to the official release.

Edit: actually, the functionality coming in R81 is close to what’s needed; it would be better to use that functionality once it exists.
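As a rough illustration of the kind of per-event JavaScript fix-up Event Recovery could run (a sketch only; the actual recovery hook signature may differ, and the derived noonu schema URI is hypothetical):

```javascript
// Sketch of a fix-up for the bad rows described above (not the actual Event
// Recovery API): rewrite the collector vendor path, give the bare ue_pr event
// a self-describing wrapper, and drop the non-standard ue_n / ue_g parameters.
function fixEvent(event) {
  // 1. Point the payload at the standard tracker-protocol vendor/version
  event.path = "/com.snowplowanalytics.snowplow/tp2";

  // 2. The envelope is present but its data is a bare event; wrap it in a
  //    self-describing wrapper. This schema URI is hypothetical and would
  //    need to exist in the Iglu repository.
  const envelope = JSON.parse(event.ue_pr);
  const bare = envelope.data;
  envelope.data = {
    schema: "iglu:noonu/" + bare.name + "/jsonschema/1-0-0",
    data: bare
  };
  event.ue_pr = JSON.stringify(envelope);

  // 3. Remove parameters that fail validation against payload_data
  delete event.ue_n;
  delete event.ue_g;
  return event;
}
```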


#7

Hi Alex,

Thanks for replying.

We would like to move back to the un-forked Snowplow tracker, but I found out that the reason we forked it was so that we could add our own session ID to every event. Our server generates a JSESSIONID (Java session ID) which is stored in some tables in our database and exposed to the client via a cookie. Snowplow appears to generate its own session_id. There does not seem to be a way to override the Snowplow session_id with our own, so we don’t see any good options for making sure every event has our session ID. Is there any way to achieve what we are trying to do?


#8

Hi Kris,

Yes sure - it’s possible to define a session context and add that to all events.

This is the direction of travel for all contextual information; indeed the existing client-side session fields in atomic.events will be ported to the client_session context eventually (the mobile trackers already use this context for sessionization).
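For example, with the JavaScript Tracker you could read the JSESSIONID cookie and attach it as a custom context on every event; a sketch, where the com.rocketmiles schema URI is hypothetical and a matching schema would need to be added to your Iglu repository:

```javascript
// Build a custom session context from the JSESSIONID cookie, so every event
// carries your own session ID. The com.rocketmiles schema URI is hypothetical.
function buildSessionContext(cookieString) {
  const match = cookieString.match(/(?:^|;\s*)JSESSIONID=([^;]+)/);
  return {
    schema: "iglu:com.rocketmiles/session/jsonschema/1-0-0", // hypothetical
    data: { jsessionid: match ? match[1] : null }
  };
}

// Usage with the JavaScript Tracker: pass a context array on each call, e.g.
//   window.snowplow('trackPageView', null, [buildSessionContext(document.cookie)]);
```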

Cheers,

Alex