Enriched data post-EmrEtlRunner is Bad Or Missing


#1

Hey,

First-time user, just setting up a proof-of-concept of the basic setup using the CloudFront collector, JavaScript tracker, and an AWS stack. I’m following this setup guide on GitHub to a tee.

We ran the EmrEtlRunner on a few days’ worth of event data from a webpage. After the EMR job, some of the event data was invalidated and filed in the ‘Bad’ S3 folder, while other events seem to have disappeared completely. There is nothing in the ‘Good’ destination. I ran the EmrEtlRunner again today on only 1 day’s worth of tracking data but got the same results.

Here’s the JavaScript tracker snippet on the page:

<script type="text/javascript">
;(function(p,l,o,w,i,n,g){if(!p[i]){p.GlobalSnowplowNamespace=p.GlobalSnowplowNamespace||[];
p.GlobalSnowplowNamespace.push(i);p[i]=function(){(p[i].q=p[i].q||[]).push(arguments)
};p[i].q=p[i].q||[];n=l.createElement(o);g=l.getElementsByTagName(o)[0];n.async=1;
n.src=w;g.parentNode.insertBefore(n,g)}}(window,document,"script","//d1fc8wv8zag5ca.cloudfront.net/2.8.2/sp.js","snowplow"));

window.snowplow('newTracker', 'cf', 'd1jw5wkcg8ixfp.cloudfront.net', { // Initialise a tracker - point to cloudfront that serves S3 bucket w/ pixel 
  appId: 'web',
  cookieDomain: null,
  gaCookies: true
});
window.snowplow('enableActivityTracking', 30, 10);
window.snowplow('trackPageView');
window.snowplow('enableLinkClickTracking', null, true, true);

</script>

I believe I have configured the EmrEtlRunner properly; it appears to be writing to the correct buckets (with the exception of no Good/Enriched data) and the entire cluster starts and finishes without issues. I linted the resolver file and it found no issues (was just using the template as I didn’t configure custom event schemas).

A) For Bad Enriched Data
My understanding right now is that if I’m not using any custom JSON schemas and the EmrEtlRunner is configured and running, then this data must be arriving in an invalid format at the tracking level. Is there anything obviously screwed up with the implementation of the tracker?

B) Missing Data
I can see entries like this in the CloudFront logs:

date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version

2017-12-04	21:27:11	JFK6	480	24.212.244.202	GET	d1jw5wkcg8ixfp.cloudfront.net	/i	200	http://website/	Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_12_6)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/62.0.3202.94%2520Safari/537.36	stm=1512422831842&e=pv&url=http%253A%252F%252Fwebsite%252F&tv=js-2.8.2&tna=cf&aid=web&p=web&tz=America%252FNew_York&lang=en-US&cs=UTF-8&f_pdf=1&f_qt=0&f_realp=0&f_wma=0&f_dir=0&f_fla=0&f_java=0&f_gears=0&f_ag=0&res=1280x800&cd=24&cookie=1&eid=f2a92856-a254-45aa-b448-1a8eee95c129&dtm=1512422831838&vp=1232x633&ds=1232x2516&vid=5&sid=8c2c7a19-5450-4e2d-9fca-73f41f24f043&duid=6c4a61c0-c09e-49bd-a74e-ad38921d7984&fp=1107931059	-	Hit	qH1XChWlZPNDrTQPzOD8CqB5dGgR0wfyB9xM7Xqb9ny4Dk4ybY02Lg==	d1jw5wkcg8ixfp.cloudfront.net	http	809	0.004	-	-	-	Hit	HTTP/1.1

Note that I have manually changed the name of the site to “website” in this example; the original log has a valid URL.

I can see the events and query parameters in that log entry, but they don’t appear to be showing up in the Good or Bad S3 destinations. They do, however, show up in the archive. There are lots more like it.
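For anyone who wants to inspect these log lines by hand, here is a minimal sketch that pulls the tracker payload out of one tab-separated CloudFront log line. It assumes the field order from the header row above (cs-uri-stem is the 8th field, cs-uri-query the 12th) and that CloudFront has re-encoded each "%" as "%25", which is why values such as url=http%253A%252F%252F... need two decoding passes. The function name parseCloudfrontLine is made up for illustration and is not part of Snowplow.

```javascript
// Sketch only: extract the request path and tracker parameters from one
// tab-separated CloudFront access-log line (field order as in the header row).
function parseCloudfrontLine(line) {
  const fields = line.split('\t');
  const uriStem = fields[7];   // cs-uri-stem, e.g. "/i"
  const rawQuery = fields[11]; // cs-uri-query, still double-encoded
  const params = {};
  for (const pair of rawQuery.split('&')) {
    const eq = pair.indexOf('=');
    const key = eq === -1 ? pair : pair.slice(0, eq);
    const value = eq === -1 ? '' : pair.slice(eq + 1);
    // Two passes: one for CloudFront's "%25" layer, one for the tracker's own encoding.
    // decodeURIComponent throws on malformed escapes, so wrap this in try/catch for real logs.
    params[key] = decodeURIComponent(decodeURIComponent(value));
  }
  return { uriStem, params };
}

// Shortened version of the log line above (only the first 12 fields).
const sampleLine = [
  '2017-12-04', '21:27:11', 'JFK6', '480', '24.212.244.202', 'GET',
  'd1jw5wkcg8ixfp.cloudfront.net', '/i', '200', 'http://website/', 'Mozilla/5.0',
  'e=pv&url=http%253A%252F%252Fwebsite%252F&tz=America%252FNew_York',
].join('\t');

const parsed = parseCloudfrontLine(sampleLine);
console.log(parsed.params.url); // http://website/
```

Here e=pv is the page view event and the request path is /i, so the line looks like a well-formed tracker hit.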

Would love if someone could point me in the right direction here, not sure where to start debugging. Happy to provide the config.yaml file or anything else that may be required.

Thanks!


#2

Hi @sharden,

I’m not sure if this is what’s causing the issue you’re having, but it might be that the 'trackPageView' method should be called after 'enableLinkClickTracking' with the JavaScript tracker.

<script type="text/javascript">
;(function(p,l,o,w,i,n,g){if(!p[i]){p.GlobalSnowplowNamespace=p.GlobalSnowplowNamespace||[];
p.GlobalSnowplowNamespace.push(i);p[i]=function(){(p[i].q=p[i].q||[]).push(arguments)
};p[i].q=p[i].q||[];n=l.createElement(o);g=l.getElementsByTagName(o)[0];n.async=1;
n.src=w;g.parentNode.insertBefore(n,g)}}(window,document,"script","//d1fc8wv8zag5ca.cloudfront.net/2.8.2/sp.js","snowplow"));

window.snowplow('newTracker', 'cf', 'd1jw5wkcg8ixfp.cloudfront.net', { // Initialise a tracker - point to cloudfront that serves S3 bucket w/ pixel 
  appId: 'web',
  cookieDomain: null,
  gaCookies: true
});
window.snowplow('enableActivityTracking', 30, 10);
window.snowplow('enableLinkClickTracking', null, true, true);
window.snowplow('trackPageView');

</script>

As for debugging, you can always set up a Snowplow Mini instance. This is a small-scale, real-time implementation of Snowplow, which allows you to test your tracking in real time. Here’s the quick start guide. I wouldn’t imagine it would take you long to set up, considering how far you’ve gone with the full pipeline.

I hope this is helpful!

Colm


#3

Hey @Colm,

This is helpful, thank you. I’ll move ‘trackPageView’ down below ‘enableLinkClickTracking’ and let you know if that helps.

I had set up Snowplow Mini to poke around with a bit but was hoping to set this full pipeline up as a proof-of-concept to gather some actual analytics data on a live page.

I guess the one thing about this situation that confuses me the most is that there are some events that don’t seem to be ending up in either s3://enriched/good or s3://enriched/bad. My understanding was that the EmrEtlRunner must output an enriched event to one of those 2 buckets.

If I look at the archive I can see the events recorded as page view, page ping, link click, etc. Yet the ‘Bad’ enriched events contain mostly what appears to be bot traffic (the error message is "Request path … does not match (/)vendor/version(/) pattern nor is a legacy /i(ce.png) request"). At least, when googling that error message in this forum, it appears to be largely written off as bot traffic.
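To make that error message a bit more concrete, here is a rough sketch of the rule as I read it from the message text alone: a request path is accepted if it is the legacy pixel (/i or /ice.png) or has the shape vendor/version. The two regular expressions are my own approximation, not Snowplow’s actual code, and the function name is made up.

```javascript
// Approximation of the path rule described in the error message.
// NOT Snowplow's implementation: both patterns are guesses from the message text.
const LEGACY_PIXEL = /^\/(i|ice\.png)$/;          // "/i" or "/ice.png"
const VENDOR_VERSION = /^\/?[^\/]+\/[^\/]+\/?$/;  // "(/)vendor/version(/)"

function looksLikeTrackerRequest(path) {
  return LEGACY_PIXEL.test(path) || VENDOR_VERSION.test(path);
}

console.log(looksLikeTrackerRequest('/i'));          // true
console.log(looksLikeTrackerRequest('/robots.txt')); // false
```

A crawler asking the CloudFront distribution for /robots.txt or /favicon.ico would fail a check like this, which fits the "bot traffic" explanation. Note that any two-segment path passes this loose check, so it is only good for a first look.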

I fully acknowledge there may be some major gaps in my understanding of the Snowplow stack here, and if that’s the case don’t hesitate to point them out.


#4

I meant that setting up Mini and sending events to it allows you to see if the issue is with the tracker - which gives you a hint as to where to look.

You are correct in that events should land in the Bad bucket if they fail validation - and this would be rare for standard events. And you are also correct that that error message means it’s probably bot traffic.

However, I’m not an engineer so I’m not sure what else might be going wrong.

Do report back on how it goes.

Best,
Colm


#5

Hi @sharden

I think what you are describing is normal and, if I understand correctly, you haven’t lost any data; it’s now in the archive/enriched (and presumably archive/shredded) destinations. This is the correct behaviour for Snowplow and is part of the way they ensure events are not accidentally processed multiple times in the event of an error. The flow is something like:

  1. CloudFront logs are recorded in the in bucket
  2. EMR step to move the events to the processing folder
  3. EMR enrich step, which outputs enriched events to cluster-local HDFS
  4. Move the enriched events to the S3 enriched/events folder
  5. Run the shredding step on the HDFS enriched events, which writes to cluster-local HDFS
  6. Move this data to the enriched/shredded folder
  7. I believe at this point all the data in enriched/shredded is loaded into any storage targets; we’re not doing this, so I can’t look up the exact point at which it happens
  8. Move the raw CloudFront logs in processing to archive/raw
  9. Move enriched/events to archive/enriched
  10. Move enriched/shredded to archive/shredded

None of the moves will proceed if there is still data in the processing, enriched/events or enriched/shredded folders, as this could result in either the EMR step or the storage loader steps processing events more than once. All processed data should be in the archive with the appropriate event type prefix (raw, enriched, the long shredded/vendor=... prefix) and also the Snowplow run ID.
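The guard described above can be sketched roughly as follows. This is only an illustration of the idea, not how the EmrEtlRunner is implemented: the helper names are made up, and objectCounts is assumed to be a plain object that you fill in yourself (for example from an S3 listing) with the number of objects under each staging folder.

```javascript
// Sketch of the "no leftover data in staging" guard; names are hypothetical.
const STAGING_FOLDERS = ['processing', 'enriched/events', 'enriched/shredded'];

// Returns the staging folders that still contain objects from an earlier run.
function blockingFolders(objectCounts) {
  return STAGING_FOLDERS.filter((folder) => (objectCounts[folder] || 0) > 0);
}

// A new run is only safe when every staging folder is empty.
function canStartRun(objectCounts) {
  return blockingFolders(objectCounts).length === 0;
}

console.log(canStartRun({ processing: 0, 'enriched/events': 0, 'enriched/shredded': 0 })); // true
console.log(blockingFolders({ processing: 12, 'enriched/events': 0 })); // [ 'processing' ]
```

If a run dies halfway through, a check like this tells you which folder needs to be cleaned up or moved on before starting again.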

I copied the numbered steps above by looking at the steps in my EMR cluster, which is the best place to see what Snowplow is doing. One caveat: I haven’t run a modern Snowplow storage loader step, so I am inferring where it would be based on my recollection of the documentation.

I hope this helps, I’m not quite sure I’ve correctly interpreted what you’ve said.

Gareth


#6

Hi @gareth,

This was very helpful. In the end I think the fundamental misunderstanding on my part was simply about where these enriched events live once the entire enrichment process is complete.

Here is a snapshot of the ‘buckets’ section of my config file:

buckets:
  assets: s3://snowplow-hosted-assets 
  jsonpath_assets: 
  log: s3://poc-snwplw-etl/logs
  raw:
    in:                  
      - s3://poc-snwplw-logs         
      #- ADD HERE         
    processing: s3://poc-snwplw-etl/processing
    archive: s3://poc-snwplw-etl/archive    
  enriched:
    good: s3://poc-snwplw-data/enriched/good     
    bad: s3://poc-snwplw-data/enriched/bad        
    errors:     
    archive: s3://poc-snwplw-data/enriched/archive   
  shredded:
    good: s3://poc-snwplw-data/shredded/good       
    bad: s3://poc-snwplw-data/shredded/bad        
    #errors: ADD HERE     # Leave blank unless :continue_on_unexpected_error: set to true below
    archive: s3://poc-snwplw-data/shredded/archive  

I was expecting to find the enriched events in s3://poc-snwplw-data/enriched/good. That folder was empty, while s3://poc-snwplw-data/enriched/bad contained some events that did look bad.

However, following your advice I took a closer look at the ‘archived/enriched’ folder (s3://poc-snwplw-data/enriched/archive in my setup) and there seem to be processed events here.

Basically I think I was just thrown off by the ‘archive’ terminology. It seems like all the enriched events I want to analyze are there after all.
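In case it helps anyone else, here is a small sketch of where I now expect to find things after a complete run, based only on the flow described in this thread. The buckets object mirrors part of my config above; postRunLocations is a made-up helper, not anything provided by Snowplow.

```javascript
// Subset of the 'buckets' config above, written as a plain object.
const buckets = {
  raw: { archive: 's3://poc-snwplw-etl/archive' },
  enriched: {
    good: 's3://poc-snwplw-data/enriched/good',
    bad: 's3://poc-snwplw-data/enriched/bad',
    archive: 's3://poc-snwplw-data/enriched/archive',
  },
  shredded: {
    good: 's3://poc-snwplw-data/shredded/good',
    archive: 's3://poc-snwplw-data/shredded/archive',
  },
};

// After a full run the 'good' folders have been emptied into the archives,
// so the archives are where to look (assumption based on this thread).
function postRunLocations(b) {
  return {
    rawLogs: b.raw.archive,
    enrichedEvents: b.enriched.archive,
    shreddedEvents: b.shredded.archive,
    failedEvents: b.enriched.bad,
  };
}

console.log(postRunLocations(buckets).enrichedEvents); // s3://poc-snwplw-data/enriched/archive
```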

I’ll try pushing these archived/shredded events into Redshift and see if I run into any issues. Thanks again for your insight!


#7

Excellent, glad it helped. I have the same naming problem, as we want to run our own EMR jobs on the ‘archived’ data. I thought about at least renaming the folder but haven’t been bold enough.