Basically, we ran the EmrEtlRunner on a few days’ worth of event data from a webpage. After the EMR job, some of the event data was invalidated and filed in the ‘Bad’ S3 folder, while other events seem to have disappeared completely. There is nothing in the ‘Good’ destination. I ran the EmrEtlRunner again today on only 1 day’s worth of tracking data but got the same results.
I believe I have configured the EmrEtlRunner properly; it appears to be writing to the correct buckets (except that no Good/Enriched data is produced), and the entire cluster starts and finishes without issues. I linted the resolver file and it found no issues (I was just using the template, as I didn’t configure custom event schemas).
A) For Bad Enriched Data
My understanding right now is that if I’m not using any custom JSON schemas and the EmrEtlRunner is configured and running, then this data must be getting captured in an invalid format at the tracking level. Is there anything obviously screwed up with the implementation of the tracker?
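One way to narrow this down would be to tally the error messages in the Bad folder. The following is only a rough sketch: it assumes each bad row is a single JSON object per line with an `errors` array, whose items are either plain strings or objects with a `message` key. That layout is an assumption on my part and should be checked against the actual files; the function name `tally_bad_row_errors` is hypothetical.

```python
import json
from collections import Counter


def tally_bad_row_errors(lines):
    """Count distinct error messages across bad-row JSON lines.

    ASSUMPTION: each line is a JSON object with an "errors" array whose
    items are either strings or objects carrying a "message" key.
    """
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        row = json.loads(raw)
        for err in row.get("errors", []):
            message = err.get("message", "") if isinstance(err, dict) else str(err)
            # Truncate long messages so near-identical errors stay readable.
            counts[message[:120]] += 1
    return counts
```

Feeding it the lines of a bad-row file downloaded from S3 and printing `counts.most_common(10)` should show whether the rejections share one cause or several.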
B) Missing Data
I can see in the cloudfront logs entries like this:
date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version
2017-12-04 21:27:11 JFK6 480 184.108.40.206 GET d1jw5wkcg8ixfp.cloudfront.net /i 200 http://website/ Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_12_6)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/62.0.3202.94%2520Safari/537.36 stm=1512422831842&e=pv&url=http%253A%252F%252Fwebsite%252F&tv=js-2.8.2&tna=cf&aid=web&p=web&tz=America%252FNew_York&lang=en-US&cs=UTF-8&f_pdf=1&f_qt=0&f_realp=0&f_wma=0&f_dir=0&f_fla=0&f_java=0&f_gears=0&f_ag=0&res=1280x800&cd=24&cookie=1&eid=f2a92856-a254-45aa-b448-1a8eee95c129&dtm=1512422831838&vp=1232x633&ds=1232x2516&vid=5&sid=8c2c7a19-5450-4e2d-9fca-73f41f24f043&duid=6c4a61c0-c09e-49bd-a74e-ad38921d7984&fp=1107931059 - Hit qH1XChWlZPNDrTQPzOD8CqB5dGgR0wfyB9xM7Xqb9ny4Dk4ybY02Lg== d1jw5wkcg8ixfp.cloudfront.net http 809 0.004 - - - Hit HTTP/1.1
Note that I have manually changed the name of the site to “website” in this example; the original log has a valid URL.
I can see the events and query parameters in that log entry, but they don’t appear to be showing up in either the Good or the Bad S3 destination. They do, however, show up in the archive. There are lots more like it.
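To check which events are actually present in the raw logs, a small script along these lines could pull the tracker parameters out of each entry. This is a minimal sketch, not part of Snowplow: the function names are hypothetical, the field list is copied from the log header in this post, and it assumes the query string is URL-encoded twice (as the `%253A` sequences in the entry suggest).

```python
from urllib.parse import parse_qs, unquote

# First twelve field names, copied from the log header shown in this post.
FIELDS = [
    "date", "time", "x-edge-location", "sc-bytes", "c-ip", "cs-method",
    "cs(Host)", "cs-uri-stem", "sc-status", "cs(Referer)", "cs(User-Agent)",
    "cs-uri-query",
]


def parse_cloudfront_line(line):
    """Map the leading fields of one access-log line to their names.

    Returns None for blank lines and '#' header lines. Splitting on any
    whitespace works because the field values contain no raw spaces.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    return dict(zip(FIELDS, line.split()))


def decode_query(raw_query):
    """Decode a doubly URL-encoded query string into a flat dict.

    ASSUMPTION: values are percent-encoded twice, so parse_qs undoes one
    layer and unquote undoes the second.
    """
    params = parse_qs(raw_query, keep_blank_values=True)
    return {key: unquote(values[0]) for key, values in params.items()}
```

Running each archived log line through `parse_cloudfront_line`, then passing its `cs-uri-query` value to `decode_query`, gives the `eid` of every collected event, which can be compared against the event IDs that end up in the Good and Bad folders.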
Would love if someone could point me in the right direction here, not sure where to start debugging. Happy to provide the config.yaml file or anything else that may be required.