Rookie mistake? Great raw logs, weird characters in Kinesis stream


#1

Hi everyone,

We have captured the first couple of million pv events and our ELB logs look great. Example:

2017-08-03T04:01:02.472157Z snowplow x.x.x.x:50310 x.x.x.x:8080 0.000119 0.002657 0.000058 200 200 0 43 "GET https://sp.domain.tld:443/i?stm=1501732883853&e=pv&url=https%3A%2F%2Fwww.domain.com%2Fsome%2Fpath%2F&page=some%3Apage&refr=https%3A%2F%2Fwww.referrer.com%2F&tv=js-2.7.2&tna=sp_trackername&aid=someCMS&p=web&tz=America%2FDenver&lang=en-CA&cs=utf-8&f_pdf=0&f_qt=0&f_realp=0&f_wma=0&f_dir=0&f_fla=1&f_java=1&f_gears=0&f_ag=1&res=1920x1080&cd=24&cookie=1&eid=a589eb3d-3c92-4ecf-9d7a-4055642724e2&dtm=1501732883850&vp=1920x934&ds=1903x3120&vid=1&sid=f338b530-dd5b-4d21-8600-2a4379953bab&duid=3d5b94a2-033e-4f5f-9b08-b7132a7836c7&fp=4064590807&co=%7B%22schema%22%3A%22iglu%3Acom.snowplowanalytics.snowplow%2Fcontexts%2Fjsonschema%2F1-0-0%22%2C%22data%22%3A%5B%7B%22schema%22%3A%22schema%22%2C%22data%22%3A%7B%22someData%22%3A%22someValue%22%7D%7D%5D%7D HTTP/1.1" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko" ECDHE-RSA-AES128-SHA256 TLSv1.2

However, when we preview the data in the Kinesis stream it looks like this:

This looks like an encoding error. Our raw data should be UTF-8, which is why I am surprised to see issues that look encoding-related.

So far we have only set up JavaScript trackers and scala-stream-collector, and we want to add stream-enrich (via Kinesis) next.

Any help greatly appreciated,
Ian


#2

Hi Ian,

What you’re seeing there is the raw bytes of the Thrift-serialized collector payload. When sinking to Kinesis, the Scala stream collector writes its records in this Thrift format. The stream enrich process reads the Thrift-serialized records from the collector’s Kinesis stream and then, if configured to do so, writes them to a separate Kinesis enriched stream in a Base64-encoded TSV format. If you read from this enriched stream, you’ll get something that’s more human-readable.

Here’s my poorly drawn ASCII diagram to explain:


   +--------------------+          +-------------------------+         +---------------+     +-------------------------+
   |                    |          |                         |         |               |     |                         |
   |                    |  Thrift  |                         |  Thrift |               | TSV |                         |
   | collector          +----------+  Kinesis payload stream +---------+ Stream enrich +-----+ Kinesis enriched stream |
   |                    |          |                         |         |               |     |                         |
   |                    |          |                         |         |               |     |                         |
   +--------------------+          +-------------------------+         +---------------+     +-------------------------+
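To illustrate why a Thrift payload looks like an encoding problem in a stream preview, here is a minimal Python sketch that hand-encodes two string fields using the Thrift binary protocol layout (1-byte type, 2-byte field id, 4-byte length, then the bytes) and prints them as if they were plain UTF-8 text. The field ids (1 and 2) and the sample values are made up for illustration and are not the collector payload's real schema.

```python
import struct

T_STRING = 11  # Thrift binary protocol type id for string fields
T_STOP = 0     # single byte that terminates a struct

def thrift_string_field(field_id, value):
    """Encode one string field: type (1 byte), id (2 bytes), length (4 bytes), data."""
    data = value.encode("utf-8")
    return struct.pack(">bhi", T_STRING, field_id, len(data)) + data

def encode_struct(fields):
    """Encode a dict of {field_id: string} as a Thrift-style struct."""
    body = b"".join(thrift_string_field(i, v) for i, v in fields.items())
    return body + bytes([T_STOP])

# Hypothetical field ids and values, for illustration only.
payload = encode_struct({1: "e=pv&aid=someCMS", 2: "Mozilla/5.0"})

# A stream preview that treats the bytes as text shows the readable strings
# wrapped in control bytes from the field headers.
preview = payload.decode("utf-8", errors="replace")
print(repr(preview))
```

The strings themselves are still UTF-8; the "weird characters" are the binary field headers and length prefixes around them.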

You can of course read the TSV directly off the enriched stream, or use one of the Analytics SDKs (Scala or Python) to help parse it.
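Reading the TSV directly could look roughly like the Python sketch below. It assumes the record data arrives as a Base64 string, as described above, and that it decodes to a single tab-separated line. The sample record and its four columns are invented for illustration; a real enriched event has many more fields, and the helper name `decode_enriched_record` is hypothetical.

```python
import base64

def decode_enriched_record(data_b64):
    """Decode one Base64-encoded record and split the TSV line into fields."""
    line = base64.b64decode(data_b64).decode("utf-8")
    # split("\t") keeps empty strings, so empty columns stay in position
    return line.rstrip("\n").split("\t")

# Hypothetical sample: four made-up columns, one of them empty.
sample_tsv = "someCMS\tweb\t\tpage_view"
record = base64.b64encode(sample_tsv.encode("utf-8")).decode("ascii")

fields = decode_enriched_record(record)
print(fields)
```

Splitting on tabs only gives positional values; mapping positions to field names is the part the Analytics SDKs handle for you.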


#3

Hi Mike,

Thank you so much for your quick reply! I had read about Thrift but didn’t remember it when I saw these encoding-like errors. Your diagram should definitely be added to the official documentation! :wink:

Best regards,
Ian