How to restore missing events in a Snowplow pipeline


#1

One of my servers, the one used for IP checks, went down, and as a result all my events were going into the “enrich-bad-json” Kinesis stream. Now that the service is back up, I want to restore all those missing events in Elasticsearch.
I have tried the following:
I polled the Kinesis stream that the collector writes events to and tried to put those events back into the stream so that the enricher could pick them up, but while the enricher is processing the data I get the following error.

Error deserializing raw event: Cannot read. Remote side has closed. Tried to read 2 bytes, but only got 0 bytes. (This is often indicative of an internal error on the server side. Please check your server logs.)

This is a sample of how my data looks when I pull it from the Kinesis stream:

\x0b\x00d\x00\x00\x00\r42.106.193.13\n\x00\xc8\x00\x00\x01f\xceuq\xa1\x0b\x00\xd2\x00\x00\x00\x05UTF-8\x0b\x00\xdc\x00\x00\x00\x12ssc-0.13.0-kinesis\x0b\x01,\x00\x00\x00\xb8Mozilla/5.0 (Linux; Android 5.0.2; Mi 4i Build/LRX22G; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/61.0.3163.98 Mobile Safari/537.36 [FB_IAB/FB4A;FBAV/176.0.0.42.87;]\x0b\x016\x00\x00\x00jhttps://www.popxo.com/trending/priyanka-wore-tiffany-jewellery-worth-95-crore-at-her-bridal-shower-768514/\x0b\x01@\x00\x00\x00\x02/i\x0b\x01J\x00\x00\x02nstm=1541062094348&e=pp&url=https%3A%2F%2Fwww.popxo.com%2Ftrending%2Fpriyanka-wore-tiffany-jewellery-worth-95-crore-at-her-bridal-shower-768514%2F&page=Priyanka%20Wore%20Jewellery%20Worth%209.5%20Crore%20At%20Her%20Bridal%20Shower%20%7C%20POPxo&refr=http%3A%2F%2Fm.facebook.com%2F&pp_mix=0&pp_max=0&pp_miy=2397&pp_may=2397&tv=js-2.8.0&tna=cf&aid=popxo-web&p=web&tz=Asia%2FKolkata&lang=en-GB&cs=UTF-8&res=360x640&cd=32&cookie=1&eid=f7161e3e-42cb-480d-bb65-8fa61e4afe0f&dtm=1541062094327&vp=360x572&ds=360x17246&vid=9&sid=a80f8ac5-787c-4000-adb3-605da38e2614&duid=5d022148-0118-46ed-b4ea-92b27943a152&fp=2481695805&uid=730786\x0f\x01^\x0b\x00\x00\x00\r\x00\x00\x00\x15Host: track.popxo.com\x00\x00\x002Accept: image/webp, image/apng, image/*, */*;q=0.8\x00\x00\x00\x1eAccept-Encoding: gzip, deflate\x00\x00\x00#Accept-Language: en-GB, en-US;q=0.8\x00\x00\x05{Cookie: __gads=ID=63640a719a2fbbdf:T=1530956020:S=ALNI_MYHYy7Jul6pjsMhgT1D5G1yBQCkNA; _privy_a=%7B%22referring_domain%22%3A%22m.facebook.com%22%2C%22referring_url%22%3A%22http%3A%2F%2Fm.facebook.com%2F%22%2C%22utm_medium%22%3A%22social%22%2C%22utm_source%22%3A%22Facebook%22%2C%22search_term%22%3Anull%2C%22initial_url%22%3A%22https%3A%2F%2Fwww.popxo.com%2F2016%2F03%2Feverything-you-need-to-know-about-having-sex-during-your-period%2F%3Frpxeng%3D1%22%2C%22sessions_count%22%3A2%2C%22pages_viewed%22%3A2%7D; 
_privy_D8866583716CDA595B39701E=%7B%22uuid%22%3A%224457ab75-f0de-46ac-9b88-9328ba6f45db%22%2C%22variations%22%3A%7B%7D%2C%22country_code%22%3A%22HK%22%7D; __unam=8606ef0-16474170b60-175ab92d-2; cto_lwid=4c8224e2-fabf-4b11-86d5-3d2e13cd8a2d; _ga=GA1.2.1967126997.1530956014; _gid=GA1.2.655348186.1541051268; WZRK_G=59d2a089137d46bf9b877d3b1cd7aada; _parsely_session={%22sid%22:8%2C%22surl%22:%22https://www.popxo.com/trending/priyanka-wore-tiffany-jewellery-worth-95-crore-at-her-bridal-shower-768514/%22%2C%22sref%22:%22http://m.facebook.com/%22%2C%22sts%22:1541061963927%2C%22slts%22:1541051294163}; _parsely_visitor={%22id%22:%22b84d26d5-112b-4667-bfd5-82d0cf7ad0fc%22%2C%22session_count%22:8%2C%22last_session_ts%22:1541061963927}; _fbp=fb.1.1541061913311.1941076511; WZRK_S_8R5-WK8-Z64Z=%7B%22p%22%3A1%2C%22s%22%3A1541061909%2C%22t%22%3A1541062029%7D; sp=dc5e97c4-ae3c-471e-8201-06f68fd0f2fd\x00\x00\x00sReferer: https://www.popxo.com/trending/priyanka-wore-tiffany-jewellery-worth-95-crore-at-her-bridal-shower-768514/\x00\x00\x00\xc4user-agent: Mozilla/5.0 (Linux; Android 5.0.2; Mi 4i Build/LRX22G; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/61.0.3163.98 Mobile Safari/537.36 [FB_IAB/FB4A;FBAV/176.0.0.42.87;]\x00\x00\x00%X-Requested-With: com.facebook.katana\x00\x00\x00\x1eX-Forwarded-For: 42.106.193.13\x00\x00\x00\x15X-Forwarded-Port: 443\x00\x00\x00\x18X-Forwarded-Proto: https\x00\x00\x00\x16Connection: keep-alive\x00\x00\x00\x1bTimeout-Access: <function1>\x0b\x01\x90\x00\x00\x00\x0ftrack.popxo.com\x0b\x01\x9a\x00\x00\x00$dc5e97c4-ae3c-471e-8201-06f68fd0f2fd\x0bzi\x00\x00\x00Aiglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0\x00

As I understand it, this is not in Thrift format, which is probably why I am getting that error.
I tried converting it to Thrift format, without success. I am now trying to read the unenriched Thrift files that s3-store writes, to see whether that works.

Could anybody suggest any other way to resolve my issue?

Thanks


#2

It sounds like you need to run a recovery job. At the moment this is quite hard (although this will be easier soon).

First off you need to identify the reason that events were going into bad rows. Once you’ve done this you’ll need to filter the bad rows (ideally have them stored on S3 rather than reading them out of a Kinesis stream) for the error message that corresponds to that reason. Take the payload from each matching row (which will be similar to the Thrift payload you have above).
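
The filtering step could be sketched roughly like this in Python. It assumes each bad row is one JSON object per line, with the original payload base64-encoded under a `line` key and the failure messages under `errors`; check a few of your own bad rows before relying on that layout. The function name and the error text used in the example are made up for illustration.

```python
import base64
import json


def recoverable_payloads(bad_row_lines, error_substring):
    """Yield the raw payload bytes of each bad row whose errors mention error_substring."""
    for raw in bad_row_lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between records
        row = json.loads(raw)
        # Errors may be plain strings or objects with a "message" key (assumption).
        messages = [
            err.get("message", "") if isinstance(err, dict) else str(err)
            for err in row.get("errors", [])
        ]
        if any(error_substring in message for message in messages):
            # Assumption: "line" holds the base64-encoded Thrift payload.
            yield base64.b64decode(row["line"])


if __name__ == "__main__":
    example = json.dumps({
        "line": base64.b64encode(b"\x0b\x00d\x00\x00\x00\x011\x00").decode("ascii"),
        "errors": [{"level": "error", "message": "IP lookup service unavailable"}],
    })
    for payload in recoverable_payloads([example], "IP lookup"):
        print(len(payload), "bytes recovered")
```

Matching on a substring keeps unrelated failures (genuinely malformed events, for instance) out of the recovery set, so only the rows caused by the outage get replayed.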

You’ll then need to deserialize the Thrift payload, parse the contents of the Snowplow payload depending on whether it is a GET or POST request, mutate the contents of the payload so that the event will pass validation, and then serialize the Snowplow payload back to Thrift. Once this recovered raw data is in S3 you can rerun the data through the enrichment process to recover it.
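
As a rough illustration of the deserialize, mutate, serialize round trip, here is a minimal Python sketch that reads and writes the Thrift binary protocol by hand, using only the standard library. It handles just the three field types visible in the sample payload above (string, i64, list of strings) and only the GET case, where the event lives in the querystring (field id 330 in that sample); a POST would need the same treatment applied to the body field. All function names are hypothetical, and for real use the official Thrift library with classes generated from the CollectorPayload schema is likely the safer route.

```python
import struct
from urllib.parse import parse_qsl, quote, urlencode

T_STOP, T_I64, T_STRING, T_LIST = 0, 10, 11, 15
QUERYSTRING_FIELD = 330  # field id carrying the querystring in the sample above


def _read_value(data, pos, ftype):
    """Read one value of the given Thrift type; return (value, new_position)."""
    if ftype == T_I64:
        (value,) = struct.unpack_from(">q", data, pos)
        return value, pos + 8
    if ftype == T_STRING:
        (length,) = struct.unpack_from(">i", data, pos)
        pos += 4
        return data[pos:pos + length], pos + length
    if ftype == T_LIST:
        etype = data[pos]
        (count,) = struct.unpack_from(">i", data, pos + 1)
        pos += 5
        items = []
        for _ in range(count):
            item, pos = _read_value(data, pos, etype)
            items.append(item)
        return (etype, items), pos
    raise ValueError(f"unsupported Thrift type {ftype}")


def decode_struct(data):
    """Decode a flat binary-protocol struct into {field_id: (type, value)}."""
    fields, pos = {}, 0
    while True:
        ftype = data[pos]
        pos += 1
        if ftype == T_STOP:
            return fields
        (field_id,) = struct.unpack_from(">h", data, pos)
        value, pos = _read_value(data, pos + 2, ftype)
        fields[field_id] = (ftype, value)


def _write_value(ftype, value):
    if ftype == T_I64:
        return struct.pack(">q", value)
    if ftype == T_STRING:
        return struct.pack(">i", len(value)) + value
    if ftype == T_LIST:
        etype, items = value
        body = b"".join(_write_value(etype, item) for item in items)
        return struct.pack(">bi", etype, len(items)) + body
    raise ValueError(f"unsupported Thrift type {ftype}")


def encode_struct(fields):
    """Serialize {field_id: (type, value)} back to bytes, in the original field order."""
    out = b""
    for field_id, (ftype, value) in fields.items():
        out += struct.pack(">bh", ftype, field_id) + _write_value(ftype, value)
    return out + b"\x00"  # STOP byte terminates the struct


def patch_querystring(payload, patch):
    """Apply key/value overrides to the querystring of a GET payload."""
    fields = decode_struct(payload)
    ftype, raw_qs = fields[QUERYSTRING_FIELD]
    # Note: dict() keeps only the last value of a repeated parameter.
    params = dict(parse_qsl(raw_qs.decode("utf-8"), keep_blank_values=True))
    params.update(patch)
    new_qs = urlencode(params, quote_via=quote).encode("utf-8")
    fields[QUERYSTRING_FIELD] = (ftype, new_qs)
    return encode_struct(fields)
```

The payload has to be handled as raw bytes throughout. Writing the printable `\x0b\x00d...` text back to a stream, rather than the bytes it represents, would give the enricher something it cannot deserialize.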


#3

Hi Mike,
Thanks for the reply. I did try sending the data from the S3 bucket, which is in Thrift format, but that gave me the same error.
I have now pulled the ELB logs and will try hitting the complete URI, which triggers the whole collector job, to see whether that returns the same error.