Streaming bad events are not queryable


#1

I have set up a real-time Scala Stream Collector > Stream Enrich > S3 Loader pipeline, having moved from a batch pipeline setup that used the Elastic Beanstalk Clojure Collector.

I use AWS Athena to query bad events. With the batch pipeline I could query bad events using the Presto functions from_utf8(from_base64(line)).
This gave me a TSV line of raw events that was subsequently easy to deconstruct and parse in SQL queries.
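For reference, the decoding those Presto functions perform can be sketched in Python; the sample line below is fabricated for illustration, but the decode steps match from_utf8(from_base64(line)):

```python
import base64

# Fabricated base64-encoded bad-row "line" as the batch pipeline emits it:
# the payload underneath is a plain tab-separated string of raw event fields.
encoded = base64.b64encode(
    "web\t2019-01-01 00:00:00\tpage_view".encode("utf-8")
).decode("ascii")

# Equivalent of Presto's from_utf8(from_base64(line))
decoded = base64.b64decode(encoded).decode("utf-8")

# The resulting TSV line can then be deconstructed field by field
fields = decoded.split("\t")
print(fields)  # ['web', '2019-01-01 00:00:00', 'page_view']
```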

But since I moved to the Scala Stream Collector, the data in the line field of bad events has become undecipherable, possibly because the line is Thrift-encoded. Is there a way to make it more human-readable and, more importantly, queryable?


#2

Hi @arihantsurana,

I’m not sure this is the reason for your troubles, but the key difference in bad rows for the stream pipeline is that the partition structure changes. The batch pipeline outputs a structure partitioned by run, but since the stream pipeline doesn’t run on a schedule in the same way, that’s not possible.

There’s a recent tutorial here on querying bad rows specifically for the real-time format. Note that if you have a lot of data in bad rows, it’s worth copying a sample to another bucket and querying that at first, to avoid running up charges.

Is it possible that your queries produce an undecipherable output because of this change in format?

Best,


#3

The line field is Thrift-encoded, which makes it a bit of a pain to turn into a human-readable format. It is possible to decode the Thrift record (for example, our Snowplow Chrome Inspector will do it for you), but I don’t think you’ll be able to do this in Athena.

Currently Athena doesn’t support UDFs (though BigQuery does), but once this becomes an option it will be possible to write a UDF capable of deserialising the Thrift record.
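Until then, one option is to pull the record apart outside Athena. A minimal sketch in Python (the bad-row JSON below is a fabricated sample; fully deserialising the payload would additionally need classes generated from Snowplow’s collector-payload Thrift schema, which are not shown here):

```python
import base64
import json

# Fabricated stream bad-row record for illustration; real ones carry
# the base64-encoded Thrift bytes of the collector payload in "line".
bad_row = json.dumps({
    "line": base64.b64encode(b"\x0b\x00\x01...").decode("ascii"),
    "errors": [{"level": "error", "message": "..."}],
    "failure_tstamp": "2018-01-01T00:00:00.000Z",
})

record = json.loads(bad_row)

# Stripping the base64 layer leaves raw Thrift bytes, not a TSV string,
# which is why from_utf8(from_base64(line)) yields gibberish here.
raw = base64.b64decode(record["line"])
print(type(raw), len(raw))

# Decoding any further requires the Thrift-generated payload classes;
# that step is deliberately omitted from this sketch.
```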


#4

Hey @Colm, I solved this a few weeks ago by adding date-partitioned writing to S3 Loader: https://github.com/snowplow/snowplow-s3-loader/pull/135
So I am able to query the partitioned bad rows and can see the error JSON and related details; the trouble is with the line field specifically.


#5

Thanks @mike for the insight.
I think there is support for writing simple lambda functions in Presto, so I might attempt writing one for the Thrift decoding. Alternatively, I think we can load Python UDFs into Redshift Spectrum.

Can you point me to a BigQuery version of the UDF that I can use as a starting point?


#6

Ah I hadn’t seen that - thanks for the PR @arihantsurana!


#7

If you’re running straight Presto you can certainly use UDFs; otherwise, Python UDFs within Redshift are one option (although they run quite slowly).
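As a starting point, a Redshift Python UDF is just a scalar Python function registered via CREATE FUNCTION. A hedged skeleton of what the body might look like (the actual Thrift deserialisation is stubbed out with a hex preview, since it would need the schema-generated classes bundled as a Redshift library; function and variable names here are made up):

```python
import base64

# Sketch of a scalar function that a Redshift Python UDF could wrap
# (Redshift would register it via CREATE FUNCTION ... LANGUAGE plpythonu;
# note Redshift's UDF runtime is Python 2.7, while this sketch is Python 3).
def decode_bad_line(line):
    if line is None:
        return None
    raw = base64.b64decode(line)
    # Placeholder: return a printable hex preview of the Thrift bytes.
    # A real UDF would deserialise them with the Thrift-generated classes.
    return raw[:16].hex()

# Made-up sample input standing in for a bad-row line field
sample = base64.b64encode(b"\x0b\x00\x01snowplow").decode("ascii")
print(decode_bad_line(sample))
```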

I don’t have the BigQuery version handy, but if you’re writing it in Python, I wrote a blog post about this topic.