I want to use Spark to read and parse the bad events, which are compressed with LZO. However, it looks like they were also encoded with Thrift at some point.
What I have is something like this:
val input = sc.textFile("s3://snowplow-raw-events-bucket/2020-10-07-00/2020-10-07-00/raw*.lzo")
val df = input.toDF()
However, the result shows a lot of unrecognizable dark question marks instead of nice JSON. How do I get around this?
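
For context, the question marks are most likely U+FFFD replacement characters: sc.textFile treats each record as UTF-8 text, so any bytes that are not valid UTF-8 (which a binary Thrift payload would contain) get substituted. Here is a minimal sketch of that effect in plain Scala, with no Spark involved. The object name, the helper, and the byte values are made up for illustration and are not taken from real events.

```scala
import java.nio.charset.StandardCharsets

object ReplacementCharDemo {
  // Decode raw bytes as UTF-8 text, the way a text-based reader would.
  def decodeAsText(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    // Hypothetical bytes standing in for a binary record (assumption, not real data).
    val binary: Array[Byte] = Array(0x89, 0xFF, 0xFE, 0x80).map(_.toByte)
    val text = decodeAsText(binary)

    // Invalid UTF-8 input is replaced with U+FFFD, which many terminals
    // render as a dark diamond with a question mark.
    println(text.contains('\uFFFD')) // prints: true
  }
}
```

If that is what is going on, the data would need to be read as bytes and deserialized with the matching Thrift schema rather than read as lines of text.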