Archive/raw file format and encoding details


We need to filter out certain records from all data files in archive/raw, then write the resulting data back to S3. We are using PySpark, but we are not able to read the LZO files properly and get ��X�L)��W�!q���e instead of the actual values. The commands we use to read the data are:

files = sc.newAPIHadoopFile("s3://archive/raw/2018-07-01-01-00/", "com.hadoop.mapreduce.LzoTextInputFormat", "", "")
lzoRDD = x: x[1])
# apply filter function on lzoRDD, then write it back to S3:
lzoRDD.saveAsTextFile(path=sys.argv[2], compressionCodecClass="com.hadoop.compression.lzo.LzopCodec")

The resulting data set can no longer be parsed correctly by snowplow-emr-etl-runner. What are the problems with the arguments we pass to sc.newAPIHadoopFile()?
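For reference, a sketch of how those positional parameters line up. The signature of newAPIHadoopFile is (path, inputFormatClass, keyClass, valueClass), so the two empty strings above are passed as the key and value classes, which Hadoop cannot resolve. The class names below are the ones conventionally used with LzoTextInputFormat; this is a sketch assuming the hadoop-lzo jars are on the cluster's classpath, not a verified fix:

```python
# Hedged sketch: newAPIHadoopFile's positional parameters are
# (path, inputFormatClass, keyClass, valueClass). Passing "" for the
# key/value classes, as in the post, gives Hadoop nothing to resolve.
path = "s3://archive/raw/2018-07-01-01-00/"       # path from the post
input_format = "com.hadoop.mapreduce.LzoTextInputFormat"
key_class = "org.apache.hadoop.io.LongWritable"   # byte offset of each line
value_class = "org.apache.hadoop.io.Text"         # the line contents

# On a cluster with hadoop-lzo available, the call would then look like:
# files = sc.newAPIHadoopFile(path, input_format, key_class, value_class)
# lzoRDD = x: x[1])
```

Note that even with the classes filled in, a text-oriented input format only makes sense if the records really are lines of text, which is exactly what the Thrift question below puts in doubt.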

We use "collectors: format: thrift". What are the file format and encoding details of the archive/raw files?
Thanks for the help,


This isn't a particularly easy thing to do @RichardJ, but this file should put you on the right course:
Thank you Alex. I know it's not easy, especially for those of us new to the field; I haven't gotten much of a clue even after reading your code. But I have two specific questions about the log files in the raw/ folder:

  1. What encoding do they use? We see some unrecognized special characters (they are killers!) when using Spark to open those raw logs. The Spark cluster defaults to UTF-8; do the unrecognized special characters mean the files are not encoded in UTF-8?

  2. Our log files are in Thrift format. Is each record a [key, value] pair or a sequence of tab-separated strings?
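The two questions are connected: Thrift-serialized records are binary, not text in any character encoding, so decoding them as UTF-8 produces exactly the kind of mojibake quoted in the first post. A minimal Python illustration (the byte values here are arbitrary and not taken from the actual logs):

```python
# Thrift payloads are binary. Forcing a UTF-8 decode on bytes that are
# not valid UTF-8 yields U+FFFD replacement characters, which render as
# the "unrecognized special characters" described above.
raw = b"\x0b\x00\x01\x00\x00\x00\x04\xffX\xccL"  # arbitrary example bytes

decoded = raw.decode("utf-8", errors="replace")
print(decoded)  # invalid sequences show up as the \ufffd replacement char
```

So the garbage characters do not mean the files use some other text encoding; they mean the records are not text at all and need a Thrift deserializer rather than a line-based text reader.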

Many thanks,