Snowplow settled on the 128MB LZO file size to deal with Hadoop's small files problem quite early on (May 2013, to be precise: http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/).
How did you end up with the 128MB file size? More recent recommendations for Spark suggest file sizes anywhere between 64MB and 1GB, a range Snowplow's target fits into, especially since LZO is splittable, but I was still wondering whether 128MB is the optimal target file size.
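For reference, here's a minimal Spark sketch of how I've been reasoning about it, assuming indexed (and therefore splittable) LZO input and Spark's default `spark.sql.files.maxPartitionBytes`, which happens to be 128MB; the bucket path is just a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Spark's file-based readers carve splittable input into partitions of at
// most spark.sql.files.maxPartitionBytes (default: 134217728 bytes = 128MB),
// so 128MB files map onto roughly one partition each.
val spark = SparkSession.builder()
  .appName("partition-size-check")
  .master("local[*]") // local run for illustration only
  .config("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
  .getOrCreate()

// Placeholder path; reading s3a:// also needs hadoop-aws on the classpath.
val events = spark.read.textFile("s3a://my-bucket/enriched/archive/")
println(s"Input split into ${events.rdd.getNumPartitions} partitions")
```

With 128MB files the partition count roughly equals the file count, which is why I'm wondering whether larger (but still splittable) files would behave just as well.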
HDFS stores data in 64MB blocks by default (128MB from Hadoop 2.x onwards, afaik), so any multiple of that is ideal on HDFS, but what about S3, which has no real block concept and stores whole objects? Do you have experience with long-term S3 storage in larger, splittable files?
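To make my mental model concrete, here's a small sketch of where those knobs live, assuming Hadoop 2.x defaults; as far as I understand, `dfs.blocksize` is a physical storage unit on HDFS, while `fs.s3a.block.size` is just a hint the s3a connector reports to split calculators, since S3 objects are stored whole either way:

```scala
import org.apache.hadoop.conf.Configuration

// dfs.blocksize: physical HDFS block size (64MB in Hadoop 1.x, 128MB since 2.x).
// fs.s3a.block.size: the "block size" s3a advertises for S3 objects (32MB by
// default) -- it only influences how input splits are computed, not storage.
val conf = new Configuration()
println(conf.getLongBytes("dfs.blocksize", 128L * 1024 * 1024))
println(conf.getLongBytes("fs.s3a.block.size", 32L * 1024 * 1024))
```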