Ideal file size for enrichment


#1

Hello Snowplowers,

Snowplow settled on the 128MB LZO file size to deal with the small files problem quite early on (2013/05 to be precise: http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/).

How did you end up with the 128MB file size? More recent recommendations for Spark put file sizes anywhere between 64MB and 1GB, which Snowplow's choice fits into, especially since LZO is splittable, but I was still wondering whether 128MB is the ideal target file size.

HDFS stores 64MB blocks afaik, so any multiple of that is ideal on HDFS, but what about S3? Do you have experience with long-term S3 storage in larger, splittable files?

Gabor


#2

Hey @rgabo - lots of good questions there.

No real magic - it’s just the current default blocksize for Hadoop, dfs.blocksize, see hdfs-default.xml.
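To make the arithmetic concrete, here is a rough sketch of how file size maps onto HDFS blocks. It assumes the Hadoop 2.x default of 134217728 bytes (128 MiB) for dfs.blocksize; the helper names `blocks_needed` and `files_needed` are made up for illustration and are not part of Hadoop or Snowplow.

```python
# Assumption: Hadoop 2.x default dfs.blocksize from hdfs-default.xml,
# 134217728 bytes (128 MiB). Override it if your cluster differs.
DFS_BLOCKSIZE = 128 * 1024 * 1024


def blocks_needed(file_size_bytes, block_size=DFS_BLOCKSIZE):
    """Number of HDFS blocks a single file of this size occupies."""
    if file_size_bytes < 0 or block_size <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    # Integer ceiling division, avoids float rounding on large sizes.
    return (file_size_bytes + block_size - 1) // block_size


def files_needed(total_bytes, target_file_size=DFS_BLOCKSIZE):
    """Number of files produced when a dataset is packed to a target size."""
    # Same ceiling division: the last file may be smaller than the target.
    return blocks_needed(total_bytes, target_file_size)


if __name__ == "__main__":
    one_gib = 1024 * 1024 * 1024
    # A 1 GiB file spans several 128 MiB blocks; a 128 MiB file spans one.
    print(blocks_needed(one_gib), blocks_needed(DFS_BLOCKSIZE))
    # Packing 100 GiB into 128 MiB files versus 1 GiB files.
    print(files_needed(100 * one_gib), files_needed(100 * one_gib, one_gib))
```

A file that is exactly one block in size never straddles a block boundary, which is the usual argument for matching the target file size to the block size on HDFS. Whether the same reasoning carries over to S3 is a separate question.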

We don’t have any particular experience in this, but I suspect yes, having e.g. the enriched events stored as splittable LZO in far fewer, bigger files would be highly performant. Would love to hear what you find out if you test this!


#3

Seems like the Spark/Parquet default compression is gzip to optimize for storage of persistent data. Snappy is used for temporary data between stages in Spark by default.

The one drawback of LZO is that its licensing does not permit companies like Databricks to package it in their service, so you need to install it manually. Snappy does not have this issue and it's very comparable.

I’ll share more experience around compression types and file sizes later on.