Ideal file size for enrichment

rgabo · May 27, 2016, 7:42am

Hello Snowplowers,

Snowplow settled on the 128MB LZO file size to deal with the small files problem quite early on (2013/05 to be precise: http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/).

How did you end up with the 128MB file size? More recent recommendations for Spark suggest file sizes anywhere between 64MB and 1GB, which Snowplow's choice fits into, especially since LZO is splittable, but I was still wondering whether 128MB is the ideal target file size.

HDFS stores 64MB blocks afaik, so any multiple of that is ideal on HDFS, but what about S3? Do you have experience with long-term S3 storage in larger, splittable files?

Gabor

alex · June 4, 2016, 12:15am

Hey @rgabo - lots of good questions there.

No real magic - it’s just the current default blocksize for Hadoop, dfs.blocksize, see hdfs-default.xml.
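To make the block-size arithmetic concrete, here is a rough Python sketch of how many HDFS blocks a file of a given size would span, and how full the last block is. The 128MB value is assumed from the default mentioned above (your cluster may override dfs.blocksize), and the helper name block_usage is made up for illustration, not part of Hadoop or Snowplow.

```python
MB = 1024 * 1024

# Assumption: dfs.blocksize is left at 128MB; check your own cluster config.
BLOCK_SIZE = 128 * MB


def block_usage(file_size, block_size=BLOCK_SIZE):
    """Return (number_of_blocks, fill_fraction_of_last_block) for a file.

    Hypothetical helper for illustration only.
    """
    if file_size <= 0:
        return 0, 0.0
    # Ceiling division using integers only
    blocks = -(-file_size // block_size)
    last_block_bytes = file_size - (blocks - 1) * block_size
    return blocks, last_block_bytes / block_size


if __name__ == "__main__":
    for size_mb in (10, 128, 200, 256):
        blocks, fill = block_usage(size_mb * MB)
        print(f"{size_mb}MB file: {blocks} block(s), last block {fill:.0%} full")
```

Files that are an exact multiple of the block size fill every block completely, while many small files each take up a block entry of their own, which is the small files problem the 128MB target was meant to avoid.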

We don’t have any particular experience in this, but I suspect yes, having e.g. the enriched events stored as splittable LZO in far fewer, bigger files would be highly performant. Would love to hear what you find out if you test this!

rgabo · June 8, 2016, 7:23am

Seems like the Spark/Parquet default compression is gzip to optimize for storage of persistent data. Snappy is used for temporary data between stages in Spark by default.

The one drawback of LZO is that its licensing does not permit companies like Databricks to package it in their service, so you need to install it manually. Snappy does not have this issue and it’s very comparable.

I’ll share more experience around compression types and file sizes later on.
