Expected Snowplow performance


#1

We just ran some experiments processing small data volumes with the Snowplow batch process and found that processing 500MB of CloudFront logs took:

  • 37 minutes for the enrich step; and
  • 1 hour 23 minutes for the shred step

using an m4.4xlarge node (16 vCPUs and 64GB RAM). The shred step had global event deduplication enabled, and the DynamoDB requests were throttled slightly.
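
Since the throttling presumably means the deduplication table is hitting its provisioned write capacity, here is a rough sketch of how one could bump that capacity with boto3; the region, table name and capacity figures are placeholders rather than our real values:

    import boto3

    # Rough sketch: raise the provisioned throughput on the cross-batch
    # deduplication (event manifest) table so the shred step's DynamoDB
    # writes stop being throttled. The region, table name and capacity
    # figures below are placeholders, not our actual values.
    dynamodb = boto3.client("dynamodb", region_name="eu-west-1")

    dynamodb.update_table(
        TableName="snowplow-event-manifest",   # placeholder table name
        ProvisionedThroughput={
            "ReadCapacityUnits": 100,          # placeholder capacities
            "WriteCapacityUnits": 500,
        },
    )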

Our Spark configuration is:

Classification   Property                             Value
spark            maximizeResourceAllocation           false
spark-defaults   spark.yarn.driver.memoryOverhead     1440m
spark-defaults   spark.executor.cores                 4
spark-defaults   spark.yarn.executor.memoryOverhead   1440m
spark-defaults   spark.executor.instances             3
spark-defaults   spark.default.parallelism            24
spark-defaults   spark.driver.cores                   4
spark-defaults   spark.driver.memory                  12896m
spark-defaults   spark.executor.memory                12896m
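
For reference, this is roughly how the same settings map onto an EMR classification block (e.g. as you might pass them via boto3's run_job_flow Configurations parameter); it is just an alternative view of the table above, not the exact shape of our EmrEtlRunner configuration:

    # The same settings as an EMR "Configurations" list, e.g. for boto3's
    # EMR client run_job_flow(..., Configurations=SPARK_CONFIGURATIONS).
    # This is another view of the table above, not a Snowplow-specific format.
    SPARK_CONFIGURATIONS = [
        {
            "Classification": "spark",
            "Properties": {"maximizeResourceAllocation": "false"},
        },
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.yarn.driver.memoryOverhead": "1440m",
                "spark.executor.cores": "4",
                "spark.yarn.executor.memoryOverhead": "1440m",
                "spark.executor.instances": "3",
                "spark.default.parallelism": "24",
                "spark.driver.cores": "4",
                "spark.driver.memory": "12896m",
                "spark.executor.memory": "12896m",
            },
        },
    ]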

Does anyone have an idea whether this is a ‘normal’ amount of time for Snowplow to process this much data? We generally found the shred step to take longer than the enrich step, so optimistically it could have been 40 minutes quicker. Really, we’re just interested in ballpark figures.

Our configuration comes from the spreadsheet here.
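
In case it helps anyone sanity-check the numbers, here is a rough Python sketch of how I understand the sizing arithmetic; the ~57,344 MB of YARN-allocatable memory on an m4.4xlarge, the 10% overhead rounded up to a 32 MB multiple, and the 2 × executors × cores parallelism rule are my assumptions, not something taken from the spreadsheet itself:

    import math

    # Assumptions (mine, not necessarily the spreadsheet's): ~57,344 MB of
    # memory is allocatable to YARN on an m4.4xlarge, the memory overhead is
    # ~10% of each container rounded up to a 32 MB multiple, and the default
    # parallelism is 2 * executors * cores.
    VCPUS = 16
    YARN_MEMORY_MB = 57344
    CORES_PER_CONTAINER = 4

    containers = VCPUS // CORES_PER_CONTAINER    # 4 (1 driver + 3 executors)
    executor_instances = containers - 1          # 3

    memory_per_container = YARN_MEMORY_MB // containers              # 14336 MB
    overhead = math.ceil(memory_per_container * 0.10 / 32) * 32      # 1440 MB
    heap = memory_per_container - overhead                           # 12896 MB

    default_parallelism = 2 * executor_instances * CORES_PER_CONTAINER  # 24

    print(f"spark.executor.instances           = {executor_instances}")
    print(f"spark.executor.cores               = {CORES_PER_CONTAINER}")
    print(f"spark.executor.memory              = {heap}m")
    print(f"spark.yarn.executor.memoryOverhead = {overhead}m")
    print(f"spark.default.parallelism          = {default_parallelism}")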

Thanks
Gareth


#2

That shred step seems unusually slow. If you turn off event deduplication and rerun on the same dataset, how long does the shredding take?


#3

@mike it takes 15 minutes to run the shredding on the 500MB dataset with the above cluster and config, but with no global event deduplication.