We just ran some experiments processing small data volumes with the Snowplow batch pipeline and found that, to process 500MB of CloudFront logs, it took the
- enrich step 37m; and
- shred step 1h23m
This was on an m4.4xlarge node (16 vCPU and 64GB RAM). The shred step has global event deduplication enabled, and the DynamoDB requests were throttled slightly.
Our Spark configuration is:
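(A sketch of the kind of `config.yml` EMR/Spark section the spreadsheet produces for an m4.4xlarge; the values below are illustrative assumptions, not necessarily the exact settings we used.)

```yaml
configuration:
  yarn-site:
    yarn.nodemanager.vmem-check-enabled: "false"
    yarn.nodemanager.resource.memory-mb: "57344"   # YARN memory available on an m4.4xlarge
    yarn.scheduler.maximum-allocation-mb: "57344"
  spark:
    maximizeResourceAllocation: "false"
  spark-defaults:
    spark.dynamicAllocation.enabled: "false"
    spark.executor.instances: "6"                  # illustrative: 6 x (7G + 1G overhead) + driver = 56G
    spark.executor.cores: "2"
    spark.executor.memory: "7G"
    spark.yarn.executor.memoryOverhead: "1024"
    spark.driver.memory: "7G"
    spark.driver.cores: "1"
    spark.yarn.driver.memoryOverhead: "1024"
    spark.default.parallelism: "24"                # 2 x executors x cores
```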
Does anyone have an idea whether this is a ‘normal’ amount of time for Snowplow to process this much data? We generally found the shred step to take longer than the enrich step, so optimistically it could have been ~40 minutes quicker. Really we’re just interested in ballpark figures.
Our configuration comes from the spreadsheet here.