We have been using Snowplow for a while now and have recently started adding more events, which has increased our volume significantly.
One thing we have noticed is that the shred step is taking longer and longer, whereas we expected the enrich step to be the time-consuming one.
Our cluster config is:
Master: 1 x m1.medium
Core: 3 x i2.2xlarge
Task: 40 x m3.2xlarge
The enrich step took 1h59m and the shred step took 6h48m. Is there something about our config that is causing this, or is it normal?
The files being processed come from the stream collector, at approximately 40 MB compressed per batch. This run covered roughly 24 hours' worth of data, because we had an issue on the previous run.
I'm not sure whether I need to change the cluster config to speed up the shred step, or whether something else is going on.
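To put those two timings side by side, here is a quick Python sketch that converts them to minutes and works out the ratio. The helper name `to_minutes` is only for illustration; the durations are the ones quoted above.

```python
def to_minutes(hours, minutes):
    """Convert an hours/minutes duration to total minutes."""
    return hours * 60 + minutes

# Durations reported for this run
enrich_minutes = to_minutes(1, 59)
shred_minutes = to_minutes(6, 48)

# How many times longer shred took than enrich
ratio = shred_minutes / enrich_minutes

print(f"enrich: {enrich_minutes} min, shred: {shred_minutes} min, ratio: {ratio:.2f}x")
```

So shred took roughly 3.4 times as long as enrich on this run.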