So yesterday we did some stress testing on our Snowplow pipeline. Every stage went fine until we reached the Snowflake Loader stage, where it spins up an EMR cluster to process the previous hour's events. For one hour of events during our stress test, there are 1,121 enriched, gzipped files in the S3 bucket for the good stream written by the S3 Loader, each averaging around 30 KB in size.
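For reference, this is roughly how I measured the file count and average size (a minimal sketch; the bucket and prefix names here are placeholders, not our real ones):

```python
import boto3

# Placeholders: substitute the actual good-stream bucket and hour prefix
BUCKET = "my-snowplow-enriched-good"
PREFIX = "2019-11-20-14/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

count = 0
total_bytes = 0
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        count += 1
        total_bytes += obj["Size"]

if count:
    print(f"{count} files, average size {total_bytes / count / 1024:.1f} KB")
```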
We spun up an EMR cluster with one m4.large master and one m4.large core node to process these events. However, it has been running for 16 hours and still hasn't finished; judging by the task queue, it will take another day or two to get through everything.
Meanwhile, CPU utilisation, memory utilisation, disk space, and disk queue length are all well below capacity, so it's hard to believe the slowness comes from having too few instances or underpowered machines.
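For what it's worth, this is the kind of CloudWatch query I used to sanity-check the utilisation numbers (again a sketch; the instance ID stands in for our core node's actual ID):

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder: EC2 instance ID of the EMR core node
INSTANCE_ID = "i-0123456789abcdef0"

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    StartTime=datetime.utcnow() - timedelta(hours=16),
    EndTime=datetime.utcnow(),
    Period=3600,  # one datapoint per hour
    Statistics=["Average"],
)

# Print hourly average CPU over the life of the job so far
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average']:.1f}%")
```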
Any suggestions as to why this is happening, and how I can optimise this in our pipeline?