We are trying to estimate the resources required to handle the load we plan to put on the system over the next 3 months.
Before getting to the questions, here is some context.
We ran a load test of 20M events over a period of 2 hours, using the Avalanche tool.
We also processed the events through EMR, which took 1 day and 9 hours (33 hours) to process all 20M events. The instance type used was m1.medium. The average CPU utilization of each machine (1 master and 2 core nodes) was as follows:
Master - between 15% and 20% for the whole period.
Core I - above 95% for the whole period.
Core II - above 95% for the first 3h, below 5% for the next 6h, above 90% for the rest of the period.
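To put these numbers in perspective, here is a rough sketch of the throughput we observed. It uses only the figures from this test run; the function and constant names are my own, and treating Core II's 6 low-utilization hours as fully idle is a simplifying assumption.

```python
# Figures from the test run described above.
TOTAL_EVENTS = 20_000_000
JOB_HOURS = 24 + 9          # 1 day and 9 hours
CORE_NODES = 2
CORE_II_IDLE_HOURS = 6      # assumption: "below 5%" is treated as idle


def throughput_per_second(events, hours):
    """Average events processed per second over the whole job."""
    return events / (hours * 3600)


def busy_node_hours(job_hours, nodes, idle_hours):
    """Core node-hours actually spent working, after subtracting idle time."""
    return job_hours * nodes - idle_hours


events_per_sec = throughput_per_second(TOTAL_EVENTS, JOB_HOURS)
busy_hours = busy_node_hours(JOB_HOURS, CORE_NODES, CORE_II_IDLE_HOURS)
events_per_busy_node_hour = TOTAL_EVENTS / busy_hours

print(f"Cluster throughput: {events_per_sec:.0f} events/s")
print(f"Busy core node-hours: {busy_hours} of {JOB_HOURS * CORE_NODES}")
print(f"Events per busy node-hour: {events_per_busy_node_hour:,.0f}")
```

So the two core nodes together averaged roughly 168 events per second, and about 9% of the available core node-hours (6 of 66) went unused.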
From the figures above, it is clear that CPU is the bottleneck on the core machines. So the questions are:
- Why was the Core II CPU idle for 6h?
- What is the best approach to reduce the time taken by the EMR job: increasing the number of instances, or increasing the compute capacity of each machine?
- The 20M events above were in 2 to 3 log files, but in the real scenario there will be 24 log files containing the same number of events. Will that drastically change the CPU utilization pattern described above?
We would like the EMR job to complete within 4 hours at most. What configuration would you recommend, based on real-life experience if you have any?
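For reference, this is my own back-of-the-envelope estimate of what the 4-hour target implies. It is only a sketch: it assumes the job scales linearly with the number of core nodes and that we stay on m1.medium, neither of which I have verified. The function name and the `idle_hours` parameter are hypothetical helpers for illustration.

```python
import math

# Figures from the test run described above.
JOB_HOURS = 33       # 1 day and 9 hours
CORE_NODES = 2
TARGET_HOURS = 4


def nodes_for_target(job_hours, nodes, target_hours, idle_hours=0):
    """Core nodes needed to finish within target_hours.

    Assumes perfectly linear scaling (an optimistic assumption) and the
    same instance type as the test run.
    """
    node_hours = job_hours * nodes - idle_hours
    return math.ceil(node_hours / target_hours)


speedup_needed = JOB_HOURS / TARGET_HOURS
# Conservative estimate: count every node-hour as necessary work.
upper = nodes_for_target(JOB_HOURS, CORE_NODES, TARGET_HOURS)
# Optimistic estimate: discount the 6 hours Core II sat idle.
lower = nodes_for_target(JOB_HOURS, CORE_NODES, TARGET_HOURS, idle_hours=6)

print(f"Speedup needed: {speedup_needed:.2f}x")
print(f"Estimated m1.medium core nodes: {lower} to {upper}")
```

Under those assumptions we would need an 8.25x speedup, or somewhere around 15 to 17 m1.medium core nodes. I would like to know whether that matches what others have seen in practice, or whether fewer, larger instances would be the better choice.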