We are using Stream Collector 0.17.0 with Kafka. We are observing that when Kafka producer latency increases, the collector's health API starts to fail: the latency of the entire collector, including the health check endpoint, climbs into the minutes. This causes the whole infrastructure (all the nodes) to be marked unhealthy, bringing the entire pipeline to a halt.
We currently handle around 2k req/s at peak (roughly a third of the day). The collector topic has 50 partitions. Everything goes well while producer latency stays around the 2 ms level, but it all falls apart when latency rises to 4-5 ms. I understand that is roughly a 2x increase, but it doesn't seem anywhere near high enough to bring everything to a halt.
I have also noticed that when this unhealthy situation occurs, the system.disk.read_time_pct metric rises roughly 30x. The increase lines up with broker activity such as partition reassignment, or with the EBS burst balance falling due to a sudden surge of incoming messages on some other topic. Yet while all the other systems (producers and consumers) stay unaffected, the Snowplow pipeline comes to a halt.
What could be the reason for this? Do you think we need to make some configuration adjustments to handle this scenario? We haven't changed anything from the default configuration.
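In case it helps, this is roughly what the relevant part of our config looks like. It is essentially the stock example config, so the topic names, broker list and buffer values below should be read as illustrative placeholders rather than values we have tuned:

```hocon
collector {
  streams {
    # Raw event topics (the good topic is the one with 50 partitions)
    good = "snowplow-collector-good"   # placeholder name
    bad  = "snowplow-collector-bad"    # placeholder name

    sink {
      enabled = kafka
      brokers = "kafka-1:9092,kafka-2:9092"  # placeholder broker list
      retries = 0
    }

    # Events are buffered in the collector before being flushed to the sink;
    # these are placeholder values, we have not tuned this section at all
    buffer {
      byteLimit   = 4000000
      recordLimit = 500   # if I understand correctly this may be ignored by the Kafka sink
      timeLimit   = 5000
    }
  }
}
```

If tuning the buffer section (or anything else here) is the recommended way to absorb a producer latency spike like this, some pointers on sensible values for our throughput would be appreciated.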
Thanks for the help