Stream collector health check fails after a slight increase in Kafka producer latency

Hey there,

We are using stream collector 0.17.0 with Kafka. We are observing that when there is an increase in Kafka producer latency, the collector's health API starts to fail. The latency of the entire collector, including the health check API, goes up to minutes. This causes the whole infra (all the nodes) to go into an unhealthy state, bringing the entire pipeline to a halt.

We currently peak at around 2k req/s (for about a third of the day). There are 50 partitions for the collector topic. Everything goes well when producer latency is at the 2 ms level, but everything falls apart when it increases to 4-5 ms. I understand the latency doubles, but that doesn't seem high enough to bring everything to a halt.

I have also noticed that when this unhealthy situation happens, the system.disk.read_time_pct metric goes to 30X. The increase in system.disk.read_time_pct coincides with some broker activity, such as a partition reassignment, or the EBS burst balance falling due to a sudden surge of incoming messages on some other topic. While all other systems (producers and consumers) stay unaffected, the Snowplow pipeline comes to a halt.

What could be the reason for this? Do you think we need to make some configuration adjustments to make the above scenario work? We haven't made any changes to the default configuration.
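If it is a matter of tuning, something like the following is what I would imagine adjusting. This is only a sketch based on the example config; the `producerConf` passthrough and all the values are assumptions, not our actual settings (we run the defaults):

```hocon
# Sketch of the collector's Kafka sink block (application.conf).
# producerConf passthrough and the values below are assumptions,
# not our current config -- topic and broker names are placeholders.
collector {
  streams {
    good = "collector-good"
    sink {
      enabled = kafka
      brokers = "broker1:9092,broker2:9092"
      retries = 3
      # Raw Kafka producer settings, if the version supports passing them through:
      producerConf {
        acks = 1
        "max.block.ms" = 5000      # fail fast instead of blocking request threads
        "buffer.memory" = 33554432 # bound how much unsent data can accumulate
        "linger.ms" = 5
      }
    }
  }
}
```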

Thanks for the help

Hi @jimy2004king,

My understanding is that:

  1. There is some Kafka activity, like a partition reassignment, as indicated by the system.disk.read_time_pct metric going to 30X on the brokers
  2. As a result, producer latency increases a bit on the collector
  3. The collector starts to misbehave

1 and 2 are normal behavior but 3 is not.

To better understand what’s happening I have a few questions:

a. Where/how do you measure this latency?
b. Does this happen only when there is a burst, or can it happen randomly?
c. How many instances of the collector do you have?
d. How do the CPU and memory usage of the collector behave when this happens?

e. So when an event is sent to the collector, this event makes it to Kafka a few minutes later?

f. You mean that the collector can’t receive any event any more?

We have a Datadog agent running on the Kafka hosts, and it keeps reporting the producer latency.

This only happens when system.disk.read_time_pct increases, either on all Kafka volumes or on any single Kafka volume on which the Snowplow topics reside.

It's on autoscaling, with containers of 200 CPU units and 300 MB of memory (in AWS ECS terms). The number of containers goes from 2 to 12, with an average of 6.

The CPU and MEM utilization drops roughly 3X for that period, and due to the drop in CPU, scale-in kicks in.

Actually, no events come to the collector. As I said, the health check starts failing because of the high latency, so all of a sudden all the targets in the target group are marked unhealthy and all requests start failing with 503.
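One thing we may try in the meantime is relaxing the target group health check so that a brief latency spike doesn't take every node out at once. A sketch of what I mean (Terraform; the resource names, port, and thresholds are illustrative, not our actual setup):

```hcl
# Hypothetical target group for the collector; names and values are illustrative.
resource "aws_lb_target_group" "collector" {
  name     = "snowplow-collector"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    timeout             = 10 # allow slow responses during broker activity
    interval            = 30
    healthy_threshold   = 2
    unhealthy_threshold = 5 # tolerate several slow checks before marking unhealthy
  }
}
```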

Yes. Due to above stated reason.

@BenB Any update here?

Hi @jimy2004king, sorry, we have not been able to investigate the issue yet; we plan to do it at the beginning of next week.