Hi @Daniel_Darescu, here is my interpretation, based on the documentation for RetrySettings, the config you shared, and your exception.
The collector’s publisher uses a timeout of 10 seconds. Because both the initial timeout and the max timeout are 10 seconds, the timeout is fixed (the RpcTimeoutMultiplier is effectively meaningless).
I think the publisher tries and fails to publish the event 5 times in a row, each time allowing exactly 10 seconds for the RPC call. It always retries immediately (no delay), because a timeout is treated differently from an error response.
On the 6th attempt, 9.999950601 seconds remain until the 60-second total timeout is reached. So the publisher makes a 6th RPC call, but this time allows 9.999950601 seconds instead of 10, which explains the exception message you shared.
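To make the arithmetic concrete, here is a minimal sketch (not the collector's actual code) of the budget described above: a fixed 10 s per-RPC timeout inside a 60 s total timeout, with immediate retries. The class name and the ~10 µs of per-attempt bookkeeping overhead are my own assumptions, chosen only to illustrate how attempt 6 ends up with slightly less than 10 seconds:

```java
public class RetryBudgetSim {
    static final long TOTAL_NS    = 60_000_000_000L; // 60 s total timeout
    static final long RPC_NS      = 10_000_000_000L; // fixed 10 s per attempt
    static final long OVERHEAD_NS = 9_880L;          // assumed bookkeeping per attempt

    public static void main(String[] args) {
        long elapsed = 0;
        int attempt = 0;
        while (elapsed < TOTAL_NS) {
            attempt++;
            // Each attempt is only allowed whatever remains of the total budget.
            long allowed = Math.min(RPC_NS, TOTAL_NS - elapsed);
            System.out.printf("attempt %d: allowed %.9f s%n", attempt, allowed / 1e9);
            // Assume every RPC times out, consuming its full allowance, and the
            // publisher retries immediately (no backoff delay for timeouts).
            elapsed += allowed + OVERHEAD_NS;
        }
    }
}
```

With those assumed numbers, attempts 1–5 each get the full 10 seconds, and attempt 6 gets only the remainder of the 60-second budget, roughly the 9.99995… figure from your exception.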
Now to the point!
I completely agree the collector needs fixing to make these RPC timeouts configurable, and probably with a longer
maxRpcTimeout by default.
Until we have made that change, you might still be able to improve your situation by setting backoffPolicy.totalBackoff to an even larger value, e.g. 9223372036854. The publisher should then keep retrying until an RPC call eventually completes in under 10 seconds.
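For example, assuming backoffPolicy sits in your config the way your snippet suggests (I'm guessing at the surrounding structure here), the change might look like:

```yaml
# Assumed shape; adjust to wherever backoffPolicy lives in your actual config.
backoffPolicy:
  totalBackoff: 9223372036854  # ~292 years, i.e. effectively "retry forever"
```

That particular number is Long.MAX_VALUE nanoseconds expressed in milliseconds, so it should stay overflow-safe if the value is converted to nanos internally.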
One final thought… you mentioned you have 45 collectors, but are you sure the load balancer is configured to share events equally across all collectors? If it re-uses TCP connections to the same collector, that might be causing the one unlucky publisher to exceed Pub/Sub limits.