Hi @sdbeuf, apologies for the long radio silence on this.
System latency measures how long an element takes to travel from source to sink. In this case, that is the sum of the time spent in the source (the good Pub/Sub topic) and the time spent inside the BQ Loader.
We therefore have two possible avenues of investigation:
- the data may be spending too long stuck in the topic instead of being consumed;
- the data may be taking too long inside the Dataflow pipeline.
To further debug the first option, there is a Stackdriver metric called
pubsub.googleapis.com/subscription/oldest_unacked_message_age, which shows the age (in seconds) of the oldest unacknowledged message in a subscription. Are you able to check the values of that metric for the time intervals when the spikes occur? If the two correlate, the likely reason for the latency build-up is that the data is not being consumed fast enough.
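If it helps, a query along these lines in the Metrics Explorer's MQL editor should chart that metric (the subscription name is a placeholder, so adjust it to yours):

```
fetch pubsub_subscription
| metric 'pubsub.googleapis.com/subscription/oldest_unacked_message_age'
| filter (resource.subscription_id == 'my-subscription')
| group_by 1m, [age_mean: mean(value.oldest_unacked_message_age)]
```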
On the second option, inside the Dataflow job there is a step where events are validated against their schemas, and that requires a call to an Iglu server. It might be the case that some events trigger more calls (perhaps they have multiple contexts?). Could you check a chart of network IO, perhaps via the
loadbalancing.googleapis.com/https/* Stackdriver metrics, to see if any extra activity coincides with the latency build-up?
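To make the correlation check concrete rather than eyeballing two charts, you could export both time series (e.g. as per-minute samples) and compute a Pearson correlation. A minimal sketch with made-up numbers (the sample values below are purely illustrative, not from your pipeline):

```python
# Sketch: correlate exported samples of Dataflow system latency with the
# oldest_unacked_message_age metric. All values here are hypothetical.
from statistics import mean, stdev

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length series.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Hypothetical per-minute samples around a latency spike:
latency_s = [30, 32, 31, 240, 600, 580, 45, 33]    # Dataflow system latency (s)
unacked_age_s = [5, 6, 5, 210, 550, 530, 20, 7]    # oldest unacked message age (s)

r = pearson(latency_s, unacked_age_s)
print(f"correlation: {r:.2f}")
```

A coefficient close to 1.0 would point at the first avenue (a consumption backlog in the topic); a weak correlation would make the in-pipeline explanation more likely.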