GCP realtime pipeline - growing pubsub queue

Hi

I have a pipeline in production (on GCP) which I provision completely using Terraform. It has a VM group for the collector and a VM group to run Beam Enrich and the BigQuery Mutator & Loader.

So I have monitoring on my Pub/Sub subscriptions, and every now and then it flags that the collector.good queue is building up. It always sorts itself out, but in my opinion it’s not normal that messages occasionally get stuck in the subscription for 5-6 minutes.

I did see that the wall time of streaming inserts is long.

Do you have an idea what might be the issue here?

Thanks in advance
Sam

Hi @sdbeuf, are you using a single worker or is it an auto-scaling Dataflow job? One very common cause for buildup of latency in the Pubsub topic is that the job is not consuming from it while it’s resizing (either up or down). Once the resize is complete, the job catches up.

If you go to the ‘JOB METRICS’ tab in the Dataflow job UI, you can compare the latency and auto-scaling charts to see if any auto-scaling action aligns with latency buildup.
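
If it helps, here is a rough sketch of pulling the same signals with the Cloud Monitoring Python client rather than the UI. The project and job name are placeholders, and I’m assuming the standard dataflow.googleapis.com/job/system_lag and job/current_num_vcpus metrics; adjust to whatever you actually track.

    # Rough sketch: compare Dataflow system lag with the number of vCPUs in use,
    # to see whether lag spikes line up with autoscaling events.
    # PROJECT and JOB_NAME are placeholders.
    import time
    from google.cloud import monitoring_v3

    PROJECT = "my-gcp-project"   # placeholder
    JOB_NAME = "beam-enrich"     # placeholder Dataflow job name

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"start_time": {"seconds": now - 6 * 3600}, "end_time": {"seconds": now}}
    )

    for metric in ("dataflow.googleapis.com/job/system_lag",
                   "dataflow.googleapis.com/job/current_num_vcpus"):
        series = client.list_time_series(
            request={
                "name": f"projects/{PROJECT}",
                "filter": (f'metric.type = "{metric}" AND '
                           f'resource.labels.job_name = "{JOB_NAME}"'),
                "interval": interval,
                "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
            }
        )
        for ts in series:
            for point in ts.points:
                # Value type depends on the metric; fall back to double if needed.
                value = point.value.int64_value or point.value.double_value
                print(metric, point.interval.end_time, value)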

Hi

The autoscaling is on. Here you can find the graphs


Looking at the CPU graph, it makes sense that it didn’t scale, so the cause must be somewhere else.

What I changed recently is that I’m now using iglu-resolver 1.0.4 instead of 1.0.1.

Hi @sdbeuf,

A few questions to hopefully help diagnose this.

  • Which version of BigQuery Loader are you using?
  • Have you noticed a correlation between these latency peaks and spikes in incoming traffic?
  • Can you please clarify: which component do you mean by iglu-resolver-1.0.4?

Hi @dilyan

0.4.0

No, no correlation, as you can also see that the CPU of the Dataflow job doesn’t peak at all.

It’s the resolver config schema used in the Beam Enrich pipeline. In my iglu_resolver.json, the schema version I use is 1-0-3 (sorry, not 1-0-4).

  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-3",

Thanks in advance!

Hi @sdbeuf, apologies for the long radio silence on this.

System latency is a measure of how long an element takes to be processed from source to sink. That is, it is a sum of time spent in the source (the good Pub/Sub topic) plus time spent inside the BQ Loader.

We therefore have two possible avenues of investigation:

  • the data may be spending too long stuck in the topic instead of being consumed
  • the data may be taking too long inside the Dataflow pipeline.

To further debug the first option, there is a Stackdriver metric called pubsub.googleapis.com/subscription/oldest_unacked_message_age, which shows the age (in seconds) of the oldest unacknowledged message in a subscription. Are you able to check the values of that metric for the time intervals when the spikes occur? If there is a correlation between the two, the likely reason for the build up in latency is that the data is not being consumed.
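
As a rough sketch (the subscription name below is a placeholder), that metric can be pulled with the Cloud Monitoring Python client so you can line its values up against the spikes:

    # Rough sketch: age of the oldest unacknowledged message on the
    # collector good subscription. The subscription id is a placeholder.
    import time
    from google.cloud import monitoring_v3

    PROJECT = "my-gcp-project"           # placeholder
    SUBSCRIPTION = "collector-good-sub"  # placeholder subscription id

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"start_time": {"seconds": now - 6 * 3600}, "end_time": {"seconds": now}}
    )

    series = client.list_time_series(
        request={
            "name": f"projects/{PROJECT}",
            "filter": (
                'metric.type = "pubsub.googleapis.com/subscription/oldest_unacked_message_age" '
                f'AND resource.labels.subscription_id = "{SUBSCRIPTION}"'
            ),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for ts in series:
        for point in ts.points:
            # Value is the age of the oldest unacked message, in seconds.
            print(point.interval.end_time, point.value.int64_value)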

On the second option, inside the Dataflow job there is a step where events are checked against their schema and that requires a call to an Iglu server. It might be the case that some events trigger more calls (perhaps they have multiple contexts?). Could you check a chart of network IO, perhaps via the loadbalancing.googleapis.com/https/* Stackdriver metrics, to see if any extra activity is happening that coincides with the latency build up?
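
The same query pattern works for the load balancer metrics. Here is a sketch, under the assumption that the plain request count is the most useful one, aggregated to a per-minute rate so it is easier to overlay on the latency chart:

    # Rough sketch: per-minute rate of HTTPS load balancer requests,
    # summed across backends, to compare against the latency spikes.
    import time
    from google.cloud import monitoring_v3

    PROJECT = "my-gcp-project"  # placeholder

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"start_time": {"seconds": now - 6 * 3600}, "end_time": {"seconds": now}}
    )
    aggregation = monitoring_v3.Aggregation(
        {
            "alignment_period": {"seconds": 60},
            "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
            "cross_series_reducer": monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
        }
    )

    series = client.list_time_series(
        request={
            "name": f"projects/{PROJECT}",
            "filter": 'metric.type = "loadbalancing.googleapis.com/https/request_count"',
            "interval": interval,
            "aggregation": aggregation,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for ts in series:
        for point in ts.points:
            print(point.interval.end_time, point.value.double_value)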

Hi @dilyan

  • With purple: collector.good
  • With blue: enriched.good

The highest peaks are around 2.5 minutes.

Backend request count:


request bytes:

I could keep sending you all the loadbalancing.googleapis.com/https/* monitoring views, but they all follow the same pattern.

At the moment I cannot find a correlation between the HTTPS requests and the Pub/Sub queues.

Thanks in advance,
Sam

Thanks for sending these @sdbeuf. Unfortunately, they don’t help to establish a firm culprit.

Digging through the Dataflow metrics, it looks like there is a metric called dataflow.googleapis.com/job/per_stage_system_lag. I wonder if that could help pinpoint which step in the Dataflow job seems to contribute most to overall system lag. If you filter by job ID and group by stage, we might be able to see if one of the steps seems to rise above the others.
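
For reference, here is a sketch of pulling that metric per stage. The job ID is a placeholder, and I’m assuming the stage name is exposed as a metric label called stage, which is what the charts seem to suggest:

    # Rough sketch: per-stage system lag for one Dataflow job, to see which
    # stage dominates. JOB_ID is a placeholder; the "stage" metric label name
    # is an assumption based on what shows up in the charts.
    import time
    from collections import defaultdict
    from google.cloud import monitoring_v3

    PROJECT = "my-gcp-project"           # placeholder
    JOB_ID = "your-dataflow-job-id"      # placeholder job ID

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"start_time": {"seconds": now - 6 * 3600}, "end_time": {"seconds": now}}
    )

    series = client.list_time_series(
        request={
            "name": f"projects/{PROJECT}",
            "filter": (
                'metric.type = "dataflow.googleapis.com/job/per_stage_system_lag" '
                f'AND resource.labels.job_id = "{JOB_ID}"'
            ),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )

    worst = defaultdict(int)
    for ts in series:
        stage = ts.metric.labels.get("stage", "unknown")
        for point in ts.points:
            # Value type depends on the metric; fall back to double if needed.
            value = point.value.int64_value or point.value.double_value
            worst[stage] = max(worst[stage], value)

    for stage, lag in sorted(worst.items(), key=lambda kv: kv[1], reverse=True):
        print(stage, lag, "seconds")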




The red and green lines (stages f15 & f17) seem to have fewer spikes than the others.

Thanks in advance,
Sam

@dilyan This thread has gone a bit quiet. Did you manage to find a potential issue?

Hi @sdbeuf, no, we haven’t nailed a culprit in this case.

We are now on version 0.5.1 of the BigQuery Loader apps. Would it be possible to upgrade? At the very least it would mean you’d be using the latest versions of Scio / Beam. 0.4.0 is using a Beam SDK whose version is now 4 releases behind the latest one.

Incidentally, the next BQ Loader release, 0.6.0, is slated to add a new (initially experimental) loader that does not rely on Beam / Dataflow, in case that is of interest.

Hello @dilyan
When is the release of BQ Loader 0.6.0 planned? As this is a costly Dataflow job, moving off it might mean a considerable cost reduction for us. If it’s not too far off, I would postpone the upgrade until 0.6.0.

No firm date but we are in the final pre-release stages, so hopefully soon.

Ok, super thanks for the information.

Hi @sdbeuf, this has now been released:
