Issues with the GCP streamloader

Hello,
I am running Snowplow on GCP. Since 2022-03-08 I have noticed that fewer and fewer events are being delivered to the BigQuery table. I checked the number of undelivered messages on the Pub/Sub topic subscription, and it has been increasing since that date (the sketch after the list below shows one way to query this metric). What I have checked so far:

  • I have a dedicated VM for running the Snowplow BigQuery StreamLoader (v1.0.2) processes with autoscaling rules. The process was, and still is, running, but doing almost nothing; at best it inserts 100k events per day.
  • I have checked quotas and API rates; nothing is even close to the limits.
  • Under APIs & Services, on the Cloud Pub/Sub API page, I noticed that google.pubsub.v1.Subscriber.StreamingPull latency has been elevated since exactly the date when the data started to pile up.
  • I also ran the BigQuery StreamLoader in DEBUG mode, but could not see any errors.
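
For reference, this is roughly how the backlog can be checked programmatically rather than in the console. A minimal sketch using the google-cloud-monitoring client; PROJECT_ID and SUBSCRIPTION_ID are placeholders for your own values:

```python
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-gcp-project"        # placeholder: your GCP project ID
SUBSCRIPTION_ID = "my-subscription"  # placeholder: the loader's input subscription

client = monitoring_v3.MetricServiceClient()

# Look at the last 6 hours of the backlog metric.
seconds = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": seconds},
        "start_time": {"seconds": seconds - 6 * 3600},
    }
)

# num_undelivered_messages is the metric behind the console's unacked
# messages chart; a steadily rising series confirms the backlog.
results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "pubsub.googleapis.com/subscription/num_undelivered_messages"'
            f' AND resource.labels.subscription_id = "{SUBSCRIPTION_ID}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.int64_value)
```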

Does anyone else have a similar issue on GCP? Any ideas on where to look for the root cause, so that I stop losing data to the subscription’s retention limit? Thanks a lot.

Hi @popi, the first place to check is the failed inserts. The best way to do that in 1.0.2 is to check the logs of your repeater application. They should contain records of how long the repeater has been running and how many events it has processed. If the events ultimately could not be inserted, for whatever reason, they will have been written to the GCS bucket you’ve specified under repeater.output.deadLetters in your repeater config.hocon file. There you should find the failed events (bad rows), each with an error message explaining why it couldn’t be inserted.
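
If it helps, here is a rough sketch of pulling a few of those bad rows down for inspection with the google-cloud-storage client. The bucket name is a placeholder, and I’m assuming the files hold newline-delimited bad-row JSON:

```python
import json

from google.cloud import storage

BUCKET = "my-dead-letter-bucket"  # placeholder: the bucket set in repeater.output.deadLetters

client = storage.Client()

# Print the first few bad rows; each one carries the error message that
# explains why the insert failed (e.g. a column missing from the table).
for blob in client.list_blobs(BUCKET, max_results=5):
    for line in blob.download_as_text().splitlines():
        bad_row = json.loads(line)
        print(json.dumps(bad_row, indent=2))
```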

I would also recommend you upgrade to 1.2.0, which has much improved logs.

Thanks, that worked for me. I checked the deadLetters bucket and adjusted the schema of the BQ table, and then the data started flowing into BQ again.
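
For anyone hitting the same thing, this is roughly the kind of schema change that fixed it for me. A sketch using the google-cloud-bigquery client; the table ID and column name are placeholders, so take the real ones from the error messages in your bad rows:

```python
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.events")  # placeholder table ID

# BigQuery schema changes must be additive: copy the existing schema and
# append the field named in the bad rows' error message.
schema = list(table.schema)
schema.append(bigquery.SchemaField("missing_column", "STRING", mode="NULLABLE"))
table.schema = schema
client.update_table(table, ["schema"])
```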

Thanks for the update - really useful to other people.
Cheers,
Eddie