Failed inserts for time-partitioned table

Hi, I have set up a Snowplow pipeline in GCP. I am able to load the events table in BigQuery without any issue, but I ran into problems when I tried the following:

  1. I tried to set up a similar pipeline where my target table is a time-partitioned table. I’ve read the discussion Google Cloud Platform data pipeline optimization. Before running the pipeline, I manually created the partitioned table on the derived_tstamp column (with DAY granularity), using the same schema as suggested in the atomic schema (see the sketch after this list). However, when I fire events from the tracker, they do not load into the table and instead end up in the failed inserts topic in Pub/Sub, even though the mutator is able to mutate the table and the custom columns are being added. Could you please suggest a possible reason and how to resolve it?

  2. Also, my ultimate aim is to create a partitioned table with event_name as the partitioning column. Please advise how I can achieve this.
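
For context, a table like the one in point 1 can be created up front with the google-cloud-bigquery Python client. The sketch below is only an illustration, not my exact setup: the project and dataset names are placeholders and the schema is truncated (in practice the full atomic events schema would be used).

```python
from google.cloud import bigquery

# Minimal sketch: create the events table partitioned on derived_tstamp with DAY
# granularity before starting the pipeline. Names and the truncated schema below
# are placeholders; the full Snowplow atomic schema would be used in practice.
client = bigquery.Client()

schema = [
    bigquery.SchemaField("event_id", "STRING"),
    bigquery.SchemaField("collector_tstamp", "TIMESTAMP", mode="REQUIRED"),
    bigquery.SchemaField("derived_tstamp", "TIMESTAMP"),
    # ... remaining columns from the atomic events schema ...
]

table = bigquery.Table("my-project.atomic.events", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="derived_tstamp",
)
client.create_table(table)
```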

Failed inserts, at least while the mutator is still catching up, are fairly standard. As long as they are being retried (by the BQ Repeater), they should appear in the pipeline after the mutator has successfully created the columns.

BigQuery won’t (at the moment) let you partition by a string column. The advice here is to partition first by a timestamp and then cluster within that partition, using something like event_name.
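
As an illustration of the partition-then-cluster approach (not the exact loader setup), the same Python client sketch extends to clustering by setting clustering_fields before creating the table; the table name and truncated schema are placeholders.

```python
from google.cloud import bigquery

# Sketch: partition by derived_tstamp (DAY) and cluster within each partition by
# event_name. Table name and the truncated schema are placeholders.
client = bigquery.Client()

schema = [
    bigquery.SchemaField("derived_tstamp", "TIMESTAMP"),
    bigquery.SchemaField("event_name", "STRING"),
    # ... remaining columns from the atomic events schema ...
]

table = bigquery.Table("my-project.atomic.events", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="derived_tstamp",
)
table.clustering_fields = ["event_name"]  # string columns are fine for clustering
client.create_table(table)
```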


Hi @mike,
Thanks for your response. I have been trying for a few days now, but the failed insert records do not seem to be loading at all.

I will definitely try the clustering option. Thank you.

@Abhishek_Singh With regards to the failed inserts not being retried, are you trying to load real-time data (i.e. load data as it is being collected) or are you trying to load a historical archive?

The mutator needs some time to make the table changes, so when the repeater sees a failed insert, it won’t retry it immediately. It waits some time (15 minutes by default) before it tries to re-insert the event, which gives the mutator enough time to mutate the table. However, this waiting period is calculated as the difference between now() and the collector_tstamp of the event. So if you are trying to load historical data whose collector_tstamp is already more than 15 minutes before now(), the repeater will retry those events straight away. That leaves very little time for the mutator to do its job, and the events will ultimately end up in your dead-end bucket on GCS.
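
To make the timing concrete, here is a rough sketch of the back-off check described above (illustration only, not the Repeater’s actual implementation):

```python
from datetime import datetime, timedelta, timezone

# Illustration only: an event is re-inserted once its collector_tstamp is older
# than the back-off window (15 minutes by default).
BACKOFF = timedelta(minutes=15)

def ready_to_reinsert(collector_tstamp: datetime) -> bool:
    return datetime.now(timezone.utc) - collector_tstamp > BACKOFF

# Real-time data: collector_tstamp is recent, so the check fails and the event
# waits, giving the mutator time to add the new columns.
# Historical data: collector_tstamp is already old, so the check passes straight
# away and the event may be retried before the mutator has mutated the table.
```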


@dilyan, thanks for the detailed explanation of the failed inserts. This is working fine now. The problem in my case, I believe, was that I was stopping the BigQuery Loader after waiting for some time.