We are saving our Snowplow events (atomic + custom contexts) both to BigQuery and to a Storage Bucket. Because of a bug in our pipeline, we now need to copy the events from the bucket back into BigQuery. Is there any documentation, or does anyone have ideas, on how to associate the raw events in the bucket with the correct event schema?
I would really appreciate your help
I think that when you say ‘raw events’, you’re referring to the events that come out of the enrich job - if that’s not correct, please let me know.
If that’s the case, I would recommend writing a Dataflow job to insert the data back into the ‘good’ Pub/Sub topic, so the loader can re-load it into BigQuery. The BQ loader will consume the events and handle all the schema resolution and table mutations required for you. You could try to do it manually, but you’d basically be re-writing the job the loader already does - which sounds to me like an awful lot of work.
To give you an example, this project reads bad row data (which comes from failures at the loader) and re-inserts it from GCS into Pub/Sub. That is basically what you need to do, except your rows are already good, so you just need to insert them as-is.
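As a rough illustration of that GCS → Pub/Sub replay (not the linked project itself): a minimal Python sketch, assuming your bucket holds newline-delimited enriched TSV events and you publish each line unchanged as one Pub/Sub message. All names here (bucket, prefix, project, topic) are placeholders for your own pipeline, and you'd likely want a proper Dataflow/Beam job for real volumes.

```python
"""Sketch: replay enriched Snowplow events from a GCS bucket back into the
'good' Pub/Sub topic so the BigQuery loader re-processes them.

Assumptions (adjust for your pipeline): blobs contain newline-delimited
enriched events, one event per line, published as-is. Requires the
google-cloud-storage and google-cloud-pubsub packages.
"""
from typing import Iterator


def iter_events(raw: bytes) -> Iterator[bytes]:
    """Split a blob's contents into individual event lines, skipping blanks."""
    for line in raw.split(b"\n"):
        if line.strip():
            yield line


def replay_bucket(bucket_name: str, prefix: str, project: str, topic: str) -> int:
    """Publish every event under gs://<bucket>/<prefix> to the given topic.

    Returns the number of messages published. Placeholder wiring: run this
    against a test topic first to confirm the loader picks the events up.
    """
    from google.cloud import pubsub_v1, storage

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project, topic)
    count = 0
    for blob in storage.Client(project=project).list_blobs(bucket_name, prefix=prefix):
        for event in iter_events(blob.download_as_bytes()):
            # Block on each publish so failures surface immediately;
            # batch the futures instead if throughput matters.
            publisher.publish(topic_path, data=event).result()
            count += 1
    return count
```

Since the loader handles schema resolution from the event payload itself, the replay job never needs to know which schema each row uses - that's the whole reason to go back through the topic rather than loading into BigQuery directly.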
Hope that helps!