About badrows pipeline choices


I have a question about Snowplow bad rows. I want to build a bad rows pipeline to monitor the health of the real-time Snowplow pipeline. Currently I am following this blog post (Storing Snowplow bad row events in BigQuery | by Jonathan Merlevede | datamindedbe | Medium) and using a Google Cloud Function to trigger the pubsub_to_bigquery function whenever data arrives in the bad, enriched-bad, bq-failed-inserts, or bq-bad-rows topics. But I find it difficult to define the schema of the bad rows table in BigQuery, since the schema of the bad row data coming from Pub/Sub sometimes conflicts with the one in the Iglu repo.
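For context, the transformation in my function is roughly the sketch below (this is my own simplification, not the blog's exact code, and the function name is mine). Bad rows are self-describing JSON with a `schema` and a `data` field, and serialising the nested `data` back to a string is how I try to dodge the schema drift:

```python
# Minimal sketch (my simplification, not the blog's exact code) of the
# Pub/Sub -> BigQuery mapping. Snowplow bad rows are self-describing
# JSON with "schema" and "data" keys; keeping "data" as a serialised
# string means the BigQuery column is just STRING, so no conflict with
# the Iglu schema can occur at insert time.
import base64
import json

def bad_row_to_bq_row(event_data: bytes) -> dict:
    """Decode one Pub/Sub message (base64-encoded, as Cloud Functions
    deliver it) and flatten it into a BigQuery-friendly row."""
    bad_row = json.loads(base64.b64decode(event_data).decode("utf-8"))
    return {
        "schema": bad_row.get("schema"),           # e.g. iglu:com.snowplowanalytics...
        "data": json.dumps(bad_row.get("data")),   # nested payload kept as a string
    }
```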

I also found another possible solution: the Snowplow Google Cloud Storage Loader (GitHub - snowplow-incubator/snowplow-google-cloud-storage-loader: Dataflow job to dump the content coming from a PubSub subscription into Cloud storage). But it is not real-time, and similar schema conflicts show up there as well; I am not sure why.

What do you think is a better way, or is there any other solution? I’d appreciate it if someone could give me some advice.

Hi @phxtorise! Using the Snowplow Google Cloud Storage Loader and loading the bad rows to GCS is certainly a viable option. From there, you can use BigQuery to build an external table on top of the GCS directory.

We have instructions on our docs website for how to build the BigQuery table. They show how to avoid the schema conflicts you mentioned by defining the table structure explicitly instead of relying on BigQuery's schema auto-detection.
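As a rough illustration of what an explicitly defined table can look like (the project, dataset, bucket path, and nested fields below are placeholders, and the `data` struct shows only a small subset of one bad row type; check the docs and the relevant Iglu schema for the exact structure):

```sql
-- Illustrative only: names and paths are placeholders. Declaring the
-- schema explicitly (plus ignore_unknown_values) avoids the drift you
-- get from auto-detect across differently-shaped bad rows.
CREATE EXTERNAL TABLE `my_project.snowplow_monitoring.bad_rows` (
  schema STRING,
  data STRUCT<
    processor STRUCT<artifact STRING, version STRING>,
    failure STRUCT<timestamp STRING>
  >
)
OPTIONS (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['gs://your-bad-rows-bucket/*'],
  ignore_unknown_values = TRUE
);
```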