BigQuery Loader with Unstructured Events

I am trying to understand how people use Snowplow unstructured events with BigQuery. Right now I see that it adds something like contexts_org_w3_performance_timing_1_0_0, which has the version number in the record name. What happens if there is a bump in that version number to something like 1_0_1? I would expect to see contexts_org_w3_performance_timing_1_0_1, and I would no longer get any data in the contexts_org_w3_performance_timing_1_0_0 record. This means that if I build any reports against the 1_0_0 name, they will be broken until I can edit them to coalesce the 1_0_0 and 1_0_1 columns. Is there any way to name the record something like 1_0 instead and, by convention, only add columns to the existing record on a patch increment, but create a new record column on a minor version increment?

I am just trying to figure out how to make this as future-proof as possible before we start depending on it too much.

Hi @camerondavison
The easiest way to deal with this at the moment is, as you suggest, to COALESCE the two versions of the schema together.

Currently, when the loader encounters a new or changed schema, it creates a full set of columns for that schema. Both sets of columns are required because some applications might still be tracking 1_0_0 while others are tracking 1_0_1, for example. Your suggestion to pin the columns to MINOR rather than PATCH is worth considering, but it isn’t currently how Snowplow loads into BigQuery, so coalescing them as a data modelling step is the way to go.
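As a rough illustration, a coalescing step in a data model could look something like the sketch below. This is a minimal sketch, not the loader's exact output: the dataset name and the dom_complete field are assumptions for illustration, and because contexts columns are repeated records, a scalar is pulled out of each version before coalescing.

```sql
-- Pick the value from the 1_0_1 context when it is populated,
-- otherwise fall back to the 1_0_0 context.
SELECT
  event_id,
  COALESCE(
    (SELECT ctx.dom_complete
     FROM UNNEST(contexts_org_w3_performance_timing_1_0_1) AS ctx
     LIMIT 1),
    (SELECT ctx.dom_complete
     FROM UNNEST(contexts_org_w3_performance_timing_1_0_0) AS ctx
     LIMIT 1)
  ) AS dom_complete
FROM `my_project.snowplow.events`
```

Downstream reports can then reference the coalesced field, so only this one modelling step has to change when a new schema version appears.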

When adding a new schema or updating one, I generally send a test event containing the new schema so that the table is mutated to include the new columns. Then I update my data modelling jobs to take the new columns into account before I switch any tracking over to the new version; this way none of the reports break during the transition period.

Thanks for the quick reply. What you said makes sense, but it definitely means we should not expect to change this very often, since it will require a model update for every change :confused:

It is generally a good idea to keep the number of schema changes as low as possible.

Try to spend some time designing your schemas so they are well thought out; it will be worth it in the long run. Changing schema versions causes work at both ends, from tracking changes through to data modelling changes. Changes will occur, but minimising them where possible is generally the best course of action.

It is often a good idea to have an upfront design session to try and design your core schemas for both the events you would like to track across your systems and the custom schemas (i.e. the common objects in your system) you will wish to track along with these events. Getting this right early on leads to far fewer changes.

Also, I’d advise using Snowplow Mini to test your new schemas and tracking. This way you will reduce the number of new schemas (and new columns in BigQuery) that you will end up creating.

Fair, though I think there is also some value in allowing the schema to be a little more fluid, especially for things like adding a field with some more context. Last time I used Snowplow I used it with Snowflake, and we sent the raw JSON into a column named with just the major version. That way, adding new fields did not disrupt any downstream systems, but they could start using the new fields when they were ready. Is there any way to skip the shredding step and instead just put the raw JSON into BigQuery, so that I can add fields without disrupting downstream processing?

I would like to avoid using super-generic fields just out of an unwillingness to make changes in too many places. If a schema change is a lot of work, I feel like we would need to design the schema upfront to be generic enough to accommodate any future changes.

There’s no way to skip this at the moment, but I think the BigQuery data model will likely change to something that is more accommodating to additions and revisions, similar to the Snowflake model (model as a column), rather than the current addition-as-a-column data model.

If you did want to avoid this, you could short-circuit the BigQuery loader by creating a consumer / Dataflow job that reads directly off the good PubSub topic, transforms it the way you’d like, and then inserts it into BigQuery. There is a trade-off here, as BigQuery must have the columns defined before the row is written, unlike the VARIANT-style model in Snowflake where you can sink arbitrary JSON. The closest you can get to this in BigQuery is either to define the columns upfront or, if you do not know the columns upfront, to stringify the row and use JSON manipulation functions at query time to extract the data.
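For example, querying a stringified context might look roughly like this. It is a minimal sketch assuming a custom loader that wrote the context as a plain JSON string: the performance_timing_json column and events_raw table are hypothetical names, not something the standard loader produces.

```sql
-- Extract fields at query time from a JSON string column.
-- Adding a new field to the JSON later does not break this query;
-- downstream consumers can start reading it whenever they are ready.
SELECT
  event_id,
  JSON_EXTRACT_SCALAR(performance_timing_json, '$.domComplete')  AS dom_complete,
  JSON_EXTRACT_SCALAR(performance_timing_json, '$.loadEventEnd') AS load_event_end
FROM `my_project.snowplow.events_raw`
```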

Thanks @mike. Do you have any idea when that “BigQuery data model will likely change to something that is more accommodating to additions” change might happen? I am just trying to get a feel for what I should be thinking about. It is a little troubling to think that updating my JavaScript tracker could move contexts_org_w3_performance_timing_1_0_0 to 1_0_1 and break any downstream systems that depend on it (given that a patch change would seem very innocuous).

Would something like https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#setSchemaUpdateOptions-java.util.List- be what you mean by defining the columns upfront? I have no idea how much work it would be to implement something like that instead of the BigQuery loader. Do you? I am not sure what the output of the good PubSub topic looks like.