Modelling Snowplow data in Google BigQuery


#1

In this post I’ll outline three possible models of modelling Snowplow data in BigQuery - along with the pros and cons associated with each approach. This guide is by no means comprehensive so feel free to jump in with corrections, recommendations or queries.

Considerations

  1. Ease of use
  2. Performance
  3. Pricing
  4. Features

Ease of use

The data that is inserted into BigQuery should be easy to model and reason about. This means avoiding having to write complex syntax or user defined functions to perform common tasks in BigQuery.

Performance

Data should be modelled in a way that takes advantage of the way that BigQuery (and the underlying filesystem Colossus) distributes and partitions data as well as the query engine (Dremel).

Pricing

BigQuery offers a fundamentally different pricing model which is a summation of total size of uncompressed data in BigQuery and bytes processed (uncompressed) in queries per month. Additional costs may be incurred for streaming workloads. Although a fixed pricing model exists for BigQuery this is unlikely to be suitable for a large number of Snowplow users (fixed pricing starts from $40k/month).

Features

While BigQuery offers some familiar database concepts such as partitioning tables and decoupled compute/storage requirements other features such as nested/repeated structures are less common. Where possible Snowplow data should exploit the features unique to BigQuery to improve flexibility, performance and cost.

Approaches

1. “Naive” approach

2. Shredded table approach

3. Shredded column approach

1. “Naive” approach

This approach is the least effort to implement from a technical cost in terms of sinking data into Redshift, however it moves that cost downstream to analysis time.

The idea of this approach is to convert the TSV+JSON mixed format into a JSON document which is inserted directly into BigQuery. This does not entail any shredding so the contexts, derived_contexts and unstruct_event fields will contain nested data.

How are schema changes handled?

Any new schemas or modifications to schemas would be required to patch the existing columns.

Pros:

  • This approach involves little effort in implementing as data could be inserted from PubSub with minimal transformation
  • All data is contained within a single table reducing the requirement for joins between tables

Cons:

  • Deeply nested data (particularly arrays) become difficult to query easily.
  • Columns contain nested data require reasonably complex defined schemas to account for possible self-describing JSON events that may occur in these columns
  • Selecting a small amount of data from a larger JSON object is more economically costly due to the BigQuery pricing being based on bytes processed

2. Shredded table approach

The shredded table approach closely mimics the current functionality when sinking Snowplow data into Redshift. One wide table is used (events) along with a number of additional tables for each context, 1 per schema per model version. Joins occur on a timestamp (collector_tstamp/root_tstamp) and an event identifier (event_id/root_id).

How are schema changes handled?

Schema changes would be largely identical to the current process with Redshift.

  • New schemas require creating a new table
  • Additions require adding columns to the shredded table
  • Revisions require patching existing columns in the shredded table (or possibly adding new columns)
  • Models require the creation of a new model table (e.g., _1, _2)

Pros:

  • This approach would maintain close query compatibility to the current Redshift shredding model
  • Queries and workloads that currently run on Redshift could be made to run on BigQuery with minimal intervention
  • There is a marginal increase in bytes processed and storage as the join keys (collector_tstamp & event_id) are stored in each table

Cons:

  • More complicated queries that join shredded tables may require additional syntax if shredded tables are also partitioned by day
  • Joins are likely to be less performance than colocalisation shredded data in the same location (note: this requires benchmarking)

3. Shredded column approach

The shredded column approach is to retain a single table (events) that contains one column per self-describing JSON per model. If a schema had two model versions in approach 2 this would involve two tables example_1 and example_2 - in the current approach this would instead be two columns example_1 and example_2 in the events table.

How are schema changed handled?

  • New schemas required patching an existing table by adding a new column
  • Additions require patching an existing column
  • Revisions require patching an existing column
  • Models require patching the table to create an additional new column per model (e.g., _1, _2).

Pros:

  • Data is colocalised in a single table for speed of access
  • No duplication of data
  • No joins between tables required

Cons:

  • In certain queries more bytes may be processed as data is nested when compared to approach 2

Porting Snowplow to Google Cloud Platform
#2

Hi Mike,

thanks for putting this up, it’s quite interesting reasoning. Right now the setup we have runs as similar to your 3rd description having “context columns”. Though as we haven’t had a need to do model patches it was always updates on existing columns. I would actually even go a step further and transform the table in favor of having to support 2 versions of the same column context as it just complicates the analysis part later as you need to remember to explicitly include all versions of changed context model columns.

The difference with 1st and 3rd approach would be that you have more levels of nesting if you try to load just raw data straight to bq?


#3

Hi Mike,

Thanks for a detail description.

From my understanding, Option 1 will contain JSON as String in a column because as you described there is no shredding and you just “dump” the data into BigQuery.

Option 3: similar to Option 1 but you shred the JSON string into nested record. “one column per self-describing JSON per model.”

Am I correct?
Thanks.

Linh


#4

Hi @linh - yes that’s correct. There will be a few columns that contain strings that contain multi-level nested JSON for Option 1, whereas Option 3 shreds out each individual context into it’s own column.

@evaldas - not too sure what you’re describing ‘in favour of having to support 2 versions…’? Option 3 would result in similarly named columns for different models but I think this is a desirable behaviour as two models are unlikely to contain backwards compatible data.


#5

@mike Sure if that’s the case, having a new column with different schema make sense. I just didn’t seen in practice that model would change to a point that it wouldn’t be possible to transform it with sql in one shape or another, like changing a type that is not backward compatible could be transformed for both old and new model usually. Or maybe I’m just not familiar with such data modeling practices. In any case good overview never the less :slight_smile: