Snowplow BigQuery Loader 1.0.0 released

We’re very excited to have released version 1.0.0 of Snowplow BigQuery Loader, our family of apps that load Snowplow data into BigQuery.

The highlight of this release is the StreamLoader app, which has shed its experimental status and can now be deployed in anger. We have significantly improved its performance over the earlier experimental version, and it can now more than hold its own against the Dataflow-based Loader.

If you’re new to Snowplow and want to understand what the different apps do, the documentation pages are a good place to start.

New configuration format

This release brings a breaking change to the configuration format for all applications. Rather than passing it in as a self-describing JSON, the apps now expect a HOCON file.

See the setup guide and upgrade guide for more information.
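
To give a sense of the new format, a configuration might look something like the sketch below. This is only an illustration: the field names and resource names are assumptions on our part and may not match the exact schema, so treat the setup guide as the authoritative reference.

{
  # GCP project hosting the pipeline (illustrative name)
  "projectId": "my-gcp-project"

  "loader": {
    # Pub/Sub subscription with enriched events (illustrative name)
    "input": {
      "subscription": "enriched-sub"
    }

    "output": {
      # Target BigQuery dataset and table (illustrative names)
      "good": {
        "datasetId": "snowplow"
        "tableId": "events"
      }
      # Pub/Sub topic for data that fails to load (illustrative name)
      "bad": {
        "topic": "bad-rows-topic"
      }
    }
  }
}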

New load_tstamp column

We’ve added a much-requested change by introducing a load_tstamp field to all events loaded into BigQuery. This timestamp represents the time when the data arrived in the warehouse and can be used for incremental processing of new data in data modeling.
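
For example, an incremental model could pick up only the events that have arrived since its last run by filtering on load_tstamp, along these lines (the project, dataset and table names here are hypothetical):

-- Load only events that landed in the warehouse since the last run.
INSERT INTO `my-project.derived.page_views` (event_id, collector_tstamp, load_tstamp)
SELECT
  event_id,
  collector_tstamp,
  load_tstamp
FROM `my-project.snowplow.events`
WHERE load_tstamp > (
  SELECT COALESCE(MAX(load_tstamp), TIMESTAMP('1970-01-01'))
  FROM `my-project.derived.page_views`
);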

This change is backwards compatible. If you downgrade back to 0.6.4, the load_tstamp column will remain in your table but any data loaded will have a null value for it.

The new column is created automatically by Mutator on startup. It can occasionally take some time for the column to become visible to all workers trying to write to the table. For this reason, we recommend that you upgrade Mutator first, before you upgrade the loader app (regardless of whether you’re using Loader or StreamLoader).

Mutator can now create partitioned tables

You can now use Mutator’s create command to set up partitioned BigQuery tables by specifying a partition column and optionally enforcing a partition filter on all queries.
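
Mutator creates the table through the BigQuery API, but in DDL terms the result is roughly equivalent to something like the following sketch (the project, dataset and partition column choices are hypothetical, and most atomic columns are omitted):

-- Illustration only: an events table partitioned by load_tstamp,
-- with a partition filter required on every query.
CREATE TABLE `my-project.snowplow.events` (
  event_id STRING,
  collector_tstamp TIMESTAMP,
  load_tstamp TIMESTAMP
)
PARTITION BY DATE(load_tstamp)
OPTIONS (require_partition_filter = TRUE);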

See the Mutator documentation for details.

Handling of disallowed characters in BigQuery column names

We’ve integrated a change from our schema-ddl library that improves the handling of invalid field names, in particular fields that start with a numeric character.

This fixes a known problem when trying to load an openweathermap event.

Now, if your schema contains a field called 1h, it will be loaded with the name _1h, whereas previously it would not be loaded at all.

Metrics

StreamLoader and Repeater emit metrics using the StatsD protocol. The available metrics are:

  • number of events loaded into BigQuery by StreamLoader (good)
  • number of failed events (bad)
  • number of failed inserts (failed_inserts)
  • number of events that Repeater could not ultimately load into BigQuery (uninsertable)
  • max time elapsed between collector_tstamp and now(), measured when StreamLoader receives a response to its insert request from BigQuery (latency).

To see what these look like, you can start StreamLoader or Repeater locally, specifying a monitoring setting in your config along the lines of:

"monitoring": {
    "statsd": {
      "hostname": "localhost"
      "port": 1024
      "tags": {}
      "period": "5 sec"
      "prefix": "snowplow.monitoring"
    }
  }

In a separate tab, run Netcat to listen to UDP traffic:

$ nc -z -v -u localhost 1024  # connect to port
$ nc -l -u 1024               # listen on port
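
With the configuration above, the datagrams arriving on port 1024 follow the standard StatsD line format, so you should see something along these lines (the values and metric types shown here are illustrative, not actual output):

snowplow.monitoring.good:128|c
snowplow.monitoring.failed_inserts:2|c
snowplow.monitoring.latency:4250|g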

Other improvements and changes

Alongside small bugfixes and dependency bumps, we’ve also started publishing arm64 and amd64 Docker images.

Forwarder, which was deprecated in 0.5.0, has now been completely removed.

For the full list of changes and jar files, see the release notes.

Thanks

Many thanks to Alex Fainshtein for contributing to this release.


Hi all,
is there any downside of using the load_tstamp for incremental modeling in BigQuery? As far as I can see, it makes a lot of sense to use it instead of the default collector_tstamp from the SQL runner model, e.g. for simpler reprocessing of failed events. @jrpeck1989 I think you once mentioned something similar during a meeting. Curious about your feedback…

No real downside - it’s definitely the better timestamp to use for incremental models, as both collector_tstamp and derived_tstamp aren’t guaranteed to capture events in the loaded delta.


Thanks mike! Great that my assumption is confirmed… so we’re going to rebuild the model based on load_tstamp in Dataform.

Just to chime in with some extra context here @davidher_mann, the reason the dbt packages don’t use load_tstamp for incremental modelling is because this timestamp is only available on newer loaders, and we can’t guarantee that all users have this to be able to leverage it. We are looking into how we can update the dbt packages in a v1.0.0 manner to make the incremental timestamp customizable, so that you can easily pick it when setting up your dbt package.


And to add even more context, we’re looking at modernising our Dataform package (which we haven’t maintained whilst Dataform’s been on hiatus) to be more like our current dbt packages. This also likely means we’ll bring load_tstamp support to it too.

I don’t have any time frames yet, but it’s on our near-term roadmap.


Thanks Paul and Emil for the additional context. We are happy to work with you to build a new Dataform model. We have already transformed the initial Dataform model a lot, to leverage more Dataform functionality, optimize bot exclusion, improve performance, etc.


We’ll reach out shortly!