Snowplow BigQuery Loader 1.0.0 released

We’re very excited to have released version 1.0.0 of Snowplow BigQuery Loader, our family of apps that load Snowplow data into BigQuery.

The highlight of this release is the StreamLoader app, which has shed its experimental status and can now be deployed in anger. Its performance is significantly better than in the earlier version, and it can now more than hold its own against the Dataflow-based Loader.

If you’re new to Snowplow and want to understand what the different apps do, the documentation pages are a good place to start.

New configuration format

This release brings a breaking change to the configuration format for all applications. Rather than taking their configuration as a self-describing JSON, the apps now expect a HOCON file.

See the setup guide and upgrade guide for more information.

New load_tstamp column

We’ve added a much-requested change by introducing a load_tstamp field to all events loaded into BigQuery. This timestamp represents the time when the data arrived in the warehouse and can be used for incremental processing of new data in data modeling.
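As a rough illustration of the incremental processing mentioned above, the sketch below builds a query that selects only rows loaded after a stored watermark and then advances that watermark. The table name and both helper functions are hypothetical and not part of the loader; `@last_load_tstamp` is a BigQuery named query parameter that you would bind when running the query with your client of choice.

```python
from datetime import datetime, timezone

# Hypothetical table name, for illustration only.
EVENTS_TABLE = "my-project.snowplow.events"


def incremental_query(table: str = EVENTS_TABLE) -> str:
    """Build a query that selects only rows loaded after a given watermark."""
    return (
        f"SELECT * FROM `{table}` "
        "WHERE load_tstamp > @last_load_tstamp "
        "ORDER BY load_tstamp"
    )


def next_watermark(rows, previous: datetime) -> datetime:
    """Return the newest load_tstamp seen, ignoring rows where it is null."""
    seen = [r["load_tstamp"] for r in rows if r.get("load_tstamp") is not None]
    # If no row carries a load_tstamp, keep the previous watermark.
    return max(seen, default=previous)
```

Note that rows with a null load_tstamp (for example, rows loaded by an older version) never satisfy the `>` comparison, so a model built this way would need to handle them separately.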

This change is backwards compatible. If you downgrade back to 0.6.4, the load_tstamp column will remain in your table but any data loaded will have a null value for it.

The new column is created by Mutator automatically on startup. It can occasionally take some time for it to become visible to all workers trying to write to the table. For this reason, we recommend that you upgrade Mutator first, before you upgrade the loader app (regardless of whether you’re using Loader or StreamLoader).

Mutator can now create partitioned tables

You can now use Mutator’s create command to set up partitioned BigQuery tables by specifying a partition column and optionally enforcing a partition filter on all queries.

See the Mutator documentation for details.

Handling of unallowed characters in BigQuery column names

We’ve integrated a change from our schema-ddl library, which improved handling of invalid field names, and in particular fields that start with a numeric character.

This fixes a known problem when trying to load an openweathermap event.

Now, if your schema contains a field called 1h then it will be loaded with the name _1h whereas previously it would not be loaded at all.
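The renaming described above can be pictured with a minimal sketch. This is not the schema-ddl implementation, which handles other kinds of invalid names too; the function name is hypothetical and it covers only the leading-digit case from the example.

```python
import re


def sanitize_field_name(name: str) -> str:
    """Illustrative rule only: prefix names that start with a digit with '_'."""
    if re.match(r"[0-9]", name):
        return "_" + name
    return name
```

With this rule, `sanitize_field_name("1h")` gives `_1h`, while a name such as `rain` is left unchanged.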

Metrics

StreamLoader and Repeater emit metrics using the StatsD protocol. The available metrics are:

  • number of events loaded into BigQuery by StreamLoader (good)
  • number of failed events (bad)
  • number of failed inserts (failed_inserts)
  • number of events that Repeater could not ultimately load into BigQuery (uninsertable)
  • max time elapsed between collector_tstamp and now(), measured when StreamLoader receives a response to its insert request from BigQuery (latency).
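To make the latency definition concrete, here is a small sketch of how such a value could be computed for one batch. The function name, the use of seconds as the unit and the handling of an empty batch are assumptions made for illustration; they are not taken from the StreamLoader code.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def batch_latency_seconds(
    collector_tstamps: Iterable[datetime],
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Max time elapsed between each collector_tstamp and 'now', in seconds."""
    stamps = list(collector_tstamps)
    if not stamps:
        # Nothing was inserted, so there is no latency to report.
        return None
    now = now or datetime.now(timezone.utc)
    return max((now - ts).total_seconds() for ts in stamps)
```

The oldest event in the batch determines the result, which is why the metric is a maximum rather than an average.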

To see what these look like, you can start StreamLoader or Repeater locally with a monitoring section in the config along the lines of:

"monitoring": {
  "statsd": {
    "hostname": "localhost"
    "port": 1024
    "tags": {}
    "period": "5 sec"
    "prefix": "snowplow.monitoring"
  }
}

In a separate tab, run Netcat to listen to UDP traffic:

$ nc -z -v -u localhost 1024   # connect to port
$ nc -l -u 1024                # listen on port
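As an alternative to Netcat, a few lines of Python can listen on the same UDP port and split each metric into its parts. This sketch assumes the datagrams follow the common StatsD line format `name:value|type`, optionally followed by a `|#key:value,...` tag section; the exact lines emitted by StreamLoader and Repeater may differ, and both function names are hypothetical.

```python
import socket


def parse_statsd_line(line: str) -> dict:
    """Split 'name:value|type[|#key:val,...]' into its parts."""
    name, _, rest = line.strip().partition(":")
    parts = rest.split("|")
    value = float(parts[0])
    metric_type = parts[1] if len(parts) > 1 else ""
    tags = {}
    for part in parts[2:]:
        # Assumed tag extension: a section starting with '#'.
        if part.startswith("#"):
            for tag in part[1:].split(","):
                if tag:
                    key, _, val = tag.partition(":")
                    tags[key] = val
    return {"name": name, "value": value, "type": metric_type, "tags": tags}


def listen(host: str = "localhost", port: int = 1024) -> None:
    """Print every metric received on the given UDP port until interrupted."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    try:
        while True:
            data, _addr = sock.recvfrom(65535)
            # One datagram may carry several newline-separated metrics.
            for line in data.decode("utf-8").splitlines():
                if line.strip():
                    print(parse_statsd_line(line))
    finally:
        sock.close()
```

Calling `listen()` with the host and port from the config above blocks and prints one dictionary per metric; stop it with Ctrl+C.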

Other improvements and changes

Alongside small bugfixes and dependency bumps, we’ve also now started publishing arm64 and amd64 docker images.

Forwarder, which was deprecated in 0.5.0, has now been completely removed.

For the full list of changes and jar files, see the release notes.

Thanks

Many thanks to Alex Fainshtein for contributing to this release.
