BigQuery Loader 0.5.0 released

We have released version 0.5.0 of BigQuery Loader, our family of apps that load Snowplow data into BigQuery.

This release focuses on under-the-hood improvements, and we’ve also deprecated the BigQuery Forwarder.

BigQuery Forwarder deprecation

Forwarder was originally the component responsible for retrying failed inserts into BigQuery. In version 0.2.0 we added Repeater as a more efficient, easier-to-debug alternative, and we’ve been recommending it ever since. As of this version, Forwarder is deprecated and will no longer be maintained.

Replacing Dataflow metrics with Dropwizard metrics

In version 0.4.0 we added a custom Dataflow metric to measure the latency between when an event hits a Snowplow collector and when it gets loaded into BigQuery. The metric was registered as a Distribution, and exposed useful summary statistics, such as MIN, MAX and MEAN values. However, we realised that these metrics (exposed in Stackdriver) applied to the entire lifetime of the Dataflow job. This meant they couldn’t be used to answer questions like: “What was the maximum latency in the last 10 minutes?” or “How has the average latency over the past minute changed compared with the minute before that?”
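To make the limitation concrete, here is a minimal Python sketch of a summary that accumulates over the whole lifetime of a job. It is an illustration only, not the loader's code or the Dataflow API: the `LifetimeDistribution` class and the latency values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LifetimeDistribution:
    """Running summary over every value seen since the job started (illustrative only)."""

    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def update(self, latency_ms: float) -> None:
        # Fold one more reading into the lifetime summary; nothing is ever forgotten.
        self.count += 1
        self.total += latency_ms
        self.minimum = min(self.minimum, latency_ms)
        self.maximum = max(self.maximum, latency_ms)

    @property
    def mean(self) -> float:
        return self.total / self.count


dist = LifetimeDistribution()

# Hypothetical latencies: one spike early in the job, then an hour of healthy readings.
for latency_ms in [900_000.0] + [2_000.0] * 3600:
    dist.update(latency_ms)

lifetime_max = dist.maximum  # still the early spike, however long ago it happened
lifetime_mean = dist.mean    # still pulled above the recent 2,000 ms readings
```

Because the summary never resets, the maximum keeps reporting the early spike and says nothing about the last 10 minutes in particular.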

We’ve therefore decided to switch to using Dropwizard Metrics instead. We evaluated different approaches, with a focus on ensuring the metrics can be easily exposed in Stackdriver, without having to worry about additional consumers or connectors. In the end, we chose to go with a Gauge type metric, which samples the latency every second. This data is recorded in the job logs and can be inspected in GCP’s Logs Viewer. The Logging interface also allows you to create a custom logs-based metric, which is then viewable in Stackdriver Monitoring.

Over a period of one minute, you’ll get a sample of 60 latency readings, which should be enough to produce accurate summary statistics.
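As a rough sketch of how per-second readings turn into windowed statistics, the Python below groups samples into one-minute buckets and summarises each bucket. It assumes the readings have already been extracted from the logs as `(epoch_second, latency_ms)` pairs; the `summarise_by_minute` helper and the sample values are hypothetical, and in practice a logs-based metric in Stackdriver would do this aggregation for you.

```python
from collections import defaultdict
from statistics import mean


def summarise_by_minute(samples):
    """Group (epoch_second, latency_ms) gauge samples into one-minute windows.

    Returns {minute_start_epoch_second: {"count", "min", "max", "mean"}}.
    """
    windows = defaultdict(list)
    for epoch_second, latency_ms in samples:
        # Round the timestamp down to the start of its minute.
        windows[epoch_second - epoch_second % 60].append(latency_ms)
    return {
        start: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for start, values in sorted(windows.items())
    }


# Hypothetical readings: one per second for two minutes, with a slower second minute.
samples = [(t, 2_000.0) for t in range(0, 60)] + [(t, 5_000.0) for t in range(60, 120)]
summary = summarise_by_minute(samples)
```

Each window is summarised independently, so comparing `summary[60]["mean"]` with `summary[0]["mean"]` answers the "this minute versus the previous minute" question that a lifetime-wide summary could not.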

Other improvements

Version 0.5.0 also comes with improved Repeater logs and better caching when validating events against their schema. For the full details, please check out the release notes.

Upgrading

No changes in configuration are required to upgrade from version 0.4.2.

If you are upgrading from version 0.2.0 or lower, please refer to the upgrade notes in the 0.2.0 release post for some potential pitfalls.

I know there was a lot of work that went into the new functionality for metrics in this release based on some of the limitations in Stackdriver and Scio. Nice work @dilyan and team!
