We have released version 0.2.0 of BigQuery Loader, our family of apps that load Snowplow data into BigQuery.
This release brings two key additions and an important bugfix.
Repeater can now be deployed instead of Forwarder
Forwarder is the tool in the Snowplow BigQuery Loader app family that has, until now, been the only option for retrying failed inserts. (For more on how mutation lag can lead to failed inserts, check out the documentation.)
Forwarder is a Google Cloud Dataflow job, which makes it well suited for processing large amounts of data. However, it has several drawbacks:
- It can idle for 99.9% of the time, which can make it very expensive to run. The alternative is to manually launch it any time failed inserts appear.
- There’s no way to tell Forwarder to pause before re-inserting rows. Without the pause there’s a risk that Mutator doesn’t get a chance to alter the table.
- It keeps retrying all inserts indefinitely (default behaviour for streaming Dataflow jobs).
- In order to debug a problem with Forwarder, you need to inspect Stackdriver logs.
From 0.2.0 we’re adding a new component that can be used instead of Forwarder, called Repeater.
Repeater is a JVM app, which offers several advantages over Forwarder:
- It pauses by default to allow Mutator to do its job.
- It sends rows that repeatedly fail insertion to a dead-end bucket instead of retrying them forever.
- It can be more easily debugged by inspecting the contents of the dead-end bucket, which are all valid Snowplow bad rows. (For more on bad rows, see next section.)
For more information on how to set up Repeater, consult the setup guide.
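To make the contrast with Forwarder concrete, here is a minimal sketch of the behaviour described above: pause, retry a bounded number of times, then give up and write the row to a dead-end store. It is not Repeater's actual code. The function name, the `insert_row` callback, the `dead_end` list (a stand-in for the bucket) and the default values are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List

# Hypothetical defaults: the real Repeater's settings are not stated in this post.
PAUSE_SECONDS = 0.0
MAX_ATTEMPTS = 3


def repeat_failed_inserts(
    failed_rows: List[Dict],
    insert_row: Callable[[Dict], bool],
    dead_end: List[Dict],
    pause_seconds: float = PAUSE_SECONDS,
    max_attempts: int = MAX_ATTEMPTS,
) -> int:
    """Retry each failed row a bounded number of times.

    Returns the number of rows that were inserted. Rows that still fail after
    max_attempts are appended to dead_end instead of being retried forever.
    """
    inserted = 0
    for row in failed_rows:
        for _attempt in range(max_attempts):
            # Pause before every attempt so the table can be altered first.
            time.sleep(pause_seconds)
            if insert_row(row):
                inserted += 1
                break
        else:
            # No attempt succeeded: stop retrying and keep the row for inspection.
            dead_end.append(row)
    return inserted
```

In this sketch `insert_row` returns `True` on success. A caller would pass a function that wraps the real insert call and then inspect `dead_end` to see which rows never made it into the table.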
New bad row format integration
In Snowplow R118 Morgantina, our first ever beta release, we introduced a new format for “bad rows” in the Scala Stream Collector and in Enrich jobs. Version 0.2.0 now brings the new format to the BigQuery Loader family of tools as well.
Fixing bug in Schema DDL library leads to new behaviour in Loader, Mutator
This release includes a number of dependency bumps, of which the upgrade of the Schema DDL library to 0.9.0 is particularly important.
Schema DDL is a library from the Snowplow ecosystem which exposes a set of Abstract Syntax Trees and generators for producing various DDL and Schema formats. Version 0.9.0 fixes a bug that affected the creation of BigQuery table DDLs in cases where one of the fields in the schema was a nullable array, i.e. a property defined as having:
"type": ["array", "null"]
In older versions, Loader would have cast those fields to STRING and Mutator would have created columns for them of type NULLABLE STRING rather than REPEATED RECORD, which is what we want for arrays.
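The corrected mapping can be pictured with a small sketch. This is not the Schema DDL implementation. The function name is hypothetical, the item type of the array is ignored, and the fallback branch exists only to keep the example short.

```python
from typing import Dict, Tuple


def bigquery_column_for(prop: Dict) -> Tuple[str, str]:
    """Map a JSON Schema property to a (mode, type) pair, very roughly."""
    declared = prop.get("type", [])
    # "type" may be a single string or a list such as ["array", "null"].
    types = {declared} if isinstance(declared, str) else set(declared)
    concrete = types - {"null"}

    if concrete == {"array"}:
        # An array is an array whether or not "null" is also allowed.
        return ("REPEATED", "RECORD")
    # Fallback used in this sketch only: everything else becomes a string.
    return ("NULLABLE", "STRING")
```

The point of the sketch is the `types - {"null"}` step: a nullable array and a plain array end up with the same column definition.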
This bug is fixed in the latest versions of the two components. However, if you already have nullable array-typed fields in your schemas, some incompatibility might have been introduced.
It is possible that an older version of Loader has cast those fields to STRING and that Mutator has created NULLABLE STRING columns for them. After upgrading to 0.2.0, Loader will no longer cast the values in those fields to STRING, so they can no longer be inserted into the existing columns.
There are two ways this can be handled:
- by introducing a new schema version that gets rid of the
- by migrating all the data in the BigQuery table to a new table, with a schema that fixes the “stringified” column.
Introducing a new schema
You can upgrade the affected schemas to a new version, without really changing anything in the schemas. The new version of Loader will not cast these values to STRING. Because the schemas have new versions, Mutator will create new columns for them and they will have the desired type of REPEATED RECORD.
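As an illustration of a version bump that leaves the schema body untouched, the sketch below copies a self-describing schema and increments the last component of its version. The `com.acme` schema and the function name are hypothetical, and choosing the last (ADDITION) component is an assumption. The post only says that a new version is needed.

```python
import copy
from typing import Dict


def bump_addition(schema: Dict) -> Dict:
    """Return a copy of the schema with the last version component incremented."""
    model, revision, addition = schema["self"]["version"].split("-")
    bumped = copy.deepcopy(schema)
    bumped["self"]["version"] = f"{model}-{revision}-{int(addition) + 1}"
    return bumped


# Hypothetical schema with a nullable array property.
example_schema = {
    "self": {
        "vendor": "com.acme",
        "name": "basket",
        "format": "jsonschema",
        "version": "1-0-0",
    },
    "type": "object",
    "properties": {"items": {"type": ["array", "null"]}},
}

new_schema = bump_addition(example_schema)
```

Here `new_schema` differs from `example_schema` only in its `version` field, which is what makes it a distinct schema version while the properties stay the same.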
Migrating the data
Both the type and the mode of the affected columns need to be changed (so we go from NULLABLE STRING to REPEATED RECORD).
Changing the type of a column is not currently supported in BigQuery. To do it manually, you can:
- use a SQL query that casts the data to the desired type and use the output of the query to create a new table (but this won’t work for changing the mode, see below);
- unload the data from the table to GCS and use it to create a new BigQuery table with the desired schema.
Changing the mode of a column is currently only supported for going from REQUIRED to NULLABLE. Any other changes can only be done by unloading the data to GCS and then loading it into a new table with the desired schema.
(For the full details, refer to the GCP documentation: https://cloud.google.com/bigquery/docs/manually-changing-schemas.)
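If you take the unload-and-reload route, the exported rows have to be rewritten somewhere between the two steps so that the stringified column holds real arrays again. The sketch below shows one way to do that for newline-delimited JSON. It assumes that the old STRING values are JSON-serialised arrays, which this post does not confirm, and the function and column names are hypothetical. The export and load jobs themselves are not shown.

```python
import json
from typing import Iterable, Iterator


def unstringify_column(lines: Iterable[str], column: str) -> Iterator[str]:
    """Rewrite newline-delimited JSON rows so that `column` holds an array."""
    for line in lines:
        row = json.loads(line)
        value = row.get(column)
        if value is None:
            # Missing data becomes an empty array in the rewritten row.
            row[column] = []
        elif isinstance(value, str):
            # Assumption: the string is a JSON-serialised array (or "null").
            parsed = json.loads(value)
            if parsed is None:
                parsed = []
            if not isinstance(parsed, list):
                raise ValueError(f"{column} does not contain an array: {value!r}")
            row[column] = parsed
        yield json.dumps(row)
```

Raising on anything that is not an array is a deliberate choice in this sketch: it is safer to stop a migration than to silently drop values that do not match the assumption.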