BigQuery Loader 0.2.0 released

We have released version 0.2.0 of BigQuery Loader, our family of apps that load Snowplow data into BigQuery.

This release brings two key additions and an important bugfix.

Repeater can now be deployed instead of Forwarder

Forwarder is the tool in the Snowplow BigQuery Loader app family that has up till now been the only option for retrying failed inserts. (For more on how mutation lag can lead to failed inserts, check out the documentation.)

Forwarder is a Google Cloud Dataflow job, which makes it well suited for processing large amounts of data. However, it has several drawbacks:

  • It can idle for 99.9% of the time, which can make it very expensive to run. The alternative is to manually launch it any time failed inserts appear.
  • There’s no way to tell Forwarder that it should take a pause before inserting rows back. Without the pause there’s a risk that Mutator doesn’t get a chance to alter the table.
  • It keeps retrying all inserts indefinitely (default behaviour for streaming Dataflow jobs).
  • In order to debug a problem with Forwarder, you need to inspect Stackdriver logs.

From 0.2.0 we’re adding a new component that can be used instead of Forwarder, called Repeater.

Repeater is a JVM app, which offers several advantages over Forwarder:

  • It pauses by default to allow Mutator to do its job.
  • It sends rows that repeatedly fail insertion to a dead-end bucket instead of retrying them forever.
  • It can be more easily debugged by inspecting the contents of the dead-end bucket, which are all valid Snowplow bad rows. (For more on bad rows, see next section.)

For more information on how to set up Repeater, consult the setup guide.

New bad row format integration

In Snowplow R118 Morgantina, our first ever beta release, we introduced a new format for “bad rows” in the Scala Stream Collector and in Enrich jobs. Version 0.2.0 now brings the new format to the BigQuery Loader family of tools as well.

(For more details on the new bad row format, check out the RFC and the R118 release post.)

Fixing bug in Schema DDL library leads to new behaviour in Loader, Mutator

This release includes a number of dependency bumps, of which the upgrade of the Schema DDL library to 0.9.0 is particularly important.

Schema DDL is a library from the Snowplow ecosystem which exposes a set of Abstract Syntax Trees and generators for producing various DDL and Schema formats. Version 0.9.0 fixes a bug that affected the creation of BigQuery table DDLs in cases where one of the fields in the schema was a nullable array, ie a property defined as having:

"type": ["array", "null"]

In older versions, Loader would have cast those fields to STRING and Mutator would have created columns for them of type NULLABLE STRING rather than REPEATED RECORD, which is what we want for arrays.

Upgrading

This bug is fixed in the latest versions of the two components. However, if you already have nullable array-typed fields in your schemas, some incompatibility might have been introduced.

It is possible that an older version of Loader has cast those fields to STRING and that Mutator has created NULLABLE STRING columns for them. After upgrading to 0.2.0, Loader will no longer cast the value in those fields to STRING and so they will not be able to be inserted in the existing columns for them.

There are two ways this can be handled:

  • by introducing a new schema version that gets rid of the [array, null] type;

  • by migrating all the data in the BigQuery table to a new table, with a schema that fixes the “stringified” column.

Introducing a new schema

You can upgrade the affected schemas to a new version, without really changing anything in the schemas. The new version of Loader will not cast these values to STRING. Because the schemas have new versions, Mutator will create new columns for them and they will have the desired type of REPEATED RECORD.

Migrating the data

Both the type and the mode of the affected columns needs to be changed (so we go from NULLABLE STRING to REPEATED RECORD).

Changing the type of column is not currently supported in BigQuery. To do it manually, you can:

  • use a SQL query that casts the data to the desired type and use the output of the query to create a new table (but this won’t work for changing the mode, see below);

  • unload the data from the table to GCS and use it to create a new BigQuery table with the desired proper schema.

Changing the mode of a column is currently only supported for going from REQUIRED to NULLABLE. Any other changes can only be done by unloading the data to GCS and then loading it into a new table with the desired schema.

(For the full details, refer to the GCP documentation: https://cloud.google.com/bigquery/docs/manually-changing-schemas.)

3 Likes

Hi @dilyan ,

what is the best way to go about checking the schema regarding the Nullable String problem?

Is any of the Snowplow standard enrichments affected from this? I know my own iglu schemas don’t use [array, null]

Thank you!

@volderette ~if you’re not using [array, null] in any of your custom schemas, you are safe.~

~If you want to verify it for piece of mind. you can take a look at you BQ loader’s failed inserts to investigate if any are down to the [array,null] problem, but it can only occur if you have defined a field as such, so I wouldn’t expect any issues.~

Edit: My bad, I overlooked the few Iglu schemas that do indeed have this form. For posterity - as long as you’ve not yet been using the below schemas (the focus_form one is to do with javascript tracker’s form tracking), and are not using [array, null] in a custom schema, you shouldn’t be affected.

@volderette To add to what Colm said, there are no standard enrichments that use a schema with the nullable array type.

There are a few schemas on Iglu Central that use it:

vendor schema name schema version
com.snowplowanalytics.snowplow focus_form 1-0-0
com.getvero updated 1-0-0
com.google.cloud.gce instance_metadata 1-0-0
com.amazon.aws.ec2 instance_identity_document 1-0-0
2 Likes

Also, here’s a quick Python script to check if any of your schemas are using it:

import json
import os

files = []

path = "path/to/local/dir/with/schemas"

for (root, dirnames, filenames) in os.walk(path):
    for filename in filenames:
        files.append(os.path.join(root, filename))

for filename in files:
    with open(filename, 'r') as file:
        data = file.read()
        parsed = json.loads(data)

        def recursive_check(schema, filename):
            for k, v in schema.items():
                if k == "type":
                    if isinstance(v, list):
                        if all(x in v for x in ["array", "null"]):
                            print("Found it in " + filename)
                    if isinstance(v, dict):
                        recursive_check(v, filename)
                if k != "type":
                    if isinstance(v, list):
                        for i in range(len(v)):
                            if isinstance (v[i], dict):
                                recursive_check(v[i], filename)
                    if isinstance(v, dict):
                        recursive_check(v, filename)

        recursive_check(parsed, filename)
2 Likes

Thank you so much @dilyan @Colm! I really appreciate the support you are providing! I could upgrade with no problems!