[RFC] BigQuery Loader (Google Cloud Dataflow version) deprecation

As mentioned on the BigQuery Loader documentation page, Snowplow currently maintains two Docker images of the loader:

  1. Snowplow BigQuery StreamLoader, a standalone Scala app that can be deployed on Google Kubernetes Engine.
  2. Snowplow BigQuery Loader, an alternative to StreamLoader, in the form of a Google Cloud Dataflow job.

We have decided that over the coming months we will deprecate the Dataflow (Beam) version of the Snowplow BigQuery Loader (number 2 on the list above).

From the start, the BigQuery Loader application was designed to load large amounts of enriched Snowplow data into BigQuery. We have since optimized this approach by releasing the BigQuery StreamLoader.

The StreamLoader was released with the aim of fully replacing the Beam BigQuery Loader. The Snowplow pipeline has always been designed to handle massive amounts of data, and we are now making changes to optimize it for the future. Our blog post announcing our strategy to move away from big data frameworks in favor of functional streams can be found below.

Snowplow BigQuery StreamLoader has already been rolled out successfully, but before we deprecate the older loader's Docker image, we would like to make sure that we are not missing anything. If you are still using the Dataflow loader and cannot migrate to StreamLoader, please let us know.

The Dataflow loader will be deprecated (removed from the codebase) on 7 October 2022.
