GCP: Ideal setup


I’m wondering about the setup of Snowplow on GCP and hope you could possibly help me. For example, what are the recommended sizes for the Compute Engine instances?

  • Collector?
  • Beam Enrich?
  • BigQuery Loader? Mutator?

We do have around 100 million hits per month.

Is it possible to run Enrich and BQ Loader on one compute engine or do I need separate ones for every job?

What is the job of the BQ forwarder?

Is it possible to use Cloud Storage and BigQuery in parallel or does the BigQuery Loader replace the Storage loading?

Also does the web model exist for BigQuery somewhere?

Thanks for your help!


Collector sizing depends not just on your volume but also on the number of bytes you are sending with each request. You want the collector to autoscale, but at this volume you are probably fine with a few n1-standard-1s or fewer n1-standard-2s.
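To put the 100 million hits per month into per-second terms, here is a rough back-of-the-envelope sketch. The 30-day month and the peak factor of 5 are assumptions for illustration only, not Snowplow recommendations; real traffic is rarely flat, so measure your own peaks.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_events_per_second(events_per_month: int, days_in_month: int = 30) -> float:
    """Average request rate if traffic were spread evenly over the month."""
    return events_per_month / (days_in_month * SECONDS_PER_DAY)


def peak_events_per_second(events_per_month: int,
                           peak_factor: float = 5.0,
                           days_in_month: int = 30) -> float:
    """Average rate scaled by an assumed peak-to-average ratio (hypothetical value)."""
    return average_events_per_second(events_per_month, days_in_month) * peak_factor


if __name__ == "__main__":
    monthly_hits = 100_000_000
    print(f"average: {average_events_per_second(monthly_hits):.1f} events/s")
    print(f"assumed peak: {peak_events_per_second(monthly_hits):.1f} events/s")
```

Under these assumptions the average works out to roughly 39 events per second, which is why a small number of small instances is usually enough for the collector.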

Beam Enrich and the BigQuery Loader both run on Dataflow (which uses Compute Engine under the hood), and Dataflow has a setting to autoscale workers. You’ll want to make sure you cap the number of workers (the maxNumWorkers pipeline option in Dataflow), but in general these jobs are quite efficient in terms of the number of workers required.
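As a minimal sketch of what the autoscaling settings look like, the helper below assembles the relevant flags for a Dataflow job. The flag names are the standard Dataflow pipeline options for the Java SDK; the function name and the default machine type are hypothetical, and you should check each job's own documentation for the full set of arguments it expects.

```python
from typing import List


def dataflow_autoscaling_flags(max_workers: int,
                               machine_type: str = "n1-standard-1") -> List[str]:
    """Build command-line flags that enable autoscaling with a worker cap.

    Hypothetical helper for illustration; it only formats strings and
    does not launch anything.
    """
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    return [
        "--runner=DataflowRunner",
        # Let Dataflow scale workers up and down based on throughput.
        "--autoscalingAlgorithm=THROUGHPUT_BASED",
        # Upper bound on workers, so costs stay predictable.
        f"--maxNumWorkers={max_workers}",
        f"--workerMachineType={machine_type}",
    ]


if __name__ == "__main__":
    print(" ".join(dataflow_autoscaling_flags(max_workers=3)))
```

Each job (Beam Enrich and the BigQuery Loader) would get its own set of flags, so the caps can differ between the two.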

You should run these as two separate Dataflow jobs. Each one will have its own compute under the hood, which Dataflow will manage for you.

The forwarder’s job is to forward failed inserts. These are most commonly caused by table mutations, where an event has columns that do not exist yet in the destination table.

There’s nothing to stop you using Cloud Storage and BigQuery in parallel if required, particularly if you are doing batch inserts into BigQuery rather than streaming inserts. For streaming inserts from Pub/Sub to BigQuery you don’t really need to persist the events to Cloud Storage, though you can if required.

As far as I’m aware, the web model does not exist for BigQuery yet.

Wow, thank you for the elaborate answers!