About BigQuery startup script

Hi,

I have a question about the BigQuery Mutator, Loader, and Repeater. Below is my startup script for running the three services on a single VM. Is it good practice to put them together like this, or should I split them into separate VMs?

#! /bin/bash
bq_version=latest
temp_dir=snowplow-test/temp
project_id=snowplow-tes
region=us-west2
dataflow_type=n1-standard-4

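# Install Docker and pull the Snowplow BigQuery images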
sudo apt-get update
sudo apt-get -y install docker.io
sudo docker pull snowplow/snowplow-bigquery-mutator:$bq_version
sudo docker pull snowplow/snowplow-bigquery-loader:$bq_version
sudo docker pull snowplow/snowplow-bigquery-repeater:$bq_version

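# Fetch the Iglu resolver, loader config and service-account credentials from GCS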
sudo mkdir -p snowplow/config
sudo gsutil cp gs://$temp_dir/iglu_resolver.json ./snowplow/config/
sudo gsutil cp gs://$temp_dir/bigquery_config.hocon ./snowplow/config/
sudo gsutil cp gs://$temp_dir/temp-files/credentials.json ./snowplow/config/

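# One-off Mutator run: create the events table in BigQuery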
sudo docker run \
  -v $PWD/snowplow/config:/snowplow/config \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/credentials.json \
  snowplow/snowplow-bigquery-mutator:$bq_version \
  create \
  --config=$(cat $PWD/snowplow/config/bigquery_config.hocon | base64 -w 0) \
  --resolver=$(cat $PWD/snowplow/config/iglu_resolver.json | base64 -w 0)

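# Mutator in listen mode: picks up new event/entity types and adds the corresponding columns (backgrounded with &)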
sudo docker run \
  -v $PWD/snowplow/config:/snowplow/config \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/credentials.json \
  snowplow/snowplow-bigquery-mutator:$bq_version \
  listen \
  --config=$(cat $PWD/snowplow/config/bigquery_config.hocon | base64 -w 0) \
  --resolver=$(cat $PWD/snowplow/config/iglu_resolver.json | base64 -w 0) &

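# Loader: submits a Dataflow job that streams enriched events into BigQuery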
sudo docker run \
  -v $PWD/snowplow/config:/snowplow/config \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/credentials.json \
  snowplow/snowplow-bigquery-loader:$bq_version \
  --config=$(cat $PWD/snowplow/config/bigquery_config.hocon | base64 -w 0) \
  --resolver=$(cat $PWD/snowplow/config/iglu_resolver.json | base64 -w 0) \
  --runner=DataflowRunner \
  --project=$project_id \
  --region=$region \
  --gcpTempLocation=gs://$temp_dir/temp-files \
  --maxNumWorkers=3 \
  --workerMachineType=$dataflow_type

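# Repeater: re-inserts events that initially failed to load into BigQuery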
sudo docker run \
  -v $PWD/snowplow/config:/snowplow/config \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/credentials.json \
  snowplow/snowplow-bigquery-repeater:$bq_version \
  --config=$(cat $PWD/snowplow/config/bigquery_config.hocon | base64 -w 0) \
  --resolver=$(cat $PWD/snowplow/config/iglu_resolver.json | base64 -w 0) \
  --bufferSize=20 \
  --timeout=20 \
  --backoffPeriod=900 \
  --verbose

Another question: for the Repeater, do I need to add ‘&’ at the end as well?

Best practice is to isolate each service on its own VM; that way a failure in a single VM should only impact one service rather than all three (or you could run these on Kubernetes if that’s what you want).
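For example, something along these lines could work (just a sketch - the instance names, zone, machine type and startup script filenames are placeholders you’d swap for your own):

# Hypothetical: one small instance per service, each with its own startup script
for svc in mutator loader repeater; do
  gcloud compute instances create "bq-${svc}" \
    --zone=us-west2-a \
    --machine-type=e2-small \
    --metadata-from-file startup-script="startup-${svc}.sh"
done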

No, you don’t really need this unless you want to background the process - it can be helpful when running interactively if you are doing other things in the same shell, but it isn’t really needed for an actual deployment.
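If you do want it running unattended on the VM, one option (again, just a sketch reusing the variables from your script) is to let Docker detach and restart the container instead of backgrounding it with ‘&’:

# Run the Repeater detached so Docker supervises it rather than the shell
sudo docker run -d \
  --restart unless-stopped \
  -v $PWD/snowplow/config:/snowplow/config \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/credentials.json \
  snowplow/snowplow-bigquery-repeater:$bq_version \
  --config=$(cat $PWD/snowplow/config/bigquery_config.hocon | base64 -w 0) \
  --resolver=$(cat $PWD/snowplow/config/iglu_resolver.json | base64 -w 0) \
  --bufferSize=20 --timeout=20 --backoffPeriod=900 --verbose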

Hi Mike, thanks for your advice! I will try separating them into different VMs. I have another question: for these three services, do I need to set up autoscaling for all of them?