Running beam-enrich and bq-loader/mutators on Kubernetes

Hi all,

Currently we have set up the “simple” version of the GCP open source setup using instance templates. What we want to do now is create a Kubernetes setup on which all the Docker images run.

I have successfully run the scala-stream-collector as a Docker image on a Kubernetes cluster on GCP (GKE), but I am struggling with the beam-enrich deployment.

This is my deployment yaml for enrichment:

---
apiVersion: "v1"
kind: "Namespace"
metadata:
  name: "default"
---
apiVersion: "apps/v1"
kind: "Deployment"
metadata:
  name: "beam-enrich-dm"
  namespace: "default"
  labels:
    app: "beam-enrich-dm"
spec:
  progressDeadlineSeconds: 1200
  replicas: 2
  selector:
    matchLabels:
      app: "beam-enrich-dm"
  template:
    metadata:
      labels:
        app: "beam-enrich-dm"
    spec:
      containers:
      - name: "beam-enrich-dm"
        image: "docker.io/snowplow/beam-enrich:1.3.1"
        args: ["--runner", "DataFlowRunner", "--streaming", "true", "--project", "XXX", "--zone", "europe-west1-d", "--gcpTempLocation", "gs://my-bucket/", "--job-name", "beam-enrich", "--raw", "projects/XXX/subscriptions/good-sub", "--enriched", "projects/XXX/topics/enriched-good", "--bad", "projects/XXX/topics/enriched-bad", "--pii", "projects/XXX/topics/pii-topic", "--enrichments", "/snowplow/enrichments/", "--resolver", "/snowplow/resolver/iglu_resolver.json", "--workerMachineType", "n1-standard-1", "--diskSizeGb", "30", "--serviceAccount", "myserviceaccount@..."]
        env:
        volumeMounts:
        - name: enrichments
          mountPath: /snowplow/enrichments
        - name: resolver
          mountPath: /snowplow/resolver
      volumes:
      - name: resolver
        configMap:
          name: resolver
      - name: enrichments
        configMap:
          name: enrichments
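
One thing that may matter here: Beam documents pipeline options in the form --name=value, and the Snowplow beam-enrich examples I have seen pass them that way too, so the space-separated pairs in the args above might not be parsed as intended. A minimal, untested sketch of the same arguments written as single tokens (the values are the same placeholders as above, and serviceAccount is assumed to be meant as --serviceAccount):

```yaml
# Untested sketch: same options as above, each written as one --name=value token.
# Placeholders (XXX, my-bucket, myserviceaccount@...) are kept as in the original.
args:
- "--runner=DataFlowRunner"
- "--streaming=true"
- "--project=XXX"
- "--zone=europe-west1-d"
- "--gcpTempLocation=gs://my-bucket/"
- "--job-name=beam-enrich"
- "--raw=projects/XXX/subscriptions/good-sub"
- "--enriched=projects/XXX/topics/enriched-good"
- "--bad=projects/XXX/topics/enriched-bad"
- "--pii=projects/XXX/topics/pii-topic"
- "--enrichments=/snowplow/enrichments/"
- "--resolver=/snowplow/resolver/iglu_resolver.json"
- "--workerMachineType=n1-standard-1"
- "--diskSizeGb=30"
- "--serviceAccount=myserviceaccount@..."
```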

These are my config maps:

enrichments.yaml:

apiVersion: v1
data:
  anon_ip.json: |
    {
      "schema": "iglu:com.snowplowanalytics.snowplow/anon_ip/jsonschema/1-0-1",
      "data": {
        "name": "anon_ip",
        "vendor": "com.snowplowanalytics.snowplow",
        "enabled": true,
        "parameters": {
          "anonOctets": 2,
          "anonSegments": 1
        }
      }
    }
  pii_pseudo.json: |
    {
      "schema": "iglu:com.snowplowanalytics.snowplow.enrichments\/pii_enrichment_config\/jsonschema\/2-0-0",
      "data": {
        "vendor": "com.snowplowanalytics.snowplow.enrichments",
        "name": "pii_enrichment_config",
        "emitEvent": true,
        "enabled": true,
        "parameters": {
          "pii": [
            {
              "pojo": {
                "field": "user_id"
              }
            }
          ],
          "strategy": {
            "pseudonymize": {
              "hashFunction": "XXX",
              "salt": "XXXX"
            }
          }
        }
      }
    }
kind: ConfigMap
metadata:
  name: enrichments

resolver.yaml:

kind: ConfigMap
metadata:
  name: resolver
apiVersion: v1
data:
      resolver.json: |-
          {
            "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
            "data": {
              "cacheSize": 500,
              "repositories": [
                {
                  "name": "Iglu Central",
                  "priority": 0,
                  "vendorPrefixes": [ "com.snowplowanalytics" ],
                  "connection": {
                    "http": {
                      "uri": "http://iglucentral.com"
                    }
                  }
                },
                {
                  "name": "Iglu Central - GCP Mirror",
                  "priority": 1,
                  "vendorPrefixes": [ "com.snowplowanalytics" ],
                  "connection": {
                    "http": {
                      "uri": "http://mirror01.iglucentral.com"
                    }
                  }
                }
              ]
            }
          }
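
Note that a ConfigMap volume exposes each key under data as a file named after that key. With the ConfigMap above, the mounted file would therefore be /snowplow/resolver/resolver.json, while the deployment passes /snowplow/resolver/iglu_resolver.json to --resolver. A minimal, untested sketch of a volume definition that maps the existing key to the expected file name, assuming the rest of the deployment stays unchanged:

```yaml
# Fragment of the pod spec (spec.template.spec.volumes); untested sketch.
volumes:
- name: resolver
  configMap:
    name: resolver
    items:
    - key: resolver.json          # key as defined under data in the ConfigMap
      path: iglu_resolver.json    # file name created under the mountPath
```

Renaming the key in the ConfigMap to iglu_resolver.json, or pointing --resolver at resolver.json, would be alternative ways to make the two agree.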

This is how it is deployed using Google Cloud Build:

  - name: snowplow-cloudbuild-deploy-beam-enrich
    type: cloudbuild.py
    properties:
      steps:
        - name: 'gcr.io/cloud-builders/gcloud'
          args:
          - source
          - repos
          - clone
          - snowplow-sandbox
          - --project=XXX
        - name: "gcr.io/cloud-builders/gke-deploy"
          args:
          - run
          - --filename=snowplow-sandbox/iac-setup/k8s/beam-enrich/beam-enrich-dm.yaml
          - --location=europe-west1-d
          - --cluster=id_of_cluster

This is the error I get from the Cloud Build job:

Expanding configuration files.
Saving expanded configuration files to "output/expanded"
Finished preparing deployment.
Applying deployment.
Getting access to cluster "id_of_cluster" in "europe-west1-d".
Configuration files to be used: [{kind: Deployment, name: beam-enrich-dm} {kind: Namespace, name: default}]
Applying configuration files to cluster.
Waiting for deployed objects to be ready with timeout of 5m0s
Still waiting on 1 object(s) to be ready: [{kind: Deployment, name: beam-enrich-dm}]
Still waiting on 1 object(s) to be ready: [{kind: Deployment, name: beam-enrich-dm}]
Still waiting on 1 object(s) to be ready: [{kind: Deployment, name: beam-enrich-dm}]
Still waiting on 1 object(s) to be ready: [{kind: Deployment, name: beam-enrich-dm}]
Still waiting on 1 object(s) to be ready: [{kind: Deployment, name: beam-enrich-dm}]
Still waiting on 1 object(s) to be ready: [{kind: Deployment, name: beam-enrich-dm}]
Still waiting on 1 object(s) to be ready: [{kind: Deployment, name: beam-enrich-dm}]
Still waiting on 1 object(s) to be ready: [{kind: Deployment, name: beam-enrich-dm}]
Still waiting on 1 object(s) to be ready: [{kind: Deployment, name: beam-enrich-dm}]
Finished applying deployment.
################################################################################
> Deployed Objects
NAMESPACE    KIND          NAME              READY
default      Deployment    beam-enrich-dm    No
################################################################################
> GKE
Workloads:             https://console.cloud.google.com/kubernetes/workload?project=XXX
Services & Ingress:    https://console.cloud.google.com/kubernetes/discovery?project=XXX
Applications:          https://console.cloud.google.com/kubernetes/application?project=XXX
Configuration:         https://console.cloud.google.com/kubernetes/config?project=XX
Storage:               https://console.cloud.google.com/kubernetes/storage?project=XXX

Error: failed to apply deployment: timed out after 5m0s while waiting for deployed objects to be ready
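
The 5m0s in this message is the wait applied by gke-deploy itself, which is shorter than the progressDeadlineSeconds of 1200 set in the Deployment. A longer wait would not fix a pod that never becomes ready, but it would rule out a plain timing problem. A minimal, untested sketch of the same build step with a longer wait, assuming gke-deploy run accepts a --timeout flag with a duration value (worth confirming with gke-deploy run --help); Cloud Build's own build timeout may need raising as well:

```yaml
# Untested sketch of the gke-deploy step with a longer wait.
# Assumption: `gke-deploy run` accepts --timeout with a duration value;
# confirm with `gke-deploy run --help` before relying on it.
- name: "gcr.io/cloud-builders/gke-deploy"
  args:
  - run
  - --filename=snowplow-sandbox/iac-setup/k8s/beam-enrich/beam-enrich-dm.yaml
  - --location=europe-west1-d
  - --cluster=id_of_cluster
  - --timeout=20m
```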

Any k8s + Snowplow experts out there? :) Thanks!

Brian