Enrich upgrade issues and comparison of alternatives

Hi,

Here is the thing: I have been using Beam Enrich version 1.2.3 for a long time. Today, when I tried to upgrade to the latest version (3.0.0-rc10), I got some errors. The startup script and error log are below. I am not sure of the cause. I also tried version 2.0.2, and that worked. What do I need to change if I still want to upgrade to version 3?

My other question: I noticed there is a new component called Enrich PubSub. What is the difference between Enrich PubSub and Beam Enrich? Which one do you suggest if my website will have a large volume of traffic? Thanks!

$ sudo docker run \
  -v /snowplow/config:/snowplow/config \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/credentials.json \
  snowplow/beam-enrich:$enrich_version \
  --runner=DataFlowRunner \
  --project=$project_id \
  --region=$region \
  --streaming=true \
  --gcpTempLocation=gs://$temp_dir/temp-files/ \
  --job-name=beam-enrich \
  --raw=projects/$project_id/subscriptions/good-sub \
  --enriched=projects/$project_id/topics/enriched-good \
  --bad=projects/$project_id/topics/enriched-bad \
  --pii=projects/$project_id/topics/pii-good \
  --resolver=/snowplow/config/iglu_resolver.json \
  --enrichments=/snowplow/config/enrichments/ \
  --workerMachineType=$dataflow_type
params:  /opt/docker/bin/beam-enrich --runner=DataFlowRunner --project=snowplow-test --region=us-west1 --streaming=true --gcpTempLocation=gs://snowplow-test/temp/temp-files/ --job-name=beam-enrich --raw=projects/snowplow-test/subscriptions/good-sub --enriched=projects/snowplow-test/topics/enriched-good --bad=projects/snowplow-test/topics/enriched-bad --pii=projects/snowplow-test/topics/pii-good --resolver=/snowplow/config/iglu_resolver.json --enrichments=/snowplow/config/enrichments/ --workerMachineType=n1-standard-4 threshold: 5 delay: 3 GOOGLE_APPLICATION_CREDENTIALS: /snowplow/config/credentials.json
Activated service account credentials for: [datamodeling2@snowplow-test.iam.gserviceaccount.com]
gs://snowplow-test/temp/temp-files/credentials.json
Bucket gs://snowplow-test/temp/temp-files/ exists! Proceeding.
Exception in thread "main" java.lang.NoClassDefFoundError: io/netty/internal/tcnative/AsyncSSLPrivateKeyMethod
        at io.netty.handler.ssl.SslContext.newClientContextInternal(SslContext.java:831)
        at io.netty.handler.ssl.SslContextBuilder.build(SslContextBuilder.java:611)
        at com.spotify.scio.pubsub.PubSubAdmin$GrpcClient$.newChannel(PubSubAdmin.scala:43)
        at com.spotify.scio.pubsub.PubSubAdmin$GrpcClient$.publisher(PubSubAdmin.scala:58)
        at com.spotify.scio.pubsub.PubSubAdmin$.topic(PubSubAdmin.scala:88)
        at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.checkTopicExists(Enrich.scala:327)
        at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.$anonfun$main$2(Enrich.scala:77)
        at scala.util.Either.flatMap(Either.scala:341)
        at com.snowplowanalytics.snowplow.enrich.beam.Enrich$.main(Enrich.scala:75)
        at com.snowplowanalytics.snowplow.enrich.beam.Enrich.main(Enrich.scala)
Caused by: java.lang.ClassNotFoundException: io.netty.internal.tcnative.AsyncSSLPrivateKeyMethod
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 10 more

Hi @phxtorise,

The latest released version of Enrich is 2.0.2. v3 is still in development (3.0.0-rc10 is a release candidate) and isn't recommended for use yet.

On your second question:

Enrich PubSub is a standalone JVM application that reads from and writes to PubSub topics. It can run anywhere, as long as it has permissions to access the topics: for example, as a Kubernetes job, on a GCP Compute Engine instance, or even from your laptop.
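If it helps, here is a minimal sketch of launching it with Docker, modelled on your Beam Enrich command. The image name and flags are from my reading of the Enrich docs; the config path and version tag are assumptions, so check them against the release you actually use:

```shell
# Sketch only: image tag, flags and paths should be verified against the docs.
docker run \
  -v /snowplow/config:/snowplow/config \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/credentials.json \
  snowplow/snowplow-enrich-pubsub:2.0.2 \
  --config /snowplow/config/config.hocon \
  --iglu-resolver /snowplow/config/iglu_resolver.json \
  --enrichments /snowplow/config/enrichments
```

Note that there is no Dataflow runner involved: the input subscription and output topics move out of the CLI flags and into the config.hocon file.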

Beam Enrich is built on top of Apache Beam and runs on GCP's Dataflow. It can be launched from anywhere, as long as it can communicate with Dataflow and has enough permissions to create a Dataflow job: for example, from a Kubernetes job or a Compute Engine instance.

To understand the differences in more detail, you can look at the Enrich 2.0.0 released! post. At the moment both assets are being maintained, but Beam Enrich may be deprecated at some point. Given that, it's probably better to switch to Enrich PubSub. In our experience it can handle the same data volume, and in most cases it will be cheaper.
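For reference, Enrich PubSub takes its streams from a HOCON config file rather than CLI flags. A sketch using the same subscription and topics as your Beam Enrich command (the field names are how I remember them from the Enrich configuration reference, so double-check against the docs for your version):

```hocon
{
  # Raw events are read from this subscription
  "input": {
    "subscription": "projects/snowplow-test/subscriptions/good-sub"
  }

  # Enriched, failed and PII events are written to these topics
  "output": {
    "good": { "topic": "projects/snowplow-test/topics/enriched-good" }
    "bad":  { "topic": "projects/snowplow-test/topics/enriched-bad" }
    "pii":  { "topic": "projects/snowplow-test/topics/pii-good" }
  }
}
```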

Best,

Thanks @egor! I will try Enrich PubSub.

Hi @egor. I have another question, about the loader choices on GCP. I notice there are two options: Snowplow BigQuery StreamLoader and Snowplow BigQuery Loader. I am currently using the second one, but as far as I know, Snowplow BigQuery Loader is also an Apache Beam job intended to run on Google Cloud Dataflow. So I am wondering whether it will also be deprecated one day, and what your suggestion is on loader selection. Thanks!

Hi @phxtorise, I would recommend the BigQuery StreamLoader. It runs as a standalone application, rather than as a Dataflow job, so you will save yourself a lot of $$$ on cloud costs.
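In case it's useful, a rough sketch of how the StreamLoader is launched. Like Enrich PubSub it is a plain Docker/JVM app; my recollection is that the 1.x BigQuery loader apps accept the HOCON config and Iglu resolver as base64-encoded strings, but the image name, tag, and flag shapes below are unverified assumptions, so please confirm them against the current docs:

```shell
# Sketch only: verify image tag and flag format against the BigQuery Loader docs.
docker run \
  -v /snowplow/config:/snowplow/config \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/credentials.json \
  snowplow/snowplow-bigquery-streamloader:1.0.0 \
  --config $(base64 -w 0 /snowplow/config/bigquery_config.hocon) \
  --resolver $(base64 -w 0 /snowplow/config/iglu_resolver.json)
```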

I expect the Dataflow version will be deprecated one day. We have not deprecated it yet because the StreamLoader was only released recently, so we will support both for a while longer. The StreamLoader is more in line with our vision for the future of the Snowplow GCP pipeline.