Stream Enrich 1.1.2 released

We are glad to release Stream Enrich 1.1.2, shortly after Stream Enrich 1.1.1. This is the first release from its new home, snowplow/stream-enrich. Common Enrich, the underlying library used by Beam Enrich and Stream Enrich, has also moved to its new home, snowplow/common-enrich.

Version 1.1.1 adds Sentry integration: if an unhandled exception is thrown by Common Enrich (although this should never happen), it is caught and sent to Sentry, if configured.

Version 1.1.2 fixes a bug introduced in 1.1.0 where the user agent of the HTTP request was used instead of the `ua` query string parameter.
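Concretely, the intended precedence (which 1.1.0 inverted) is: if the tracker sent a `ua` query string parameter, use it; otherwise fall back to the User-Agent HTTP header. A minimal Python sketch of that logic, with illustrative names only (this is not Common Enrich's actual API):

```python
def resolve_user_agent(query_params, headers):
    """Prefer the tracker-supplied `ua` parameter over the HTTP header.

    Illustrative sketch only: Stream Enrich 1.1.0 accidentally used the
    header even when `ua` was present; 1.1.2 restores this precedence.
    """
    ua_param = query_params.get("ua")
    if ua_param:
        # The tracker explicitly overrode the user agent.
        return ua_param
    # Otherwise fall back to the user agent of the HTTP request itself.
    return headers.get("User-Agent")
```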


Hey @BenB!

Thanks for the quick bug fix release.

When I upgraded to R119, the stream-enrich version in the migration guide pointed to 1.1.0, which included the user agent bug.

I’ve now updated to stream-enrich 1.1.2, but I would like to reprocess the few days of data that ended up with a wrong user agent.

I was trying to use the latest version of spark-enrich (which I know is no longer maintained). I cannot run it on EMR (with a default Spark installation) because it is compiled for Scala 2.12 and gives the following error on startup:

java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V

I wanted to rebuild a version using Scala 2.11; however, the common-enrich artifact is only published for Scala 2.12.

I managed to run spark-enrich locally, but I still need to reconfigure a few enrichments to get the same result as my stream-enrich pipeline.

Is there an easier way than running everything locally to reprocess these events?

Thanks in advance for your answer.

Hi @AcidFlow,

I see 2 solutions:

  1. Rebuild common-enrich with Scala 2.11, publish it to a custom repo (e.g. a local one), and use it to compile Spark Enrich.

  2. Provided that you archive raw data on S3, write a Flink or Spark job to read the raw data from S3 and insert it back into the raw Kinesis stream.

Please be aware that both of these methods would create duplicates, as enriched events have already been emitted (but with the wrong user agent).

Is it something that you could do?
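For anyone weighing option 2, here is a rough Python sketch of the replay job, assuming boto3 and that each archived S3 object holds one raw payload per line. Both of those are assumptions: the actual framing depends on how your sink archives raw data (e.g. LZO-compressed Thrift records), so treat the decoding step as a placeholder. Bucket, prefix, and stream names are illustrative.

```python
KINESIS_BATCH_LIMIT = 500  # PutRecords accepts at most 500 records per call


def batches(records, size=KINESIS_BATCH_LIMIT):
    """Split a list of records into PutRecords-sized chunks."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def replay(bucket, prefix, stream_name, region="eu-west-1"):
    """Read archived raw payloads from S3 and put them back on the raw stream.

    Assumes one payload per line in each S3 object; adjust the decoding
    to match your sink's actual format (e.g. LZO-compressed Thrift).
    """
    import boto3  # imported here so the batching helper stays dependency-free

    s3 = boto3.client("s3", region_name=region)
    kinesis = boto3.client("kinesis", region_name=region)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            payloads = [line for line in body.split(b"\n") if line]
            for chunk in batches(payloads):
                kinesis.put_records(
                    StreamName=stream_name,
                    Records=[
                        {"Data": p, "PartitionKey": str(i)}
                        for i, p in enumerate(chunk)
                    ],
                )
```

As noted above, replaying will re-emit events that were already enriched once, so plan for deduplication downstream.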

Okay, actually I’m a bit lucky.

EMR 6.0.0, since its beta 2, supports Spark with Scala 2.12.

So I can run spark-enrich on this EMR release.

I guess I’ll still have to tweak one or two enrichments and maybe create some bootstrap scripts to copy the GeoIP database and referer-latest.

But without these enrichments I was able to reprocess my events :slight_smile:

Good to hear! :slight_smile:

I guess I’ll still have to tweak one or two enrichments and maybe create some bootstrap scripts to copy the GeoIP database and referer-latest.

But without these enrichments I was able to reprocess my events

As long as your data comes from the main trackers (which generate the event ID), and not e.g. a webhook (in which case the event ID is generated during the enrich process), you’d likely be able to work around this using SQL.

The event ID in that case would be the same for the same event, so you could write some logic to create a single row per event containing only the values that were correctly processed, then overwrite the two sets of half-correct events with it.
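To make that merge concrete, here is a small Python sketch of the logic (in practice you would express it as a SQL query; the field names here are illustrative): for each event ID, take the bug-affected fields from the reprocessed set and everything else from the original.

```python
def merge_events(original, reprocessed, fixed_fields=("useragent",)):
    """Combine two half-correct copies of the same events.

    `original` has correct enrichments but a wrong user agent;
    `reprocessed` has the correct user agent. Rows are dicts keyed
    by the tracker-generated event_id. Illustrative sketch only.
    """
    reprocessed_by_id = {e["event_id"]: e for e in reprocessed}
    merged = []
    for event in original:
        row = dict(event)  # keep the originally correct enrichments
        fixed = reprocessed_by_id.get(event["event_id"])
        if fixed:
            for field in fixed_fields:
                row[field] = fixed[field]  # take the corrected value
        merged.append(row)
    return merged
```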

Thanks @Colm!

I like the idea, and it would work in our case as these events are issued by a Java tracker.

I’ll keep you posted on the result !

:+1: It’s probably obvious, but since a destructive operation is involved it’s best to be explicit: I would definitely output the query to a separate table and double-check the results before any overwrite. :slight_smile:

Hello folks!

Good news everyone: the reprocessed data seems to be correct! :tada:

As @Colm suggested, I only applied useragent enrichments, and then updated existing records with the re-processed values in SQL.

Thanks for the assist and have a nice day :slight_smile:

Great to hear @AcidFlow! :tada: