We are glad to release Stream Enrich 1.1.2, shortly after the release of Stream Enrich 1.1.1. This is the first release from its new home, snowplow/stream-enrich. Common Enrich, the underlying library used by Beam Enrich and Stream Enrich, has also moved to its new home, snowplow/common-enrich.
Version 1.1.1 adds Sentry integration: if an unhandled exception is thrown by Common Enrich (although this should never happen), it is caught and sent to Sentry, if configured.
Version 1.1.2 fixes a bug introduced in 1.1.0 where the user agent of the HTTP request was used instead of the ua query string parameter.
When I upgraded to R119, the stream-enrich version in the migration guide pointed to 1.1.0, which included the useragent bug.
I’ve now updated to stream-enrich 1.1.2, but I would like to reprocess the few days of data that ended up with the wrong useragent.
I was trying to use the latest version of spark-enrich (which I know is no longer maintained). I cannot run it on EMR (with a default Spark installation) because it is compiled for Scala 2.12, and it gives the following error when starting.
I guess I’ll still have to tweak one or two enrichments, and maybe create some bootstrap scripts to copy the GeoIP database and referer-latest.
But without these enrichments I was able to reprocess my events.
As long as your data comes from the main trackers (which generate the event ID), and not e.g. a webhook (in which case the event ID is generated during the enrichment process), you’d likely be able to work your way around this using SQL.
The event ID would then be the same for the same event, so you could write some logic to create a single row per event containing only the values that were correctly processed, then overwrite the two sets of half-correct events with that.
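The merge step could be sketched like this, using SQLite via Python purely for illustration (in practice you would run the equivalent SQL in your warehouse). The schema is a hypothetical two-column subset of the events table, and the `load` column, `WrongUA`, and `RealUA/1.0` values are all assumptions: the original load has correct enrichment fields but the buggy useragent, while the reprocessed load has the correct useragent but a skipped GeoIP enrichment.

```python
import sqlite3

# Hypothetical minimal schema: the real events table has many more columns;
# useragent and geo_country stand in for the fields that were correct in
# one load but not the other.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id TEXT, load TEXT, useragent TEXT, geo_country TEXT)"
)

# First load: wrong useragent (the 1.1.0 bug), but enrichments ran.
conn.execute("INSERT INTO events VALUES ('e1', 'original', 'WrongUA', 'NL')")
# Reprocessed load: correct useragent, but GeoIP enrichment was skipped.
conn.execute("INSERT INTO events VALUES ('e1', 'reprocessed', 'RealUA/1.0', NULL)")

# One row per event ID: the useragent from the reprocessed run,
# everything else from the original run.
merged = conn.execute("""
    SELECT o.event_id,
           r.useragent,        -- correct value from the reprocessed run
           o.geo_country       -- enriched value from the original run
    FROM events o
    JOIN events r ON r.event_id = o.event_id AND r.load = 'reprocessed'
    WHERE o.load = 'original'
""").fetchall()
print(merged)  # → [('e1', 'RealUA/1.0', 'NL')]
```

The join key is the shared event ID, which is why the approach only works when the tracker (not the enrich process) generated it.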
It’s probably obvious, but since a destructive operation is involved it’s best to be explicit: I would definitely output the query results to a separate table and double-check them before any overwrite.
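That staging step could look something like this, again with SQLite via Python as a stand-in for your warehouse; the table name `events_fixed` and the count check are assumptions, not a prescribed procedure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT, load TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("e1", "original"), ("e1", "reprocessed"),
     ("e2", "original"), ("e2", "reprocessed")],
)

# Write the deduplicated result to a separate table instead of
# overwriting the original in place.
conn.execute("""
    CREATE TABLE events_fixed AS
    SELECT DISTINCT event_id FROM events
""")

# Double check before any destructive overwrite:
# exactly one fixed row per distinct event ID.
n_fixed = conn.execute("SELECT COUNT(*) FROM events_fixed").fetchone()[0]
n_events = conn.execute("SELECT COUNT(DISTINCT event_id) FROM events").fetchone()[0]
print(n_fixed == n_events)  # → True
```

Only once counts (and a sample of rows) look right would you swap the fixed table in for the affected date range.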