I’m attempting to run the referer-parser enrichment off my own referer list. I am still running Hadoop enrich. I pulled down all of s3://snowplow-hosted-assets/ to my own s3 bucket and pointed my EMR config there, but I suspect that the reference to https://s3-eu-west-1.amazonaws.com/snowplow-hosted-assets/third-party/referer-parser/referers-latest.yml is hard coded in one place, one way or another, and that simply pointing config.yml to my own bucket won’t change the source file reference. Has anyone tried this? How do you know that you’re referencing your copy or the remote? And how do you update the reference if that is indeed required?
So far, after syncing my bucket with s3://snowplow-hosted-assets/, I’ve:
- pulled snowplow-hadoop-enrich-1.8.0.jar down to local
- updated my local hadoop enrich jarfile with my local referers.yml:
jar uf snowplow-hadoop-enrich-1.8.0.jar referers.yml
- I’m uploading this jarfile to my hosted assets bucket to replace the stock jarfile
- I’ll again update my config.yml to point to my bucket and run our dev pipeline
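One way to answer the "am I referencing my copy?" question before running the pipeline is to inspect the patched jar directly. The sketch below is only an illustration, not part of Snowplow's tooling: `find_referer_entries` is a hypothetical helper name. It lists every entry in the jar whose file name is referers.yml and reports whether its contents match the local file. It does not tell you which path the referer parser actually loads at runtime, so a matching copy at an unexpected path, or a second stale copy elsewhere in the jar, is worth investigating.

```python
import hashlib
import zipfile


def sha256_bytes(data):
    """Hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def find_referer_entries(jar_path, local_yml_path, name="referers.yml"):
    """Return (entry path, matches local file) for each jar entry named `name`.

    A jar is a zip archive, so the standard zipfile module can read it.
    """
    with open(local_yml_path, "rb") as f:
        local_digest = sha256_bytes(f.read())

    results = []
    with zipfile.ZipFile(jar_path) as jar:
        for info in jar.infolist():
            # Compare on the final path component so nested copies are found too.
            if info.filename.rsplit("/", 1)[-1] == name:
                matches = sha256_bytes(jar.read(info)) == local_digest
                results.append((info.filename, matches))
    return results
```

Calling `find_referer_entries("snowplow-hadoop-enrich-1.8.0.jar", "referers.yml")` on the patched jar returns one tuple per copy found; an entry flagged `False` still holds different contents from the local file.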
As you found out, the YAML is embedded in the referer parser jar. Tell us how it goes!
Just to confirm that the file:
is us laying the foundation for a future state where the referers.yml can be dynamically retrieved at runtime. At the moment however, the referers.yml file is still embedded inside the Spark Enrich jar.
After following the above steps to update the Hadoop jarfile with the updated referers.yml, I’m afraid I saw two very concerning things: 1) the EMR enrichment time spiked 5x or more, up to 2.5 hours on 6 core instances, and 2) much of the data seems to have been lost, I presume to the bad bucket, though I haven’t finished tracing it yet. I have no idea why this fairly small change would cause either, but I reverted after just 2-3 enrichment cycles.
Sorry to hear that @seanhall! It’s always worth making any kind of code change in your staging environment first.
We did, and we didn’t see this issue there, but our dev volume is <1% of production.