Hosting referers.yml

seanhall · October 11, 2017, 7:17pm

I’m attempting to run the referer-parser enrichment off my own referer list. I am still running Hadoop enrich. I pulled down all of s3://snowplow-hosted-assets/ to my own S3 bucket and pointed my EMR config there, but I suspect that the reference to https://s3-eu-west-1.amazonaws.com/snowplow-hosted-assets/third-party/referer-parser/referers-latest.yml is hard-coded somewhere, and that simply pointing config.yml to my own bucket won’t change the source file reference. Has anyone tried this? How do you know whether you’re referencing your copy or the remote? And how do you update the reference if that is indeed required?

seanhall · October 11, 2017, 8:25pm

So far, after syncing my bucket with s3://snowplow-hosted-assets/, I’ve:

pulled snowplow-hadoop-enrich-1.8.0.jar down to local
updated my local hadoop enrich jarfile with my local referers.yml: jar uf snowplow-hadoop-enrich-1.8.0.jar referers.yml
I’m uploading this jarfile to my hosted assets bucket to replace the stock jarfile
I’ll again update my config.yml to point to my bucket and run our dev pipeline
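
One way to check which referers.yml a given jarfile actually carries is to treat the jar as the zip archive it is and compare the embedded file against the local copy. The following is only a rough sketch under some assumptions: the helper names (find_embedded_referers, jar_uses_local_copy) are hypothetical and not part of Snowplow, and the location of the embedded file inside the jar is not confirmed in this thread, so the code searches every entry named referers.yml rather than assuming a path.

```python
import hashlib
import zipfile
from pathlib import Path


def sha256_bytes(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def find_embedded_referers(jar_path, name="referers.yml"):
    """Return {entry_path: sha256} for every jar entry whose file name is `name`.

    Hypothetical helper: a jar is a zip archive, so the standard zipfile
    module can list and read its entries.
    """
    found = {}
    with zipfile.ZipFile(jar_path) as jar:
        for entry in jar.namelist():
            if entry.rsplit("/", 1)[-1] == name:
                found[entry] = sha256_bytes(jar.read(entry))
    return found


def jar_uses_local_copy(jar_path, local_yml):
    """True only if the jar has at least one matching entry and every such
    entry is byte-identical to the local file."""
    local_hash = sha256_bytes(Path(local_yml).read_bytes())
    embedded = find_embedded_referers(jar_path, Path(local_yml).name)
    return bool(embedded) and all(h == local_hash for h in embedded.values())
```

Calling find_embedded_referers("snowplow-hadoop-enrich-1.8.0.jar") before and after the jar uf step should show whether the update replaced the stock file or added a second copy at a different path inside the archive. In the second case the enrichment may still be reading the original.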

BenFradet · October 12, 2017, 8:52am

As you found out, the YAML is embedded in the referer parser jar. Tell us how it goes!

alex · October 12, 2017, 4:35pm

Just to confirm that the file:

https://s3-eu-west-1.amazonaws.com/snowplow-hosted-assets/third-party/referer-parser/referers-latest.yml

is us laying the foundation for a future state where the referers.yml can be dynamically retrieved at runtime. At the moment however, the referers.yml file is still embedded inside the Spark Enrich jar.

seanhall · October 16, 2017, 7:12pm

After following the above steps to update the Hadoop jarfile with the updated referers.yml, I’m afraid I saw two very concerning things: 1) the EMR enrichment time spiked 5x or more, up to 2.5 hours on 6 core instances, and 2) much of the data seems to have been lost, I presume to the bad bucket, though I haven’t finished tracing it yet. I have no idea why this fairly small change would cause either, but I reverted after just 2-3 enrichment cycles.

alex · October 16, 2017, 7:27pm

Sorry to hear that @seanhall! It’s always worth making any kind of code change in your staging environment first.

seanhall · October 16, 2017, 10:38pm

We did, and we didn’t see this issue there, but our dev volume is <1% of production.
