How to add custom business logic to the Snowplow enrichment process?


#1

Dumb question, but I’ve been through as much of the documentation as I can find. I have custom events stored in S3, and want to batch process them once an hour with additional validation rules and enrichment data mappings. For example: take the contents of the URL referrer field and, if it is a known value, translate it to X; otherwise translate it to Y. Where and how do I program Scalding to define this as a MapReduce function? I’ve found the EmrEtlRunner config file, but I’m not seeing where the actual business logic resides.

Apologies in advance for the newbie question…


#2

Hi @davewwright,

Forking the Scalding code isn’t recommended. Have you looked at the JavaScript Script Enrichment? It’s designed for this use case.
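
For the referrer example in the question, a minimal sketch of such a script might look like the following. It assumes the enrichment calls a function named `process(event)`, that the event exposes a `getPage_referrer()` getter, and that the returned array of self-describing JSONs is attached to the event as derived contexts. The lookup table, the labels, and the `com.acme` schema URI are hypothetical placeholders, not anything Snowplow ships.

```javascript
// Hypothetical lookup table: known referrer hosts mapped to a label ("X").
var KNOWN_REFERRERS = {
    "www.google.com": "search",
    "t.co": "social"
};

// Pull the host out of a URL with a regex, since browser/Node URL helpers
// are not assumed to be available inside the Rhino engine.
function hostOf(url) {
    var match = /^[a-z][a-z0-9+.-]*:\/\/([^\/:?#]+)/i.exec(url);
    return match ? match[1].toLowerCase() : null;
}

// Assumed entry point: called once per event by the JavaScript enrichment.
function process(event) {
    var referrer = event.getPage_referrer();          // assumed getter name
    var host = referrer ? hostOf(String(referrer)) : null;

    // Known host -> its label ("X"); anything else -> "unknown" ("Y").
    var label = (host !== null && KNOWN_REFERRERS.hasOwnProperty(host))
        ? KNOWN_REFERRERS[host]
        : "unknown";

    return [{
        schema: "iglu:com.acme/referrer_label/jsonschema/1-0-0", // hypothetical schema
        data: { label: label }
    }];
}
```

Because the mapping lives in a plain object inside the script, no external lookup is needed per event; a larger or frequently changing mapping would probably be better served by an external service.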


#3

Yes, I saw that, but it was not clear that this is the primary extension mechanism. So the Hadoop parallelism will be by event, and I should just call out to an external service that does the data translation of the various parameters? No data lookups in Hadoop this way, correct?


#4

Correct - the Hadoop parallelism is by event. We are working on adding support for writing a custom enrichment as a packaged JVM jar (so you could write it in Java or Scala), but in the meantime, yes, the JavaScript enrichment is the way to go.

If you’d rather not put the logic inside the JavaScript enrichment, in R79 you’ll be able to integrate an external service holding the logic, using the API Request Enrichment.


#5

Excellent, thanks for the help. Is it safe to assume R79 with the API enrichment will be available in May?


#6

Hi @davewwright - yes it will be available within a week or so. It is undergoing final testing now…


#7

Hello there. Is it possible to do HTTP requests within this enrichment process?


#8

Yes indeed it is! Here is the documentation:


#9

@alex, I was asking whether it is possible to do that in the Rhino JavaScript enricher :) Sorry, I shouldn’t have said ‘enrichment’; I was talking about this particular JavaScript enricher.


#10

We want to apply our logic with the JavaScript enricher, and if something goes wrong we need to be able to log it by making an HTTP call.


#11

I am pretty sure it’s possible to make an HTTP call from inside the JavaScript Enrichment - but if you can, it would be cleaner to handle the error in-band, just returning an error context which will be attached to the event for further processing downstream…

That way you can run and rerun the Snowplow enrichment process without causing side effects in other systems (in functional programming terms, a pure versus an impure function).
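
As a rough sketch of the in-band approach, the script could catch its own failures and return an error context instead of calling out over HTTP. This assumes the same `process(event)` entry point and an `getApp_id()` getter on the event; both `com.acme` schema URIs and the lowercasing rule are hypothetical stand-ins for your real logic.

```javascript
// Assumed entry point: called once per event by the JavaScript enrichment.
function process(event) {
    try {
        var appId = event.getApp_id();                 // assumed getter name
        if (!appId) {
            throw new Error("missing app_id");
        }
        // Normal path: attach the result of the custom logic.
        return [{
            schema: "iglu:com.acme/normalized/jsonschema/1-0-0",       // hypothetical schema
            data: { appId: String(appId).toLowerCase() }
        }];
    } catch (e) {
        // Error path: no side effects, just describe the failure in a
        // context that travels with the event for downstream handling.
        return [{
            schema: "iglu:com.acme/enrichment_error/jsonschema/1-0-0", // hypothetical schema
            data: { message: String(e.message || e) }
        }];
    }
}
```

Rerunning the job over the same input then yields the same contexts each time, and errors can be found downstream by filtering on the error schema.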


#12

Our whole idea is to avoid errors wherever possible and make sure every event reaches the end of the pipeline. We plan to achieve this by normalizing incoming data: for example, if we receive a string in a field where we expect an integer, we fix it on the spot, convert it to the right type, and log the correction (we keep logs in Elasticsearch, BTW). By analyzing the logs we can then fix the problems in our code. There can be many situations, especially in the early stages of developing our analytics, where just adding a new field to a self-describing event or context could send all of that data to the bad bucket, which is really not good for us.
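
To make the "fix it and log it" step concrete, here is a small illustrative helper. It is only a sketch of the coercion logic: the function name, the record shape, and the log entry fields are all hypothetical, and it does not show where in the pipeline the payload can actually be rewritten.

```javascript
// Coerce record[field] to an integer when it is a clean numeric string,
// appending an entry to `log` for every correction or failed correction.
function normalizeIntField(record, field, log) {
    var value = record[field];

    // Already an integer: nothing to fix, nothing to log.
    if (typeof value === "number" && isFinite(value) && Math.floor(value) === value) {
        return record;
    }

    // A string holding only an optional sign and digits: convert and log.
    if (typeof value === "string" && /^\s*-?\d+\s*$/.test(value)) {
        record[field] = parseInt(value, 10);
        log.push({ field: field, from: value, to: record[field], action: "coerced_to_integer" });
        return record;
    }

    // Anything else is left untouched but still recorded for later analysis.
    log.push({ field: field, from: value, action: "could_not_coerce" });
    return record;
}
```

The collected log entries could be returned as a context alongside the event rather than sent over HTTP, which keeps the normalization step free of side effects.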