This is more of a question about whether my idea can work with Snowplow trackers. I would like to filter the data through an external API so that Snowplow collects only specific types of data. I tried sending data to my AWS collector through a third-party URL, but it is not working.
My question is: is this idea valid? Will Snowplow allow such filtering?
Note that I am only trying this with the web trackers.
Can you give an example of what sort of filtering logic you are trying to employ? Some filtering may require an external service, while other filtering may be more appropriate at either the tracker or load balancer level.
My filtering is external. Let me explain.
1. My collector is ‘c1.example.com’.
2. My filtering service is running at ‘filter.example.com’.
3. I am sending tracking data to ‘filter.example.com’, and this filtering server redirects all the hits to my collector URL (c1.example.com).
The issue is that I am getting 404 errors.
Is there any documentation available for the file sp.js? We could try modifying it to allow different kinds of filtering in that same file. The kinds of filtering I have in mind are:
1. Stop tracking data from users in certain countries.
2. Stop certain text that does not comply with our reporting system, such as ‘ợc+hợp+nhất|’. Such text breaks reporting files such as CSV and TSV.
3. Block certain kinds of content, for example porn.
Thanks for your time and effort.
Do you know where this data is coming from? Everything in the Snowplow pipeline should be UTF-8, so this shouldn’t break loading or processing in any part of the pipeline, but you may want to filter it out somewhere depending on how it’s being sent and whether it’s expected (you could likely do this in your schema definitions).
Is it porn URLs that are being sent through in fields or something else? This can be a trickier one to remove but I suspect the best place to do this would be a custom enrichment that flags adult sites and removes / redacts or drops the event depending on your desired behaviour.
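To make the custom enrichment idea a bit more concrete, here is a rough sketch assuming you use the Snowplow JavaScript enrichment, which calls a user-defined `process(event)` function and attaches whatever contexts it returns. The blocklist, the `isAdultHost` and `flagAdultContent` names, and the `com.example` schema URI are all hypothetical placeholders, and the code sticks to ES5 syntax because I am not assuming anything about the script engine on your Enrich version.

```javascript
// Hypothetical blocklist; in practice this would come from a maintained list.
var ADULT_HOSTS = ["adult.example.com", "xxx.example.org"];

// Hypothetical helper: true if the host equals, or is a subdomain of, a blocked host.
function isAdultHost(host) {
  if (!host) return false;
  var h = String(host).toLowerCase();
  for (var i = 0; i < ADULT_HOSTS.length; i++) {
    var blocked = ADULT_HOSTS[i];
    var suffix = "." + blocked;
    if (h === blocked || h.slice(-suffix.length) === suffix) return true;
  }
  return false;
}

// Returns the derived contexts to attach to the event: a flag for adult hosts,
// nothing otherwise. The schema URI is an assumption and would need to exist
// in your own Iglu registry.
function flagAdultContent(event) {
  if (!isAdultHost(event.getPage_urlhost())) {
    return [];
  }
  return [{
    schema: "iglu:com.example/content_flag/jsonschema/1-0-0",
    data: { category: "adult" }
  }];
}

// In the enrichment script itself the entry point must be named `process`:
//   function process(event) { return flagAdultContent(event); }
```

This version only flags events so they can be removed or redacted downstream. Whether an event can be dropped outright from inside the enrichment depends on your Enrich version, so check that before relying on it.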
Thanks for your suggestions.
Can we use the IP lookup enrichment to block some countries/cities? If not, what is this enrichment used for?
The IP lookup enrichment runs after collection so its primary use is to add geographic information to an event for analysis and filtering - rather than blocking.
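As a small illustration of filtering after collection, the sketch below drops events by the `geo_country` field that the IP lookup enrichment populates. It assumes the enriched events have already been loaded as plain objects; the `excludeCountries` name and the "XX" country code are placeholders only.

```javascript
// Keep only events whose geo_country is not in the excluded list.
// Events without a geo_country value are kept in this sketch.
function excludeCountries(events, excluded) {
  return events.filter(function (e) {
    return excluded.indexOf(e.geo_country) === -1;
  });
}

// Example with made-up events:
var sample = [
  { event_id: "a", geo_country: "XX" },
  { event_id: "b", geo_country: "GB" },
  { event_id: "c", geo_country: null }
];
var kept = excludeCountries(sample, ["XX"]);
```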
If you want to stop events before they are collected, it depends a bit more on the use case: for example, do you want to stop events because you don’t have consent to collect them, or do you just not want to collect them for some other reason?
Depending on this, you could look at blocking countries at the CDN level (with the warning that this will still be an approximation based on the IP address), or alternatively you could run some client-side code that retrieves the country from an IP address using an API (such as ipify) and then determines whether the tracker should be initialised or not.
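A minimal sketch of the client-side option might look like the following. The country lookup is passed in as a function because the exact geolocation API and its response shape are up to you; the `geo.example.com` endpoint in the usage comment is hypothetical, as are the function names and the "XX" / "ZZ" codes. It fails closed, so if the lookup errors or returns nothing, the tracker is not started.

```javascript
// Placeholder ISO 3166-1 alpha-2 codes for countries where tracking must not start.
var BLOCKED_COUNTRIES = ["XX", "ZZ"];

// True only when the country is known and not blocked.
function isTrackingAllowed(countryCode) {
  if (!countryCode) return false;
  return BLOCKED_COUNTRIES.indexOf(String(countryCode).toUpperCase()) === -1;
}

// lookupCountry: function returning a Promise of a country code (your API call).
// initTracker:   function that actually initialises the Snowplow tracker.
// Resolves to true if the tracker was initialised, false otherwise.
function initTrackerIfAllowed(lookupCountry, initTracker) {
  return lookupCountry()
    .then(function (code) {
      if (isTrackingAllowed(code)) {
        initTracker();
        return true;
      }
      return false;
    })
    .catch(function () {
      return false; // lookup failed: do not track
    });
}

// Possible browser usage (endpoint and response field are assumptions):
// initTrackerIfAllowed(
//   function () {
//     return fetch("https://geo.example.com/country")
//       .then(function (r) { return r.json(); })
//       .then(function (body) { return body.country; });
//   },
//   function () {
//     window.snowplow("newTracker", "sp", "c1.example.com", { appId: "my-site" });
//   }
// );
```

Keep in mind that this only stops the tracker on well-behaved clients, and the country is still an IP-based approximation.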
Can you please explain how this API enrichment can be implemented?
I suppose you are referring to this as the API enrichment:
Yes - you would send the parts of the event that you want to filter to this API and then you could flag the events appropriately and remove them from your database.