Scala collector robots.txt

wleftwich · January 10, 2021, 8:26pm

We’ve got an SEO consultant telling us that having a lot of bot traffic on our collector endpoint is a Bad Thing. (Googlebot hits it pretty regularly.)

Two questions:

Is it possible to configure the Scala collector to respond to a request for robots.txt?
Does it make any sense to do this?

We have already suggested putting a robots.txt on the server that delivers the tracker library and the one that delivers our javascript src. Apparently that’s not good either, because then Google will ding us for having busted Javascript.

Any advice appreciated.

mike · January 10, 2021, 8:45pm

How regularly, out of interest? Getting crawled by Googlebot is pretty standard, and if it’s super frequent you can adjust how often this happens in Google Search Console.

No, the collector can’t serve a robots.txt at the moment, but it wouldn’t be particularly complicated to add, as the collector would just need to serve a static file. You can, however, set the X-Robots-Tag header on the root path - though you may need to set that for every URL if you don’t want any of them indexed.
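For illustration, here is a minimal sketch of what a "keep out" robots.txt would mean to a well-behaved crawler. It uses Python's standard `urllib.robotparser` rather than anything in the collector itself. The hostname `collector.example.com` and the helper `is_allowed` are made up for this example, and the two paths are only sample collector endpoints.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body asking every crawler to stay off every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# Hypothetical collector hostname, for illustration only.
COLLECTOR = "https://collector.example.com"


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honours robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Sample endpoint paths; a compliant bot should skip both.
    for path in ("/i", "/com.snowplowanalytics.snowplow/tp2"):
        print(path, is_allowed("Googlebot", COLLECTOR + path))
```

This only models crawlers that choose to obey robots.txt. The header route would instead mean sending something like `X-Robots-Tag: noindex` on each response, which affects indexing rather than crawling.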

I’m not an SEO expert by any means, and I don’t think it’d hurt to do this, but I also don’t think Google would penalise you in any way, just as it doesn’t penalise other API services for not having a robots.txt. You aren’t really serving any HTML content, so I can’t imagine that Google is actually going to index any of it - though it may crawl it.

wleftwich · January 11, 2021, 11:06am

Thanks Mike. I will look into the X-Robots-Tag header.

I am not convinced that Googlebot hitting our collector is a real problem, but I have to respond to the SEO Guy. You probably know how that goes …

We have a fair number of sites, and the level of bot traffic that runs our tracker and hits our collector has never been considered excessive. (Also, bots tend to cache their collector requests, which makes it pretty easy to find them in the events table.)

What I’m dealing with now is that we are launching a new site that logs 5 to 10 events on each page. SEO Guy is analyzing a dev instance, running some client application that acts like googlebot. He has raised a flag about all the traffic on the collector, and Management Is Concerned.

Thanks again.

mike · January 11, 2021, 9:45pm

I’ve created a GitHub issue for this because, having thought about it some more, I think there is a legitimate case to be made for serving a robots.txt.

Namely

Having robots crawl (Googlebot or otherwise) any of the collector endpoints means that the collector needs to respond and there’s a non-zero cost to doing so
Robots (both good and bad) may inadvertently create bad rows, mostly via sending empty payloads and creating adapter failures which then trickle downstream to any bad rows sinks. This creates additional noise + network transfer + data storage for these events which nobody is really ever going to use. Bad robots are likely to ignore robots.txt, but if it reduces the volume of well-behaved robots I think that is still beneficial overall.

In your case, though, I don’t think robots.txt will fix the problem: it would prevent crawling of the collector, but not necessarily crawling of the website and execution of the tracker JS, which will still fire bot events. For that I’d highly recommend the IAB enrichment, which flags bot traffic so that you can then filter or exclude it. In some instances it is quite useful to retain this data - particularly if you’d like to analyse how and at what frequency bots are crawling the site.
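As a rough sketch of the "flag, then filter or keep" idea: the code below assumes each enriched event has already been loaded as a dict, with the IAB enrichment output flattened into a hypothetical `iab` sub-dict carrying a boolean `spiderOrRobot` field. That layout and the function name `split_bot_traffic` are assumptions for illustration; the real shape depends on how your warehouse or loader exposes the enrichment's context.

```python
from typing import Any, Dict, Iterable, List, Tuple

Event = Dict[str, Any]


def split_bot_traffic(events: Iterable[Event]) -> Tuple[List[Event], List[Event]]:
    """Split events into (humans, bots) using the assumed IAB flag.

    Events with no IAB data are treated as human traffic here; that is a
    choice made for this sketch, not something the enrichment dictates.
    """
    humans: List[Event] = []
    bots: List[Event] = []
    for event in events:
        iab = event.get("iab") or {}  # assumed location of the IAB output
        if iab.get("spiderOrRobot"):
            bots.append(event)
        else:
            humans.append(event)
    return humans, bots


if __name__ == "__main__":
    # Made-up sample events, not real data.
    sample = [
        {"event_id": "a", "iab": {"spiderOrRobot": True}},
        {"event_id": "b", "iab": {"spiderOrRobot": False}},
        {"event_id": "c"},
    ]
    humans, bots = split_bot_traffic(sample)
    print(len(humans), "human events;", len(bots), "bot events")
```

Keeping the bot events in a separate list rather than dropping them leaves the option of analysing crawl frequency later.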

mike · January 20, 2021, 10:18pm

As a heads up @wleftwich I’ve added a pull request to add this functionality into the collector.

wleftwich · January 21, 2021, 10:30am

Thanks for the follow-up, Mike. I’ll also implement your suggestion re the IAB enrichment.
