Snowplow not marking GoogleBot as bot traffic?


#1

Hi guys,

We are running into some weird GoogleBot behaviour. Normally GoogleBots are identified by Snowplow, but for some reason this one isn’t. We can (of course) run our own User Agent checks and such, but I was hoping this was something already done inside of Snowplow… like it currently is with “br_type” or “br_family”.

As you can see in the screenshot, it works sometimes but not always… What is the current logic inside of Snowplow?

See screenshot below:


#2

Hi @Koen87,

Thanks for flagging this. We use a third-party library to parse the useragent string, so I’m afraid it’s not within our direct control.

I recommend having a look at this thread too: Excluding bots from queries in Redshift [tutorial]

The regex on the useragent string does catch those exceptions.
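
For illustration, a check along those lines could look like the rough sketch below. It is written in Python rather than SQL, and the pattern is a hypothetical short list of crawler tokens, not the regex from the tutorial. The sample useragent strings are only examples of the kind of values that end up in the useragent field.

```python
import re

# Hypothetical pattern: a few common crawler tokens, matched case-insensitively.
# This is an assumption for illustration, not the list used in the tutorial.
BOT_PATTERN = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)


def is_bot(useragent):
    """Return True if the raw useragent string looks like a crawler."""
    if not useragent:
        return False
    return bool(BOT_PATTERN.search(useragent))


# A Googlebot smartphone string: it carries Chrome and Android tokens as well
# as the Googlebot token, so matching on the raw string still flags it.
googlebot_mobile = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 "
    "Mobile Safari/537.36 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)

# An ordinary desktop browser string for comparison.
chrome_desktop = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

print(is_bot(googlebot_mobile))
print(is_bot(chrome_desktop))
```

Because this looks at the raw useragent string, it does not depend on what ends up in “br_type” or “br_family”. A pattern this broad can produce false positives, so treat the token list as a starting point.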

Hope this helps,

Christophe


#3

It might be that we need to upgrade the version of the useragent parsing libraries we’re using? @alex what’s the easiest way to check this?


#4

Yes, it’s actually vital to keep that library up to date.


#5

That would be great Yali. Let me know how you go.


#6

Hi guys,

Okay so we have created tickets for refreshing both of our current useragent enrichments:

  1. Scala Common Enrich: bump user-agent-utils to 1.20 #2930
  2. Scala Common Enrich: bump ua-parser to latest version #2931

Unfortunately both libraries are problematic:

  • user-agent-utils was EOLed 13 days ago (though we still have one upgrade we can do). I have reached out to the author to find out more.
  • uap-java is not available on Maven Central and is not up-to-date with the latest uap project regexps

So a fair bit of work on our side to get these libraries back on track, but it’s something we will take seriously.


#7

A couple of users have recommended we look at WURFL as an alternative (paid for) library for user agent parsing. I’ve created a ticket here:

https://github.com/snowplow/snowplow/issues/2966