Excluding bots from queries in Redshift [tutorial]


#1

The Snowplow JavaScript tracker doesn't distinguish between bots and humans. If a bot (a spider, scraper, and so on) executes JavaScript, and not all do, the tracker will create events and send them to the collector. Unless it's excluded, bot traffic can skew results, depending on its volume.

The most common bots (such as Googlebot) self-identify as bots in their user agent strings. Snowplow has two configurable enrichments that parse the user agent string. Both can be used to exclude bots from queries in Redshift.

1. Excluding bots using the user agent string

You can filter out bots using the following expression (thanks to Treatwell for sharing this):

WHERE useragent NOT SIMILAR TO '%(bot|crawl|slurp|spider|archiv|spinn|sniff|seo|audit|survey|pingdom|worm|capture|(browser|screen)shots|analyz|index|thumb|check|facebook|YandexBot|Twitterbot|a_archiver|facebookexternalhit|Bingbot|Googlebot|Baiduspider|360(Spider|User-agent))%'
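As a minimal sketch, assuming the standard atomic.events columns, the filter could be applied to a daily page view count like this (note that SIMILAR TO is case-sensitive, which is why the pattern lists both lowercase fragments and capitalised bot names):

SELECT
  DATE_TRUNC('day', collector_tstamp) AS day,
  COUNT(*) AS page_views
FROM atomic.events
WHERE event = 'page_view'
  AND useragent NOT SIMILAR TO '%(bot|crawl|slurp|spider|archiv|spinn|sniff|seo|audit|survey|pingdom|worm|capture|(browser|screen)shots|analyz|index|thumb|check|facebook|YandexBot|Twitterbot|a_archiver|facebookexternalhit|Bingbot|Googlebot|Baiduspider|360(Spider|User-agent))%'
GROUP BY 1
ORDER BY 1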

2a. Excluding bots using the user agent utils enrichment

If the user agent utils enrichment is enabled, you can inspect the most common browser families by running:

SELECT
  br_family,
  COUNT(*)
FROM atomic.events
GROUP BY 1
ORDER BY 2 DESC

You can exclude bots from queries by filtering on br_family != 'Robot/Spider' in SQL.
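For example, a daily visitor count with this filter might look like the sketch below. Be aware that br_family != 'Robot/Spider' also drops rows where br_family is NULL; add OR br_family IS NULL if you want to keep those.

SELECT
  DATE_TRUNC('day', collector_tstamp) AS day,
  COUNT(DISTINCT domain_userid) AS visitors
FROM atomic.events
WHERE br_family != 'Robot/Spider'
GROUP BY 1
ORDER BY 1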

2b. Excluding bots using the ua parser enrichment

If the ua parser enrichment is enabled, you can use useragent_family in atomic.com_snowplowanalytics_snowplow_ua_parser_context_1 instead:

SELECT
  useragent_family,
  COUNT(*)
FROM atomic.com_snowplowanalytics_snowplow_ua_parser_context_1
GROUP BY 1
ORDER BY 2 DESC

This produces a more detailed list than the user agent utils enrichment provides. You can exclude bots by filtering out the most common ones. For example:

WHERE useragent_family NOT IN ('Googlebot', 'PingdomBot', 'PhantomJS', 'Slurp', 'BingPreview', 'YandexBot')

I recommend using both enrichments: neither one is perfect, and they complement each other.
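To apply both filters in one query, you can join atomic.events to the ua parser context table. Context tables join back to the parent event on root_id = event_id (a sketch, assuming both enrichments are enabled):

SELECT ev.*
FROM atomic.events AS ev
JOIN atomic.com_snowplowanalytics_snowplow_ua_parser_context_1 AS ua
  ON ua.root_id = ev.event_id
WHERE ev.br_family != 'Robot/Spider'
  AND ua.useragent_family NOT IN ('Googlebot', 'PingdomBot', 'PhantomJS', 'Slurp', 'BingPreview', 'YandexBot')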

Events from bots that masquerade as a real browser are harder to exclude. There are third-party services that help companies identify and block bots. Let us know if you know of other approaches to excluding bot traffic.

#2

It would be useful to collate sources of information (including services) for identifying if traffic is from a bot or not. The major one that we’re aware of is the IAB/ABC International Bots and Spiders List.

I believe there are also third party services that specialize in tracking (particularly malicious / ad fraud) bots.


#3

Some additional information for Snowplow users who are looking at more sophisticated identification and filtering of bot traffic:

1. We are planning to release an enrichment powered by the IAB bots and spiders list

The relevant issue is here. This will make it possible for any Snowplow user who has purchased the list to automatically use that data to enrich their Snowplow event data. We should be able to offer this enrichment to our Managed Service users with our own copy of the list.

2. More sophisticated techniques for identifying bots

We’ve had some interesting conversations with a number of users about the techniques they’ve employed. I’ll share them below:

2.1 Identify bots based on IP address ranges

E.g. AWS provides a list of IP ranges here:

https://ip-ranges.amazonaws.com/ip-ranges.json

It is highly likely that any traffic from any of those IPs is a bot rather than a human.

2.2 Identify bots based on event velocity

This is not always straightforward, because individual bots will often take measures to appear to be multiple users (e.g. clearing cookies, switching IP addresses). However, if you do see e.g. thousands of events per second for an individual cookie ID or IP address / browser fingerprint combination, that is very likely to be a bot.

Consistency of event volume is another indication that a user is a bot rather than a human. If a user is performing an action at regular time intervals (e.g. once a second), they're more likely to be a machine. Human traffic tends to have periods of low and high event volumes as users move from scanning a website to browsing a particular article in detail, for example.
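A simple velocity check can be sketched in SQL, bucketing events per cookie ID per minute. The threshold (60 events per minute here) is illustrative and should be tuned to your own traffic:

SELECT
  domain_userid,
  DATE_TRUNC('minute', collector_tstamp) AS minute,
  COUNT(*) AS events_in_minute
FROM atomic.events
GROUP BY 1, 2
HAVING COUNT(*) > 60
ORDER BY 3 DESC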

2.3 Identify bots based on the ratio of event_fingerprint to event_id

Events captured by the JS tracker have an event_fingerprint, which is a hash of all the client-side data points, and an event_id, which is a UUID generated in the JS.

Often bots won't support the JS libraries necessary to generate proper UUIDs, so multiple different events (identifiable by their different event_fingerprint values) will be recorded against the same event_id. In this case, the user is almost certainly a bot, or a user on a very old browser.
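A sketch of this check in SQL, assuming the event fingerprint enrichment is enabled so that event_fingerprint is populated:

-- event_ids that were reused across multiple distinct events
SELECT
  event_id,
  COUNT(DISTINCT event_fingerprint) AS distinct_fingerprints
FROM atomic.events
GROUP BY 1
HAVING COUNT(DISTINCT event_fingerprint) > 1
ORDER BY 2 DESC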

2.4 Suspicious metrics e.g. session time / page views

If a user has spent 8 days constantly on your website without a break, or has viewed 20k web pages, chances are they’re not a human.
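Those two signals can be computed per cookie ID with a sketch like the following (thresholds match the examples above and are illustrative):

SELECT
  domain_userid,
  COUNT(*) AS page_views,
  DATEDIFF(hour, MIN(collector_tstamp), MAX(collector_tstamp)) AS hours_active
FROM atomic.events
WHERE event = 'page_view'
GROUP BY 1
HAVING COUNT(*) > 20000
    OR DATEDIFF(hour, MIN(collector_tstamp), MAX(collector_tstamp)) > 192  -- 8 days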

2.5 Honeypot events

Some of our users have created events that are impossible for a human genuinely interacting with the website UI to trigger, but that a bot might trigger. Hidden form fields are a good example.

Once a bot has triggered the honeypot event, its cookie ID and IP address can be readily identified and filtered out.
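A sketch of the exclusion, assuming the honeypot fires a structured event with a hypothetical se_category of 'honeypot' (the inner SELECT filters out NULL cookie IDs so the NOT IN behaves as expected):

SELECT *
FROM atomic.events
WHERE domain_userid NOT IN (
  SELECT DISTINCT domain_userid
  FROM atomic.events
  WHERE se_category = 'honeypot'  -- hypothetical honeypot event category
    AND domain_userid IS NOT NULL
)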

2.6 Building out bot detection using Machine Learning

The above techniques can be used to deterministically identify a set of bots.

That data can then be used as a training set for a supervised ML algorithm. This is particularly good for better classification based on features like event velocity and session time.

Other approaches?

If you’ve developed other approaches or successfully worked with third parties to better identify bot traffic, let us know on the thread!


#4

Thanks Yali. For bots and bad traffic signals generally, I have worked with the Augur.io APIs, which are a handy way of combining their data with Snowplow data to get really granular in identifying bad traffic. Here are some of the fields you can use at the event level, setting up a custom context to store the Augur data.

I found it is cheaper to just pay for the API, store the extra data in an Augur-specific table in Redshift, and simply exclude domain_userid values in SQL using a basic JOIN.

UID: "xxxxxxxxxxxxxxxxxxx"
ID: "xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
fingerprint: Object
hasTimezoneMismatch: false
isBot: false
isIncognito: false
isProxied: false
isTor: false
isUsingTorExitNode: false
manufacturer: "Generic"
name: "OS X PC"
type: "Desktop"
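The JOIN described above might be sketched like this, assuming a hypothetical table atomic.augur_data keyed on domain_userid with an is_bot flag (an anti-join keeps only events with no bot-flagged Augur record):

SELECT ev.*
FROM atomic.events AS ev
LEFT JOIN atomic.augur_data AS au        -- hypothetical table name
  ON au.domain_userid = ev.domain_userid
 AND au.is_bot = TRUE
WHERE au.domain_userid IS NULL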

#5

There is also a service called udger. They maintain an extensive, regularly updated list of web crawlers. They offer a subscription model that allows a static download of those user agents, and they offer an API that can be queried for them. If there were an enrichment that either made use of a local file (which would need to be updated regularly) or the API (caching the results), a standard subscription for €10 a month would be sufficient, I guess. Have a look at https://udger.com/resources/ua-list/crawlers

Christoph


#6

Nice! Thanks @digdeep and @christoph-buente for sharing!