The Snowplow Javascript tracker doesn’t distinguish between bots and humans. If a bot (spiders, scrapers, …) executes Javascript, and not all do, the tracker will create events and send them to the collector. Bot traffic can skew results, depending on the amount of traffic, if it’s not excluded.
The most common bots (such as Googlebot) self-identify as bots in their user agent strings. Snowplow has 2 configurable enrichments that parse the user agent string. Both can be used to exclude bots form queries in Redshift.
1. Excluding bots using the user agent string
You can filter out bots using the following expression (thanks to Treatwell for sharing this):
WHERE useragent NOT SIMILAR TO '%(bot|crawl|slurp|spider|archiv|spinn|sniff|seo|audit|survey|pingdom|worm|capture|(browser|screen)shots|analyz|index|thumb|check|facebook|YandexBot|Twitterbot|a_archiver|facebookexternalhit|Bingbot|Googlebot|Baiduspider|360(Spider|User-agent))%'
2a. Excluding bots using the user agent utils enrichment
If the user agent utils enrichment is enabled, you can inspect the most common browser families by running:
SELECT
br_family,
COUNT(*)
FROM atomic.events
GROUP BY 1
ORDER BY 2 DESC
You can exclude bots from queries by filtering on br_family != 'Robot/Spider'
in SQL.
2b. Excluding bots using the ua parser enrichment
If the ua parser enrichment is enabled, you can use useragent_family
in atomic.com_snowplowanalytics_snowplow_ua_parser_context_1
instead:
SELECT
useragent_family,
COUNT(*)
FROM atomic.com_snowplowanalytics_snowplow_ua_parser_context_1
GROUP BY 1
ORDER BY 2 DESC
This will be a more detailed list than the user agent utils enrichment provides. You can exclude bots by filtering out the most common ones. For example:
WHERE useragent_family NOT IN ('Googlebot', 'PingdomBot', 'PhantomJS', 'Slurp', 'BingPreview', 'YandexBot')
I recommend using both enrichments, as neither one is perfect, and they will complement each other.
Events from bots that mask as a real browser are harder to exclude. There exist 3rd party services that help companies identify and block bots. Let us know if you know of other approaches to exclude bot traffic.