The most common bots (such as Googlebot) self-identify as bots in their user agent strings. Snowplow has two configurable enrichments that parse the user agent string. Both can be used to exclude bots from queries in Redshift.
1. Excluding bots using the user agent string
You can filter out bots using the following expression (thanks to Treatwell for sharing this):
WHERE useragent NOT SIMILAR TO '%(bot|crawl|slurp|spider|archiv|spinn|sniff|seo|audit|survey|pingdom|worm|capture|(browser|screen)shots|analyz|index|thumb|check|facebook|YandexBot|Twitterbot|a_archiver|facebookexternalhit|Bingbot|Googlebot|Baiduspider|360(Spider|User-agent))%'
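In context, this expression goes in the WHERE clause of whatever query you are running. For example, a minimal sketch counting non-bot events per day (this assumes the standard atomic.events columns):

```sql
-- Daily event counts, excluding user agents that match common bot patterns
SELECT
  DATE_TRUNC('day', collector_tstamp) AS day,
  COUNT(*) AS events
FROM atomic.events
WHERE useragent NOT SIMILAR TO '%(bot|crawl|slurp|spider|archiv|spinn|sniff|seo|audit|survey|pingdom|worm|capture|(browser|screen)shots|analyz|index|thumb|check|facebook|YandexBot|Twitterbot|a_archiver|facebookexternalhit|Bingbot|Googlebot|Baiduspider|360(Spider|User-agent))%'
GROUP BY 1
ORDER BY 1;
```

Note that SIMILAR TO is case-sensitive in Redshift, which is why the pattern lists both lowercase fragments (bot, crawl) and specific mixed-case names (Googlebot, YandexBot).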
2a. Excluding bots using the user agent utils enrichment
If the user agent utils enrichment is enabled, you can inspect the most common browser families by running:
SELECT br_family, COUNT(*) FROM atomic.events GROUP BY 1 ORDER BY 2 DESC
You can then exclude bots from queries by filtering on br_family != 'Robot/Spider' in the WHERE clause.
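For example, a sketch of a daily page view count with bots excluded (assuming the standard atomic.events columns; be aware that the inequality also drops rows where br_family is NULL):

```sql
-- Daily page views, excluding events the user agent utils enrichment
-- classified as bots (rows with NULL br_family are also dropped)
SELECT
  DATE_TRUNC('day', collector_tstamp) AS day,
  COUNT(*) AS page_views
FROM atomic.events
WHERE event = 'page_view'
  AND br_family != 'Robot/Spider'
GROUP BY 1
ORDER BY 1;
```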
2b. Excluding bots using the ua parser enrichment
If the ua parser enrichment is enabled, you can inspect the most common user agent families by running:
SELECT useragent_family, COUNT(*) FROM atomic.com_snowplowanalytics_snowplow_ua_parser_context_1 GROUP BY 1 ORDER BY 2 DESC
This returns a more detailed breakdown than the user agent utils enrichment provides. You can exclude bots by filtering out the most common ones. For example:
WHERE useragent_family NOT IN ('Googlebot', 'PingdomBot', 'PhantomJS', 'Slurp', 'BingPreview', 'YandexBot')
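To apply this filter to atomic.events, join the context table to the events table. A sketch, assuming the standard shredded table layout where the context table's root_id matches events.event_id:

```sql
-- Daily event counts, excluding user agent families identified as bots.
-- The ua parser context table joins to atomic.events via root_id = event_id.
SELECT
  DATE_TRUNC('day', ev.collector_tstamp) AS day,
  COUNT(*) AS events
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_ua_parser_context_1 ua
  ON ua.root_id = ev.event_id
WHERE ua.useragent_family NOT IN
  ('Googlebot', 'PingdomBot', 'PhantomJS', 'Slurp', 'BingPreview', 'YandexBot')
GROUP BY 1
ORDER BY 1;
```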
I recommend using both enrichments, as neither one is perfect, and they will complement each other.
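Combining both enrichments in a single query might look like this (same join assumption as above: the context table's root_id matches events.event_id):

```sql
-- Drop events that either enrichment classifies as a bot
SELECT COUNT(*) AS human_events
FROM atomic.events ev
JOIN atomic.com_snowplowanalytics_snowplow_ua_parser_context_1 ua
  ON ua.root_id = ev.event_id
WHERE ev.br_family != 'Robot/Spider'
  AND ua.useragent_family NOT IN
    ('Googlebot', 'PingdomBot', 'PhantomJS', 'Slurp', 'BingPreview', 'YandexBot');
```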
Events from bots that masquerade as a real browser are harder to exclude. There are third-party services that help companies identify and block bots. Let us know if you know of other approaches to excluding bot traffic.