At its recent re:Invent, Amazon announced the availability of Athena, which lets you query data in S3 buckets using standard SQL. I decided to give it a try to analyze the bad row data in our S3 buckets.
Using Athena is very easy - you just create a new database and point its data source at your S3 bucket. You choose the data format (I used JSON) and define your columns. I set up three columns following the format of the bad rows files:
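As a rough sketch, the table definition could look something like the following. The column names are guessed from the query further down, and the SerDe, database name, and bucket path are assumptions rather than my exact setup (the real Snowplow bad row JSON stores its errors as an array, so you may need to adjust the types):

CREATE DATABASE IF NOT EXISTS snowplow;

CREATE EXTERNAL TABLE snowplow.badrows (
  line string,            -- the raw collector payload that failed
  error string,           -- the error message attached to the row
  failure_tstamp string   -- when the row failed validation
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-badrows-sample/';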
Once you have this set up, you can begin to run your queries. Following the guidelines in the Snowplow documentation for filtering out “expected” bad rows, I used the following query, which returned a list of the errors I needed to worry about and the count of each one:
SELECT error, count(*) FROM badrows WHERE line NOT LIKE '%/crossdomain.xml%' AND line NOT LIKE '%OPTIONS%' GROUP BY error
Once I had this info, I just removed the GROUP BY and did a simple SELECT for the lines behind the errors with the largest counts.
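That drill-down query might look something like this (the error text here is a placeholder, not an actual message from my data):

SELECT line FROM badrows WHERE line NOT LIKE '%/crossdomain.xml%' AND line NOT LIKE '%OPTIONS%' AND error LIKE '%could not be validated%'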
The results from these queries came back very fast, and you only get charged by Amazon when you run a query. Unlike other methods for searching through bad row data, there is no setup of any other AWS resources and no need to keep a resource online (and incur ongoing costs).
When using Athena you are billed by the amount of data scanned by each query. To reduce costs, I copied a few days of data from my Snowplow bad rows bucket into a new bucket and set that up as my data source - this way I didn’t need to scan my entire bad rows history each time (and it also sped up the queries).
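The copy itself can be done with the AWS CLI; something along these lines works, though the bucket names and prefix below are made up for illustration:

aws s3 cp s3://my-snowplow-badrows/2016-12-01/ s3://my-badrows-sample/2016-12-01/ --recursive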
Tutorials on using Athena can be found here: https://aws.amazon.com/documentation/athena/