We recently had an opportunity to use AWS Athena to query the raw logs from the Clojure collector, and we'd like to share what we learnt. This could be useful, for example, in investigating whether the number of events hitting the collector matches the number that makes it into Redshift (allowing for events that fail validation and end up in the bad rows).
To start off with, a reminder of how Athena works. All tables in Athena are external tables. In effect, when creating a table in Athena, you are simply creating a schema for that table. That schema then gets applied to the underlying data in S3.
The first step then is to figure out what our table definition should be.
This is what an example log file record looks like:
2016-11-27 07:16:07 - 43 220.127.116.11 GET 18.104.22.168 /i 200 http://chuwy.me/scala-blocks.html Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10_11_6%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F54.0.2840.98+Safari%2F537.36 stm=1480230967340&e=pv&url=http%3A%2F%2Fchuwy.me%2Fscala-blocks.html&page=Scala%20Code%20Blocks&refr=http%3A%2F%2Fchuwy.me%2F&tv=js-2.7.0-rc2&tna=blogTracker&aid=blog&p=web&tz=Asia%2FOmsk&cs=UTF-8&f_pdf=1&f_qt=0&f_realp=0&f_wma=0&f_dir=0&f_fla=1&f_java=0&f_gears=0&f_ag=0&res=1280x800&cd=24&cookie=1&eid=1799a90f-f570-4414-b91a-b0db8f39cc2e&dtm=1480230967333&vp=1280x726&ds=1280x4315&vid=18&sid=395e4506-37a3-4074-8de2-d8c75fb17d4a&duid=1f9b3980-6619-4d75-a6c9-8253c76c3bfb&fp=531497290&cv=clj-1.1.0-tom-0.2.0&nuid=5beb1f92-d4fb-4020-905c-f659929c8ab5 - - - - -
This is a tab-delimited file: a tab character marks the end of one column and the start of the next. When we first developed the Clojure collector, we deliberately designed its log file format to mimic the CloudFront log file format. The two have since diverged, but they are still close enough that the CloudFront table definition is a useful starting point here.
We also have a description of each field in the log file in our Scala Common Enrich library.
This is the table definition we ended up with:
CREATE EXTERNAL TABLE IF NOT EXISTS raw_lines (
  date DATE,
  time STRING,
  x_edge_location STRING,    -- added for consistency with CloudFront
  bytes BIGINT,
  ip_address STRING,
  request_type STRING,
  xff_http_header STRING,
  object_uri STRING,
  request_status BIGINT,
  referer STRING,
  user_agent STRING,
  query_string STRING,
  cookie STRING,             -- added for consistency with CloudFront
  x_edge_result_type STRING, -- added for consistency with CloudFront
  x_edge_request_id STRING,  -- added for consistency with CloudFront
  content_type STRING,       -- POST support
  post_body STRING           -- POST support
)
PARTITIONED BY (run STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://path/to/raw/archive';
Once the table is created, we need to load the partitions (the data is partitioned by run). This can be done with:
MSCK REPAIR TABLE raw_lines;
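As an aside, if MSCK REPAIR TABLE is slow over a large archive, Athena also lets you register partitions individually with ALTER TABLE ... ADD PARTITION. This is a sketch: the run value and the run=... directory layout below are illustrative placeholders, so adjust them to match how your archive is actually laid out in S3.

```sql
-- Register a single run directory as a partition
-- (the run ID and S3 sub-path are hypothetical examples)
ALTER TABLE raw_lines ADD IF NOT EXISTS
  PARTITION (run = '2017-07-01-14-30-00')
  LOCATION 's3://path/to/raw/archive/run=2017-07-01-14-30-00/';
```

This is handy when new runs land regularly and you only want to add the latest one rather than rescan the whole bucket.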
Now that we have a table in Athena and the data has been loaded, we can run some queries. For example, we may want to know the number of events hitting the collector per day:
SELECT date, count(*)
FROM raw_lines
WHERE run > '2017-07-01' -- optional filter to get faster results
  AND referer LIKE 'http://chuwy.me/scala-blocks.html' -- optional filter to only count events from a specific referer
GROUP BY 1
ORDER BY 1;
Or, we may want to look for a specific event that we’ve sent in a GET request via the Iglu webhook. Provided we know the schema, something like this should do the trick:
SELECT date, count(*)
FROM raw_lines
WHERE run > '2017-07-01'
  AND position('schema=iglu%3Acom.acme%2Fcampaign%2Fjsonschema%2F1-0-0' in query_string) > 0
  -- decodes to 'schema=iglu:com.acme/campaign/jsonschema/1-0-0'
GROUP BY 1
ORDER BY 1;
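Because Athena is backed by Presto, we can also use Presto's URL functions to pull individual parameters out of the query string. Here is a sketch that breaks down daily hits by tracker event type (the `e` parameter, e.g. `pv` for page views); the `http://collector/i?` prefix is a dummy added only because url_extract_parameter expects a full URL:

```sql
-- Daily collector hits broken down by tracker event type
SELECT date,
       url_extract_parameter('http://collector/i?' || query_string, 'e') AS event_type,
       count(*)
FROM raw_lines
WHERE run > '2017-07-01' -- optional filter to get faster results
GROUP BY 1, 2
ORDER BY 1, 2;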
We can then compare these numbers with what we see in Redshift and check that they match. (And we should not forget to also take into account the events that failed validation.)
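For the Redshift side of that comparison, a daily count out of the standard Snowplow atomic.events table along these lines should work (assuming the usual atomic schema with collector_tstamp):

```sql
-- Daily event counts as loaded into Redshift
SELECT DATE(collector_tstamp) AS date, count(*)
FROM atomic.events
GROUP BY 1
ORDER BY 1;
```

The difference between the two counts per day, minus the bad rows, is what any further investigation would need to explain.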