We recently had an opportunity to use AWS Athena to query the raw logs from the Clojure collector, and we'd like to share what we learnt. This could be useful, for example, in investigating whether the number of events hitting the collector matches the number that makes it into Redshift (allowing for events that fail validation and end up in the bad rows).
To start off with, a reminder of how Athena works. All tables in Athena are external tables. In effect, when creating a table in Athena, you are simply creating a schema for that table. That schema then gets applied to the underlying data in S3.
The first step then is to figure out what our table definition should be.
This is what an example log file record looks like:
2016-11-27 07:16:07 - 43 220.127.116.11 GET 18.104.22.168 /i 200 http://chuwy.me/scala-blocks.html Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10_11_6%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F54.0.2840.98+Safari%2F537.36 stm=1480230967340&e=pv&url=http%3A%2F%2Fchuwy.me%2Fscala-blocks.html&page=Scala%20Code%20Blocks&refr=http%3A%2F%2Fchuwy.me%2F&tv=js-2.7.0-rc2&tna=blogTracker&aid=blog&p=web&tz=Asia%2FOmsk&cs=UTF-8&f_pdf=1&f_qt=0&f_realp=0&f_wma=0&f_dir=0&f_fla=1&f_java=0&f_gears=0&f_ag=0&res=1280x800&cd=24&cookie=1&eid=1799a90f-f570-4414-b91a-b0db8f39cc2e&dtm=1480230967333&vp=1280x726&ds=1280x4315&vid=18&sid=395e4506-37a3-4074-8de2-d8c75fb17d4a&duid=1f9b3980-6619-4d75-a6c9-8253c76c3bfb&fp=531497290&cv=clj-1.1.0-tom-0.2.0&nuid=5beb1f92-d4fb-4020-905c-f659929c8ab5 - - - - -
This is a tab-delimited file: a tab character marks the end of one column and the start of the next. When we first developed the Clojure collector, we deliberately designed its log file format to mimic the CloudFront log file format. The two have since diverged, but they are still close enough that the CloudFront table definition is a useful starting point here.
We also have a description of each field in the log file in our Scala Common Enrich library.
This is the table definition we ended up with:
CREATE EXTERNAL TABLE IF NOT EXISTS raw_lines (
  date DATE,
  time STRING,
  x_edge_location STRING,    -- added for consistency with CloudFront
  bytes BIGINT,
  ip_address STRING,
  request_type STRING,
  xff_http_header STRING,
  object_uri STRING,
  request_status BIGINT,
  referer STRING,
  user_agent STRING,
  query_string STRING,
  cookie STRING,             -- added for consistency with CloudFront
  x_edge_result_type STRING, -- added for consistency with CloudFront
  x_edge_request_id STRING,  -- added for consistency with CloudFront
  content_type STRING,       -- POST support
  post_body STRING           -- POST support
)
PARTITIONED BY (run STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://path/to/raw/archive';
Once the table is created, we need to load the partitions (the data is partitioned by run). This can be done with:
MSCK REPAIR TABLE raw_lines;
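As an aside, if MSCK REPAIR TABLE is slow over a large archive, Athena also lets you register partitions individually with ALTER TABLE ... ADD PARTITION. This is a sketch: the run value and the run=... directory layout below are illustrative placeholders, so adjust them to match how your archive is actually laid out in S3.

```sql
-- Register a single run directory as a partition
-- (the run ID and S3 sub-path are hypothetical examples)
ALTER TABLE raw_lines ADD IF NOT EXISTS
  PARTITION (run = '2017-07-01-14-30-00')
  LOCATION 's3://path/to/raw/archive/run=2017-07-01-14-30-00/';
```

This is handy when new runs land regularly and you only want to add the latest one rather than rescan the whole bucket.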
Now that we have a table in Athena and the data has been loaded, we can run some queries. For example, we may want to know the number of events hitting the collector per day:
SELECT date, count(*)
FROM raw_lines
WHERE run > '2017-07-01' -- optional filter to get faster results
  AND referer LIKE 'http://chuwy.me/scala-blocks.html' -- optional filter to only count events from a specific referer
GROUP BY 1
ORDER BY 1;
Or, we may want to look for a specific event that we’ve sent in a GET request via the Iglu webhook. Provided we know the schema, something like this should do the trick:
SELECT date, count(*)
FROM raw_lines
WHERE run > '2017-07-01'
  AND position('schema=iglu%3Acom.acme%2Fcampaign%2Fjsonschema%2F1-0-0' in query_string) > 0
  -- decodes to 'schema=iglu:com.acme/campaign/jsonschema/1-0-0'
GROUP BY 1
ORDER BY 1;
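Because Athena is backed by Presto, we can also use Presto's URL functions to pull individual parameters out of the query string. Here is a sketch that breaks down daily hits by tracker event type (the `e` parameter, e.g. `pv` for page views); the `http://collector/i?` prefix is a dummy added only because url_extract_parameter expects a full URL:

```sql
-- Daily collector hits broken down by tracker event type
SELECT date,
       url_extract_parameter('http://collector/i?' || query_string, 'e') AS event_type,
       count(*)
FROM raw_lines
WHERE run > '2017-07-01' -- optional filter to get faster results
GROUP BY 1, 2
ORDER BY 1, 2;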
We can then compare these numbers with what we see in Redshift and check that they match. (And we should not forget to also take into account the events that failed validation.)
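For the Redshift side of that comparison, a daily count out of the standard Snowplow atomic.events table along these lines should work (assuming the usual atomic schema with collector_tstamp):

```sql
-- Daily event counts as loaded into Redshift
SELECT DATE(collector_tstamp) AS date, count(*)
FROM atomic.events
GROUP BY 1
ORDER BY 1;
```

The difference between the two counts per day, minus the bad rows, is what any further investigation would need to explain.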