Since AWS launched Athena late last year, we've been using it to query the 'bad' data (i.e. data that fails validation) directly on S3. (Check out our tutorial on how to do that: Debugging bad rows in Athena.)
Athena also holds great promise for querying "good" Snowplow data in S3, potentially as an alternative to, or alongside, querying it in Redshift. Athena is a particularly attractive querying tool for users who have very large volumes of Snowplow data but highly variable query workloads, for whom paying for an always-on Redshift cluster is cost-prohibitive.
In this post, we explore how to query your Snowplow data in S3 using Athena: specifically how to query the 'enriched' rather than the 'shredded' data. (The difference will be described below.) Do note that although this is possible, there are some limitations to doing so related to the way that Snowplow data is currently serialized in S3. We are looking to improve this so that working with Snowplow data in Athena becomes more efficient going forwards: this is explored towards the end of this post.
A note on how Athena works
All tables created in Athena are external tables. In effect, whenever you create a table in Athena, you are actually simply creating the schema for that table. When you then run queries against it, that schema gets applied to the underlying data on S3 and the query is run against the resulting table.
This means that when you drop a table in Athena, you're only dropping that particular schema. The underlying data is still sitting safely in your S3 bucket, so you can experiment with different table definitions and, if they don't work out, drop them with no loss of the underlying data.
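As a quick sketch (the actual table definition follows later in this post), the drop-and-recreate cycle only ever touches the Athena catalog:

```sql
-- Removes only the table's schema from the Athena catalog;
-- the files under the table's S3 LOCATION are untouched.
DROP TABLE IF EXISTS atomic_events;

-- Re-running the CREATE EXTERNAL TABLE statement (see below) immediately
-- makes the same S3 data queryable again under the new definition.
```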
Understanding the way Snowplow data is currently stored in S3: enriched and shredded data sets
As standard, Snowplow outputs two copies of your processed data to S3:
- An 'enriched' data set. The structure of the enriched data maps to the atomic.events table in Redshift as it was before we rolled out support for self-describing events and custom contexts in Snowplow. (At that point we moved to storing the data in Redshift in a multi-table model: one table for each self-describing event and custom context.) The format of this data is a tab-delimited text file; self-describing events and custom contexts are stored in dedicated columns of the TSV as self-describing JSONs.
- A 'shredded' data set optimised for loading Redshift. This data is in a new-line delimited JSON format. We call this the 'shredded' data because the different self-describing event and custom context JSONs present in the enriched data have been 'shredded' out into individual tables for efficient loading and querying in Redshift.
In this post we explore querying the 'enriched' data with Athena. In a follow-up post we'll explain how to query the 'shredded' data.
Creating an atomic.events table in Athena
Create the following table in Athena:
CREATE EXTERNAL TABLE atomic_events (
  app_id STRING,
  platform STRING,
  etl_tstamp STRING,
  collector_tstamp STRING,
  -- ...every remaining atomic.events column, in order, as a STRING, including:
  contexts STRING,
  se_category STRING,
  unstruct_event STRING,
  derived_contexts STRING
  -- ...
)
PARTITIONED BY (run STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://your-snowplow-bucket/enriched/good/';

(The column list above is abridged: declare every column of atomic.events, in order, and point LOCATION at the S3 path that holds your enriched data.)
This schema follows the format of the unshredded event in S3. As you can see, this is pretty much the atomic.events table that you have in Redshift today, with three notable additions:
- unstruct_event is a self-describing JSON with the data that is specific to the event, for example a link click event's targetUrl;
- contexts is an array of contexts sent with the event;
- derived_contexts is an array of contexts derived during the enrichment process.
The actual format of the values of these fields is JSON, but here we're loading them as plain strings; we can parse them at query time with Athena's JSON functions.
Next, load the partitions. (The data is partitioned by run.)
MSCK REPAIR TABLE atomic_events;
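MSCK REPAIR TABLE scans the whole location for new partitions. If you only need to register a single new run, you can add it explicitly instead; a sketch, with a placeholder run ID and bucket name:

```sql
-- Register one partition explicitly rather than rescanning everything.
-- The run ID and S3 path below are placeholders: substitute your own.
ALTER TABLE atomic_events
ADD PARTITION (run = '2017-06-14-01-00-00')
LOCATION 's3://your-snowplow-bucket/enriched/good/run=2017-06-14-01-00-00/';
```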
Running a few simple queries
Now that we have our atomic_events table in Athena, we can try running a few simple queries to confirm it's working as expected. Try:
SELECT count(DISTINCT page_urlhost)
FROM atomic_events
WHERE run > '2017-06-14';
SELECT page_urlpath, count(*)
FROM atomic_events
WHERE run > '2017-06-14'
GROUP BY 1
ORDER BY 2 DESC;
It's a good idea to limit your queries by run: this reduces the amount of data scanned in S3, which
- decreases query time and
- decreases the cost of each query.
Querying data within the self-describing JSON fields
Much of the data that we want to query is likely to be in the unstruct_event field (for any self-describing event) and the contexts field (for any custom context that we send to Snowplow). Querying other fields in Athena is straightforward. Querying these two is not, because the data stored in them is in self-describing JSON format. However, it is possible.
Let's take a particular use case: we have a particular set of structured events that we capture, recognizable because se_category is set to advertisement. We know that for each of these events, a custom context is sent that contains the name of the campaign to which the ad belongs.
The JSON in the contexts field will look something like this (the inner schema URI is illustrative):

{
  "schema": "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1",
  "data": [
    {
      "schema": "iglu:com.acme/ad_context/jsonschema/1-0-0",
      "data": {
        "campaignName": "Example Campaign",
        "adName": "Example Ad",
        "adFormat": "Standard Banner"
      }
    }
  ]
}
Now we can use the json_extract_scalar() function in Athena to extract the value of a specific element:
SELECT json_extract_scalar(contexts, '$.data[0].data.campaignName') AS campaign_name
FROM atomic_events
WHERE run > '2017-06-14'
AND se_category = 'advertisement';
That should return a result like:
1 Example Campaign
We can extract the other properties of the context (adName, adFormat, etc.) in the same way.
One thing to note is that the outer data element of the contexts JSON is an array (because potentially you are sending multiple contexts), so you have to specify which element of the array you want to extract: '$.data[0].data.campaignName'. In this case we only have one context, so we want the 0th element. If you don't reference that element by index, you will get an empty string in return.
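If you're unsure how many contexts are attached to your events, Presto's json_extract and json_array_length functions (both available in Athena) can count the elements of that outer array; a sketch against the table defined above:

```sql
-- json_extract returns the inner "data" array as JSON;
-- json_array_length counts its elements (i.e. contexts per event).
SELECT json_array_length(json_extract(contexts, '$.data')) AS num_contexts,
       count(*) AS events
FROM atomic_events
WHERE run > '2017-06-14'
GROUP BY 1
ORDER BY 1;
```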
We can use this approach to query the unstruct_event field as well. Here is an example for link click events, where we surface the value of the targetUrl property:
SELECT json_extract_scalar(unstruct_event, '$.data.data.targetUrl') AS target_url
FROM atomic_events
WHERE run > '2017-06-14';
In this case the value of the outer data element is a single JSON object rather than an array, so we don't have to use an index.
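Putting this together, here is a sketch of a query that counts clicks by target URL. It assumes the atomic_events table above; the event = 'unstruct' filter restricts the query to rows where unstruct_event is populated:

```sql
-- Top target URLs by click count (sketch: adjust the run filter to taste).
SELECT json_extract_scalar(unstruct_event, '$.data.data.targetUrl') AS target_url,
       count(*) AS clicks
FROM atomic_events
WHERE run > '2017-06-14'
  AND event = 'unstruct'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
```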
A better way to query enriched Snowplow data in Athena
We plan to develop Snowplow to make querying data stored in S3 easier with Athena. In particular, we want to:
- Update the way that data is serialized so that it is easy to directly query all the fields in the event, including any fields that belong to a self-describing event or custom context.
- Make querying the data faster and more efficient by serializing it in a more compact, columnar-friendly format, e.g. Avro or Parquet.
- Update our Iglu schema tech to elegantly handle updates to Athena table definitions that need to be made as new schema versions are released.
More on the above shortly!