Can we use spectrum to query shredded data instead of enriched?


#1

Hi,

We are tracking a variety of events (standard page_view, screen_view, clicks, and our own unstruct events in custom tables) using snowplow and storing all these events in Redshift.
But right now we are tracking only ~10% of overall events due to cost of redshift.

As we all know that redshift spectrum is now available we are thinking of tracking 100% data and use spectrum to query that data directly on S3.

While searching I encountered this link using aws athena to query the good bucket on s3.

Above mentioned link talks about accessing data in enriched/good from S3. But unstruct data will be available in JSON format and will need to extract while querying.

We are thinking to query shredded/good instead of enriched/good. The advantage will be that we will have all the custom/unstruct events in separate folders and tsv file so it will be simpler to use (we have queries that generate aggregates by joining many of these tables).

But issue with this approach is we can not directly access the shredded data due to current directory structure :

e.g. :
Below is the path of shredded data of custom event ‘xyz’

/shredded/archive/run=2017-09-17-19-33-48/abc/xyz/jsonschema/1-0-0

With this structure we can not create external table as S3 path is not fixed due to ‘run=’.

So we thought of creating a utility which will copy, everyday, the shredded data from above mentioned path to a different bucket with path something like given below :

shredded/archive/abc/xyz/run=2017-09-17-19-33-48/

We will use spectrum to query data from the above S3 directory.

This is just like how data resides in enriched/good or enriched/archive.

What do you guys think of this approach? Also, if snowplow support any other way to query shredded data using spectrum then we would like to try that out.

Please share your views on this.

Thanks


#3

Hi @rahul. You may also want to check out our follow up post on using Athena to query the shredded tables:

Basically. you’ve identified the same key limitation as we did: the structure we are using to store the shredded data on S3 is not very friendly to the way Athena (and Redshift Spectrum) works.

I think your suggested approach is a sensible way of solving the issue. You may also find this post on the same topic interesting.

Let us know how you get on. We’d be keen to hear of your experiences with using Spectrum on shredded Snowplow data in S3.