Directory structure for enriched data to support AWS Athena


#1

We are using the stream enricher pipeline. All enriched events are being store in a S3 using the kinesis-s3 sink. The problem is, that all files end up in the same directory. The only way we can tell the dates apart are the “-” separated file name prefix.

This sort of file organization makes it practically impossible to be used with AWS Athena to perform adhoc queries on that data as there is only one partition which might grow really big. In our example it’s close to 40TB.

Proposal

I suggest to write files belonging to a given day into it’s own directory. Instead of writing files like this:

<YYYY>-<MM>-<DD>-<START_POINTER>-<END_POINTER>.gz

for example:
2016-04-27-49561210860930246946106208671673286283317939965855268866-49561210860930246946106208671675704134957169292924157954.gz

The desired structure would be:
/YYYY
=>/MM
  =>/DD
    =>/START_POINTER-END_POINTER.gz

This would allow AWS Athena, or any software accessing that data, to load only the data from the dates we are interested in by specifying a path.

Implications

As i understand, that batch and real time pipeline are supposed to work in a similar way, it would mean that both parts of the pipeline need to be adjusted to pick up enriched data from the directory.

Question

Would the proposed change be useful for anyone else except us? How much of an effort would that imply including to migrate existing data?

Thoughts?


Using AWS Athena to query the shredded events
Can we use spectrum to query shredded data instead of enriched?
#4

I think long term there’s definitely some advantages to moving to a format that’s easier to filter based on prefix.

A temporary fix might be to write a Lambda that triggers on PUTs to the bucket and copies (or moves) the key name from s3://original-bucket/YYYY-MM-DD-start_pointer-end_pointer.gz to s3://new-bucket/YYYY/MM/DD/start_pointer-end_pointer.gz?


#5

I certainly know, how to fix it for us. I was actually wondering, if we are the only ones with this problem.


#6

No, you’re not the only ones, this is a pain point we’ve felt internally as well and having to work around.


#7

When does AWS plan to release the new data catalog service? Definitely within a 12 month roadmap, but I don’t have the details. Would it then be better to simply add/remove data catalog entries every time s3 object is created or moved or deleted? This way data can be analyzed by redshift spectrum, athena and possibly other users of the catalog. And it would come cheap with a few API calls here and there.


#8

I searched, but didn’t find anything useful. Can you point me to some resource?


#9

This came up at our meeting with AWS sales and support. It may be a part of AWS Glue ( https://aws.amazon.com/glue ) or a stand-alone product. What was discussed, was an imminent GA release sometime Q3-Q4 2017, possibly but unlikely later. This is specifically in a context of s3 data lake and accessibility through redshift spectrum, which requires a data catalog. We first thought of adding s3 objects to Athena for exposure to Spectrum, but through the conversation it was suggested to wait until the proper solution hits the market.


#10

Thanks @dashirov for the link. Sounds pretty interesting, let’s see.

But in any way, wouldn’t it be beneficial, to have an option for your config, where to write the files?

Reconsidering AWS lambda: As AWS S3 does not support a “move” command, the lambda function would need to download the content of the file and upload it back to a different location. And in case it was successful, the lambda function would need to delete the old file. That would essentially double the write access to S3. Does not sound ideal to me.

@alex What do you think?


#11

Hi @christoph-buente, @dashirov, @mike - thanks for your thoughts.

We are definitely interested in doing something here; @dilyan currently has a Discourse post on working with Snowplow data from Athena in draft, which we hope to publish soon. (After which we will circle back on this.)

The S3 Data Lake / Data Catalog sounds interesting!


#12

The post on using Athena to query the enriched events on S3 has now been published:

http://discourse.snowplowanalytics.com/t/using-aws-athena-to-query-the-good-bucket-on-s3/1227


#13

Aws glue has just been announced by amazon.


#14

We’d like this too. even if we get a prefix config where we can use date format tokens that would be good. hopefully the internals don’t have to change too much for just adding prefix config