Directory structure for enriched data to support AWS Athena

We are using the stream enricher pipeline. All enriched events are stored in S3 using the kinesis-s3 sink. The problem is that all files end up in the same directory; the only way we can tell the dates apart is the “-”-separated file name prefix.

This sort of file organization makes it practically impossible to use the data with AWS Athena for ad-hoc queries, as there is only one partition, which can grow very large. In our case it’s close to 40 TB.

Proposal

I suggest writing the files belonging to a given day into their own directory. Instead of writing files like this:

<YYYY>-<MM>-<DD>-<START_POINTER>-<END_POINTER>.gz

for example:
2016-04-27-49561210860930246946106208671673286283317939965855268866-49561210860930246946106208671675704134957169292924157954.gz

The desired structure would be:
/YYYY
=>/MM
  =>/DD
    =>/START_POINTER-END_POINTER.gz

This would allow AWS Athena, or any software accessing that data, to load only the data from the dates we are interested in by specifying a path.
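
For illustration, a consumer could then enumerate just the prefixes for the dates of interest instead of listing one huge flat directory. A minimal sketch, assuming the layout proposed above:

```python
# Minimal sketch: build the S3 key prefixes for a date range under the
# proposed YYYY/MM/DD layout, so only those objects need to be listed/read.
from datetime import date, timedelta

def daily_prefixes(start: date, end: date):
    """Yield one key prefix per day in [start, end], matching the proposed layout."""
    current = start
    while current <= end:
        yield current.strftime("%Y/%m/%d/")
        current += timedelta(days=1)

print(list(daily_prefixes(date(2016, 4, 27), date(2016, 4, 29))))
# ['2016/04/27/', '2016/04/28/', '2016/04/29/']
```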

Implications

As I understand it, the batch and real-time pipelines are supposed to work in a similar way, so both parts of the pipeline would need to be adjusted to pick up enriched data from the new directory structure.

Question

Would the proposed change be useful for anyone else besides us? How much effort would it involve, including migrating the existing data?

Thoughts?


I think in the long term there are definitely advantages to moving to a format that’s easier to filter by prefix.

A temporary fix might be to write a Lambda that triggers on PUTs to the bucket and copies (or moves) the object from s3://original-bucket/YYYY-MM-DD-start_pointer-end_pointer.gz to s3://new-bucket/YYYY/MM/DD/start_pointer-end_pointer.gz.
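
A rough sketch of such a Lambda, assuming the current flat key format <YYYY>-<MM>-<DD>-<START_POINTER>-<END_POINTER>.gz at the bucket root and a hypothetical target bucket name; the copy is server-side, so the object body never passes through the function:

```python
# Sketch only: re-key enriched files into a date-partitioned layout on S3 PUT events.
import urllib.parse
import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "new-bucket"  # hypothetical target bucket

def handler(event, context):
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        # Keys in S3 event notifications are URL-encoded.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # 2016-04-27-<start>-<end>.gz  ->  2016/04/27/<start>-<end>.gz
        year, month, day, rest = key.split("-", 3)
        new_key = f"{year}/{month}/{day}/{rest}"

        # Server-side copy to the new location.
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=new_key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )
        # Optionally delete the original to turn the copy into a "move".
        # s3.delete_object(Bucket=src_bucket, Key=key)
```

One thing to keep in mind: a single CopyObject call only handles objects up to 5 GB; anything larger would need a multipart copy.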

I certainly know how to fix it for us. I was actually wondering if we are the only ones with this problem.

No, you’re not the only ones. This is a pain point we’ve felt internally as well and have had to work around.


When does AWS plan to release the new data catalog service? Definitely within a 12-month roadmap, but I don’t have the details. Would it then be better to simply add/remove data catalog entries every time an S3 object is created, moved, or deleted? That way the data could be analyzed by Redshift Spectrum, Athena, and possibly other users of the catalog, and it would come cheap with a few API calls here and there.
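
For illustration, keeping such a catalog in sync from S3 object events might look roughly like the sketch below (using boto3’s Glue Data Catalog client; the database, table, and bucket names are made up):

```python
# Sketch only: register/deregister one partition per day as objects appear
# or disappear under s3://my-enriched-bucket/YYYY/MM/DD/.
import boto3

glue = boto3.client("glue")

def add_day_partition(year: str, month: str, day: str) -> None:
    """Register a daily partition for the (hypothetical) enriched-events table."""
    glue.create_partition(
        DatabaseName="snowplow",
        TableName="enriched_events",
        PartitionInput={
            "Values": [year, month, day],
            "StorageDescriptor": {
                "Location": f"s3://my-enriched-bucket/{year}/{month}/{day}/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {"field.delim": "\t"},
                },
            },
        },
    )

def drop_day_partition(year: str, month: str, day: str) -> None:
    """Remove the daily partition when its data is deleted or moved."""
    glue.delete_partition(
        DatabaseName="snowplow",
        TableName="enriched_events",
        PartitionValues=[year, month, day],
    )
```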

I searched, but didn’t find anything useful. Can you point me to some resource?

This came up at our meeting with AWS sales and support. It may be part of AWS Glue ( https://aws.amazon.com/glue ) or a stand-alone product. What was discussed was an imminent GA release sometime Q3-Q4 2017, possibly but unlikely later. This is specifically in the context of an S3 data lake and accessibility through Redshift Spectrum, which requires a data catalog. We first thought of adding S3 objects to Athena for exposure to Spectrum, but in the course of the conversation it was suggested we wait until the proper solution hits the market.

Thanks @dashirov for the link. Sounds pretty interesting, let’s see.

But either way, wouldn’t it be beneficial to have a config option controlling where the files are written?

Reconsidering AWS Lambda: as S3 does not support a “move” operation, the Lambda function would need to copy the object to the new location (a server-side copy, so the content does not have to be downloaded) and, if that succeeds, delete the old file. That would essentially double the write access to S3. Does not sound ideal to me.

@alex What do you think?

Hi @christoph-buente, @dashirov, @mike - thanks for your thoughts.

We are definitely interested in doing something here; @dilyan currently has a Discourse post on working with Snowplow data from Athena in draft, which we hope to publish soon. (After which we will circle back on this.)

The S3 Data Lake / Data Catalog sounds interesting!

The post on using Athena to query the enriched events on S3 has now been published:

http://discourse.snowplow.io/t/using-aws-athena-to-query-the-good-bucket-on-s3/1227


AWS Glue has just been announced by Amazon.

We’d like this too. Even a prefix config option where we can use date-format tokens would be good. Hopefully the internals wouldn’t have to change too much just to add a prefix config.
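
To illustrate the suggestion, a small sketch of how strftime-style date tokens in a configured prefix might expand at write time (the setting and template are hypothetical, not an existing sink option):

```python
# Hypothetical: expand strftime-style tokens in a configured key prefix,
# e.g. "enriched/%Y/%m/%d/" -> "enriched/2016/04/27/".
from datetime import datetime, timezone

def expand_prefix(prefix_template: str, when: datetime) -> str:
    """Expand date-format tokens so each day's files land in their own directory."""
    return when.strftime(prefix_template)

print(expand_prefix("enriched/%Y/%m/%d/", datetime(2016, 4, 27, tzinfo=timezone.utc)))
# -> enriched/2016/04/27/
```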
