We are using the stream enricher pipeline. All enriched events are stored in S3 using the kinesis-s3 sink. The problem is that all files end up in the same directory. The only way we can tell the dates apart is the “-”-separated file name prefix.
This sort of file organization makes it practically impossible to use AWS Athena for ad hoc queries on that data, because there is only one partition, which can grow very large. In our case it is close to 40 TB.
I suggest writing the files belonging to a given day into their own directory. Instead of writing files like this:
The desired structure would be:
This would allow AWS Athena, or any software accessing that data, to load only the data from the dates we are interested in by specifying a path.
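To make the idea concrete, here is a rough sketch of how a flat key could be mapped to a per-day key. It is only an illustration, not the sink's actual behaviour: the file name layout (a prefix starting with `YYYY-MM-DD-`), the `enriched/` directory, and the optional Hive-style `dt=` folder name are all assumptions on my part, and `partitioned_key` is a made-up helper.

```python
import re
from typing import Optional

# Assumption: file names start with a "-"-separated date prefix, YYYY-MM-DD-...
_DATE_PREFIX = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-")


def partitioned_key(key: str, hive_style: bool = False) -> Optional[str]:
    """Return the key moved into a per-day directory, or None if no date prefix is found."""
    # Split the key into its directory part and the bare file name.
    directory, _, name = key.rpartition("/")
    match = _DATE_PREFIX.match(name)
    if match is None:
        return None
    day = "-".join(match.groups())
    # Hypothetical choice: "dt=2017-06-01" (Hive-style) or plain "2017-06-01".
    folder = f"dt={day}" if hive_style else day
    # Drop the empty directory part for keys that sit at the bucket root.
    return "/".join(part for part in (directory, folder, name) if part)


if __name__ == "__main__":
    # Made-up example key, only to show the shape of the result.
    print(partitioned_key("enriched/2017-06-01-0001-0002.gz"))
```

The same mapping could also be used to plan a migration of the existing files, by listing the current keys and copying each one to the key this function returns.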
As I understand it, the batch and real-time pipelines are supposed to work in a similar way, so both would need to be adjusted to pick up enriched data from the per-day directories.
Would the proposed change be useful to anyone besides us? How much effort would it involve, including migrating the existing data?