AWS Athena as an alternative data store


My questions are:

(a) how, and at what point in the pipeline, should I convert enriched Snowplow logs to ORC?

and (b) would anyone recommend special processing steps for depositing enriched event data into S3 for consumption by AWS Athena? As with Redshift/Postgres storage, I assume I’ll have to set up tables, but beyond that I’m unclear.

I’d like to set up Snowplow (read: I’m brand new) such that the event data land in an S3 bucket (ideally in ORC file format, as that seems to be the most performant) for consumption via AWS Athena (instead of Postgres, Redshift, et al.).

I’m working through the Snowplow docs, but haven’t been able to determine whether this is a reasonably simple departure from normal operations. Right now I have the CloudFront collector set up, but no ETL.

There’s some discussion suggesting that Athena ought to work with Snowplow, but I’ve found nothing about using it as a data store, and it’s not clear to me how the ETL process has to differ (if at all) from standard enrichment.