Approaches to accessing data in S3

Hiya,
We have been running Snowplow for almost a year now in our AWS environment, with the full pipeline depositing data into an S3 sink. Right now, only our data scientists are able to access this data. I’d like to make this data available to a wider audience via a BI tool like QuickSight or Metabase.

Here are the things we have considered:

  • Run the RDB shredder / loader. It appears this is the canonical approach, but I’d rather not start up and maintain instances to handle something that looks fairly simple and should be able to run serverless.

  • Use Glue to run an ETL process from S3 -> RDBMS. This seems straightforward to implement, but I’m currently having problems getting the AWS crawler to reliably build and update the data catalogs for this.

  • If I can get the above to work, connect directly via Athena.

  • Use Kinesis Firehose to pull the data from the enriched stream.

Any experiences with these approaches (or others)? I see most of the discussion on this occurred around 2017 or 2018, so I’m looking for some more recent experiences given the changes in Snowplow and AWS since then.

Thanks for your feedback!

-Tim

Hey @timolin,

If the goal is to make the data accessible in a BI tool, then the most common path is to load to a database first. But it won’t be very accessible even then - the pipeline produces a raw event log, and that data would normally go through a (usually SQL) data modeling process to apply business logic and aggregate it to a level where it’s easy to consume and build reports on in a BI tool. You can read a quick overview of that, and look through the docs for existing standard data models, here.

I would recommend setting up the loader (e.g. the RDB shredder and loader for Redshift), setting up a data model, then using BI tools on the derived tables that the model produces.

To go through your thoughts one by one with my own personal take:

Run the RDB shredder / loader. It appears this is the canonical approach, but I’d rather not start up and maintain instances to handle something that looks fairly simple and should be able to run serverless.

I would be careful about assuming that this is simple and should be able to run serverless. Enriched data is a hybrid TSV <> JSON format. Shredding/transforming it into a DB-friendly format is a memory-intensive job, and loading it as-is will likely cause some headaches (a recent example of the kind of thing I mean).
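
To make that concrete, here’s a minimal sketch of what handling a single enriched line involves (the JSON field positions below are placeholders I’m passing in, not the real column order - see the enriched event format docs for that):

```python
import json

# Minimal sketch: each enriched event is one tab-separated line, but a few
# fields (contexts, unstruct_event, derived_contexts) contain self-describing
# JSON. The positions passed in are placeholders, not the real schema.
def parse_enriched_line(line, json_field_positions):
    fields = line.rstrip("\n").split("\t")
    parsed = []
    for i, value in enumerate(fields):
        if i in json_field_positions and value:
            parsed.append(json.loads(value))  # nested self-describing JSON
        else:
            parsed.append(value)              # plain scalar TSV value
    return parsed
```

A plain COPY of that TSV into a warehouse table leaves those JSON columns as opaque strings, which is essentially the problem the shredder/transformer is there to solve.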

Use Glue to run an ETL process from S3 -> RDBMS. This seems straightforward to implement, but I’m currently having problems getting the AWS crawler to reliably build and update the data catalogs for this.

If I can get the above to work, connect directly via Athena.

You can actually query the enriched data via Athena directly already - here’s a guide.
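
For example, once the table from that guide exists, a query against it might look like this (database, table, and bucket names are placeholders):

```python
import boto3

# Sketch: run a simple aggregation over the enriched events table created by
# following the Athena guide. All names and locations below are placeholders.
athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT app_id, event_name, COUNT(*) AS event_count
        FROM snowplow_enriched_events
        GROUP BY app_id, event_name
    """,
    QueryExecutionContext={"Database": "snowplow"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for completion
```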

If it’s your preference, there is a potentially good pattern that I’d love to explore but haven’t had the chance yet. If you manage to create the data modeling process using Athena/Glue to query over the S3 data, and have it output to a derived table in Redshift, then you don’t need to load the atomic data into warehouse.

The tricky parts are that a) I’m not sure how incremental logic works in this pattern, and b) exploration of the data is definitely more difficult this way (and this is often underestimated - it’s very hard to come up with a good data model without transparent exploration).

A nice way to handle that latter problem would be to load some of the data to the warehouse, use that to come up with the SQL model required, then translate that to work with Glue.

If it were to work well, this solution may end up being cheaper and more efficient than loading to the warehouse.
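
To sketch what I mean (untested, and every identifier below is a placeholder): run the model as an Athena CTAS that writes Parquet back to S3, then COPY that output into the derived table in Redshift.

```python
import boto3

# Sketch of the Athena -> S3 -> Redshift pattern. All names are placeholders.
athena = boto3.client("athena")
redshift_data = boto3.client("redshift-data")

# 1. Aggregate over the enriched data with Athena, writing Parquet to S3.
athena.start_query_execution(
    QueryString="""
        CREATE TABLE derived_sessions
        WITH (format = 'PARQUET', external_location = 's3://my-derived/sessions/') AS
        SELECT domain_sessionid,
               MIN(derived_tstamp) AS session_start,
               COUNT(*) AS event_count
        FROM snowplow_enriched_events
        GROUP BY domain_sessionid
    """,
    QueryExecutionContext={"Database": "snowplow"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)

# (In practice, wait for the CTAS query to finish before running the COPY.)

# 2. Load the Parquet output into the derived table in Redshift.
redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="analytics",
    DbUser="loader",
    Sql="""
        COPY derived.sessions
        FROM 's3://my-derived/sessions/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS PARQUET;
    """,
)
```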

Use Kinesis Firehose to pull the data from the enriched stream.

Unfortunately I don’t think Firehose adds any value here - the S3 loader already sinks to S3, and because of the above-mentioned need to transform the enriched format into something more database-friendly, I’m skeptical that it can be a good solution.

At least, those are my views based on the typical use cases. Hope it’s helpful!


Hi @Colm -

Thanks for the thoughtful feedback.

I decided to try an approach using Glue that goes something like this:

  1. Create the atomic events table in Glue
  2. Create a connection to my Redshift instance
  3. Set up a Glue ETL job to move data from the Glue table to the Redshift instance (rough script sketched after this list).
  4. Configure the job with a bookmark that should handle the issue of incremental loading.
  5. Configure the job to trigger off of a cron schedule.
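
For reference, here’s roughly what the job script for steps 3 and 4 looks like (database, table, and connection names are placeholders):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Rough sketch of steps 3-4: read the atomic events table from the Glue
# catalog and write it to Redshift. Names below are placeholders.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # job bookmarks track what's already been loaded

events = glue_context.create_dynamic_frame.from_catalog(
    database="snowplow",
    table_name="atomic_events",
    transformation_ctx="events",  # required for the bookmark to apply to this read
)

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=events,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "atomic.events", "database": "analytics"},
    redshift_tmp_dir="s3://my-glue-temp/",
)

job.commit()  # advances the bookmark so the next run only picks up new data
```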

Still in the early stages of this and will report back.