Approaches to accessing data in S3

Hey @timolin,

If the goal is to make the data accessible in a BI tool, then the most common path is to load it into a database first. But even then it won’t be very accessible - the pipeline produces a raw event log, and that data would normally go through a (usually SQL) data modeling process to apply business logic and aggregate it to a level where it’s easy to consume and report on in a BI tool. You can read a quick overview of that, and look through the docs for the existing standard data models, here.
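For illustration, here’s a minimal sketch of the kind of derived table a data model might produce, assuming the standard atomic.events table that the RDB loader populates (the column names below come from the canonical Snowplow events table, and the derived schema/table names are made up - adjust to your own warehouse):

```sql
-- Hypothetical derived table: daily page views per URL path,
-- aggregated from the raw event log in atomic.events.
CREATE TABLE derived.page_views_daily AS
SELECT
    DATE_TRUNC('day', derived_tstamp) AS view_date,
    page_urlpath,
    COUNT(*)                      AS page_views,
    COUNT(DISTINCT domain_userid) AS unique_users
FROM atomic.events
WHERE event = 'page_view'
GROUP BY 1, 2;
```

A BI tool pointed at a handful of tables like this is much easier to work with than the raw event log itself.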

I would recommend setting up the loader (e.g. for Redshift, the RDB shredder and loader), setting up a data model, and then using BI tools on the derived tables that the model produces.

To go through your thoughts one by one with my own personal take:

> Run the RDB shredder / loader. It appears this is the canonical approach, but I’d rather not start up and maintain instances to handle something that looks fairly simple and should be able to run serverless.

I would be careful about assuming that this is simple and should be able to run serverless. Enriched data is a hybrid TSV/JSON format. Shredding/transforming it into a DB-friendly format is a memory-intensive job, and loading it as-is will likely cause some headaches (a recent example of the kind of thing I mean).

> Use Glue to run an ETL process from S3 -> RDBMS. This seems straightforward to implement, but I’m currently having problems getting the AWS crawler to reliably build and update the data catalogs for this.

> If I can get the above to work, connect directly via Athena

You can actually query the enriched data via Athena directly already - here’s a guide.
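To give a rough idea of what that looks like: once the external table from the guide is in place (I’m assuming here it’s named atomic_events and points at the enriched bucket - the guide has the actual DDL), you can run ordinary SQL over the S3 data:

```sql
-- Simple sanity-check query over the enriched data in S3 via Athena.
-- Assumes the external table from the guide, here called atomic_events.
SELECT
    app_id,
    event,
    COUNT(*) AS event_count
FROM atomic_events
GROUP BY 1, 2
ORDER BY event_count DESC;
```

Bear in mind Athena is billed by data scanned, so for anything beyond ad-hoc exploration you’d want to partition the table rather than scan the whole bucket each time.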

If that’s your preference, there’s a potentially good pattern that I’d love to explore but haven’t had the chance to yet: if you manage to create the data modeling process using Athena/Glue to query over the S3 data, and have it output to a derived table in Redshift, then you don’t need to load the atomic data into the warehouse at all.

The tricky parts are that a) I’m not sure how incremental logic works in this pattern, and b) exploration of the data is definitely more difficult this way (and this is often underestimated - it’s very hard to come up with a good data model without being able to explore the data freely).

A nice way to handle the latter problem would be to load some of the data into the warehouse, use that to come up with the SQL model required, and then translate it to work with Glue.

If it were to work well, this solution may well end up being cheaper and more efficient than loading into the warehouse.
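To make that concrete, a rough sketch of the Athena side of such a pattern could be a CTAS that writes an aggregated derived dataset back to S3 as Parquet, which you could then COPY into Redshift (or expose via Spectrum). The table and bucket names here are made up, so treat it as a starting point rather than a tested recipe:

```sql
-- Hypothetical Athena CTAS: aggregate the enriched events in S3 into a
-- small derived dataset, written back to S3 as Parquet for loading into Redshift.
CREATE TABLE derived_page_views_daily
WITH (
    format = 'PARQUET',
    external_location = 's3://your-derived-bucket/page_views_daily/'
) AS
SELECT
    app_id,
    page_urlpath,
    COUNT(*) AS page_views
FROM atomic_events
WHERE event = 'page_view'
GROUP BY 1, 2;
```

The output could then be loaded with a Redshift COPY using FORMAT AS PARQUET, so only the modeled tables ever live in the warehouse.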

> Use Kinesis Firehose to pull the data from the enriched stream.

Unfortunately I don’t think Firehose adds any value here - the S3 loader already sinks to S3, and because of the above-mentioned need to transform the enriched format into something more database-friendly, I’m skeptical that it would be a good solution.

At least, those are my views based on the typical use cases. Hope it’s helpful!
