Customizing our Snowplow event representation in Redshift


#1

As we work through the base Snowplow configuration, we have received some requirements from our BI team.

We basically want to load the entire Snowplow atomic.events into our own table, or alternatively add some fields to the existing atomic.events table.

I know I can load my own data to other tables and was successful in doing so.

However, our BI team does not want to constantly join against atomic.events, and wants to partition the data daily. We are very keen on using Snowplow's enrichment features and the entire pipeline, but we want to control the loading process and simplify it for our BI workflows.

Any thoughts on how I can proceed with this?

Thanks


#2

You should be able to use SQL Runner for something like that.

I do not know the specific requirements for what your BI team is asking, but in theory you should be able to write a playbook to run an ETL process on the data in atomic and then send the output to another table, or even another schema, partitioned by day.

This way, you can keep atomic.events matching Snowplow releases, and have another set of data that the BI team can use.
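As a rough sketch, the SQL that such a playbook step runs could look like this. All schema, table, and derived-column choices here are assumptions for illustration (atomic.events columns such as event_id, collector_tstamp, and domain_userid are standard, but your BI team's column list will differ). Since Redshift has no native table partitioning, a common pattern is one suffixed table per day in a separate schema:

```sql
-- Hypothetical nightly step executed by SQL Runner (names illustrative).
-- Copies one day's worth of events into a per-day derived table.
CREATE TABLE IF NOT EXISTS derived.events_20160101 AS
SELECT
    event_id,
    collector_tstamp,
    domain_userid,
    page_urlpath,
    geo_country
FROM atomic.events
WHERE collector_tstamp >= '2016-01-01'
  AND collector_tstamp <  '2016-01-02';
```

In practice you would template the date into the table name and the WHERE clause rather than hard-coding it.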


#3

@13scoobie, I’m working in the BI department at Nir’s company.
Doing it through an SQL ETL process is very inefficient for us.
The better way is to copy the data directly into the correct table with the correct fields; it saves a lot of time.


#4

It’s hard to advise without knowing more about the use case:

  • How many rows per day are you loading in atomic.events?
  • Are you using custom contexts/shredding?
  • What’s the approximate data size per day?
  • What are the BI team joining on in atomic.events?

#5

We have a very similar use case, and SQL works extremely well for us. We load nearly 100 million events every night. Immediately after populating atomic.events we run a set of jobs that create smaller tables with only the data we care about, and then archive the atomic data older than a few days.

We use Airflow to do all of that, but it should work with SQL Runner just the same.

If that’s not good enough, you could look into processing the enriched events over MapReduce or Spark, or even plug into Kinesis to digest events in real time.
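For illustration, the nightly jobs described above might boil down to SQL along these lines. The table names and retention window are made up, not our actual jobs, and a real archive step would usually UNLOAD the old rows to S3 before deleting them:

```sql
-- 1. Build a slim table with only the columns the BI team cares about
--    (illustrative column list; DATEADD/GETDATE are standard Redshift functions)
INSERT INTO derived.events_slim
SELECT event_id, collector_tstamp, domain_userid, page_urlpath
FROM atomic.events
WHERE collector_tstamp >= DATEADD(day, -1, GETDATE());

-- 2. Trim atomic data older than a few days
--    (in practice, UNLOAD to S3 first so the raw data is archived, not lost)
DELETE FROM atomic.events
WHERE collector_tstamp < DATEADD(day, -7, GETDATE());
```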


#6

Yes - if you need to process the enriched events in, say, Scalding or Spark, you can use the Snowplow Scala Analytics SDK to work with them.


#7

Hi @alex,

Things like the link_click table I totally understand, but the client_session table I can’t really understand.
This data is needed for every event, and joining on it every time hurts performance.
The best thing would be to perform the enrichment and return the original data plus the results as JSON. With a JSON file you can use Redshift’s COPY from JSON, which loads only the fields that exist in the target table, so the field order doesn’t matter. It would also allow us to add any extra fields we like.
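For reference, Redshift’s COPY does support loading JSON directly; with the 'auto' option it matches JSON keys to column names, so extra keys are ignored and ordering is irrelevant. A sketch, with bucket, role, and table names all made up:

```sql
-- Hypothetical load of enriched events serialized as JSON.
-- JSON 'auto' maps keys to matching column names; unmatched keys are skipped.
COPY mybi.events_wide
FROM 's3://my-bucket/enriched/2016-01-01/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole'
JSON 'auto'
GZIP;
```

A jsonpaths file can be used instead of 'auto' when the JSON structure is nested and needs explicit mapping to columns.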

Share your thoughts.


#8

Not everybody uses the client_session context:

  1. Some users prefer to implement their own sessionization in their event data modeling phase
  2. Some users are emitting events from server-side and other environments where the concept of a session doesn’t apply

Our long-term roadmap is to factor everything out of atomic.events, so that it eventually becomes just a reference map to all the snapshotted entities for a given event. Long term, we can’t assume which entities will be “normal” for an event to contain.

If you know with approximate certainty what entities your events will always contain, it’s totally valid to rebuild your own “fat table” in Redshift prior to analysis - it will be more performant, at the cost of some flexibility.
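A minimal sketch of such a fat table, assuming the standard client_session context discussed above (the derived table name and selected columns are illustrative; shredded context tables in Redshift join back to atomic.events on root_id/root_tstamp):

```sql
-- Rebuild a wide "fat table" by joining atomic.events to a shredded
-- context table, so the BI team avoids the join at query time.
CREATE TABLE derived.events_fat AS
SELECT
    e.event_id,
    e.collector_tstamp,
    e.page_urlpath,
    s.session_id,
    s.session_index
FROM atomic.events e
LEFT JOIN atomic.com_snowplowanalytics_snowplow_client_session_1 s
  ON  s.root_id     = e.event_id
  AND s.root_tstamp = e.collector_tstamp;
```

The LEFT JOIN keeps events that lack the context, which matters if some trackers don’t emit it.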


#9

Hi @alex,

If you know with approximate certainty what entities your events will always contain, it’s totally valid to rebuild your own “fat table” in Redshift prior to analysis - it will be more performant, at the cost of some flexibility.

We know the structure of the table we’d like to have, but we’d also like your great enrichment fields to go in there. If that’s possible, it would solve everything.


#10

Sure thing - you can add whatever enrichments, self-describing events etc you want into your own fat table.