Customizing our Snowplow event representation in Redshift


#1

As we work through the base Snowplow configuration, we have received some requirements from our BI team.

We basically want to load the entire Snowplow atomic.events into our own table, or alternatively add some fields to the existing atomic.events table.

I know I can load my own data to other tables and was successful in doing so.

However, our BI team does not want to constantly join against atomic.events, and wants to partition the data daily. We are very keen on using Snowplow's enrichment features and the entire pipeline, but we want to control the loading process and simplify it for our BI workflows.

Any thoughts on how I can proceed with this?

Thanks


#2

You should be able to use SQL Runner for something like that.

I do not know the specific requirements for what your BI team is asking, but in theory you should be able to write a playbook to run an ETL process on the data in atomic and then send the output to another table, or even another schema, partitioned by day.

This way, you can keep atomic.events matching Snowplow releases, and have another set of data that the BI team can use.
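As a rough sketch, the SQL that such a playbook step runs could look like this. All schema, table, and derived-column choices here are assumptions for illustration (atomic.events columns such as event_id, collector_tstamp, and domain_userid are standard, but your BI team's column list will differ). Since Redshift has no native table partitioning, a common pattern is one suffixed table per day in a separate schema:

```sql
-- Hypothetical nightly step executed by SQL Runner (names illustrative).
-- Copies one day's worth of events into a per-day derived table.
CREATE TABLE IF NOT EXISTS derived.events_20160101 AS
SELECT
    event_id,
    collector_tstamp,
    domain_userid,
    page_urlpath,
    geo_country
FROM atomic.events
WHERE collector_tstamp >= '2016-01-01'
  AND collector_tstamp <  '2016-01-02';
```

In practice you would template the date into the table name and the WHERE clause rather than hard-coding it.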


#3

@13scoobie, I’m working in the BI department at Nir’s company.
Doing it through an SQL ETL process is very inefficient for us.
The better way is to copy the data directly into the correct table with the correct fields; it saves a lot of time.


#4

It’s hard to advise without knowing more about the use case:

  • How many rows per day are you loading in atomic.events?
  • Are you using custom contexts/shredding?
  • What’s the approximate data size per day?
  • What are the BI team joining on in atomic.events?

#5

We have a very similar use case, and SQL works extremely well for us. We load nearly 100 million events every night. Immediately after populating atomic.events we run a set of jobs that create smaller tables with only the data we care about, and then archive the atomic data older than a few days.

We use Airflow to do all of that, but it should work with SQL Runner just the same.

If that’s not good enough, you could look into processing the enriched events over MapReduce or Spark, or even plug into Kinesis to digest events in real time.
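For illustration, the nightly jobs described above might boil down to SQL along these lines. The table names and retention window are made up, not our actual jobs, and a real archive step would usually UNLOAD the old rows to S3 before deleting them:

```sql
-- 1. Build a slim table with only the columns the BI team cares about
--    (illustrative column list; DATEADD/GETDATE are standard Redshift functions)
INSERT INTO derived.events_slim
SELECT event_id, collector_tstamp, domain_userid, page_urlpath
FROM atomic.events
WHERE collector_tstamp >= DATEADD(day, -1, GETDATE());

-- 2. Trim atomic data older than a few days
--    (in practice, UNLOAD to S3 first so the raw data is archived, not lost)
DELETE FROM atomic.events
WHERE collector_tstamp < DATEADD(day, -7, GETDATE());
```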


#6

Yes - if you need to process the enriched events in, say, Scalding or Spark, you can use the Snowplow Scala Analytics SDK to work with them.


#7

Hi @alex,

Things like the link_click table I totally understand, but the client_session table I can’t really understand.
This data is needed for every event, and joining on it every time hurts performance.
The best thing would be to perform the enrichment and return the original data plus the results as JSON. With a JSON file you can use Redshift’s COPY from JSON, which loads only the fields that exist in the target table, so the field order doesn’t matter. It would also allow us to add any extra fields we like.
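For reference, Redshift’s COPY does support loading JSON directly; with the 'auto' option it matches JSON keys to column names, so extra keys are ignored and ordering is irrelevant. A sketch, with bucket, role, and table names all made up:

```sql
-- Hypothetical load of enriched events serialized as JSON.
-- JSON 'auto' maps keys to matching column names; unmatched keys are skipped.
COPY mybi.events_wide
FROM 's3://my-bucket/enriched/2016-01-01/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole'
JSON 'auto'
GZIP;
```

A jsonpaths file can be used instead of 'auto' when the JSON structure is nested and needs explicit mapping to columns.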

Share your thoughts.


#8

Not everybody uses the client_session context:

  1. Some users prefer to implement their own sessionization in their event data modeling phase
  2. Some users are emitting events from server-side and other environments where the concept of a session doesn’t apply

Our long-term roadmap is to factor everything out of atomic.events, so that it eventually becomes just a reference map to all the snapshotted entities for a given event. Long term, we can’t assume which entities will be “normal” for an event to contain.

If you know with approximate certainty what entities your events will always contain, it’s totally valid to rebuild your own “fat table” in Redshift prior to analysis - it will be more performant, at the cost of some flexibility.
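A minimal sketch of such a fat table, assuming the standard client_session context discussed above (the derived table name and selected columns are illustrative; shredded context tables in Redshift join back to atomic.events on root_id/root_tstamp):

```sql
-- Rebuild a wide "fat table" by joining atomic.events to a shredded
-- context table, so the BI team avoids the join at query time.
CREATE TABLE derived.events_fat AS
SELECT
    e.event_id,
    e.collector_tstamp,
    e.page_urlpath,
    s.session_id,
    s.session_index
FROM atomic.events e
LEFT JOIN atomic.com_snowplowanalytics_snowplow_client_session_1 s
  ON  s.root_id     = e.event_id
  AND s.root_tstamp = e.collector_tstamp;
```

The LEFT JOIN keeps events that lack the context, which matters if some trackers don’t emit it.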


#9

Hi @alex,

If you know with approximate certainty what entities your events will always contain, it’s totally valid to rebuild your own “fat table” in Redshift prior to analysis - it will be more performant, at the cost of some flexibility.

We know the structure of the table we’d like to have, but we’d also like your great enrichment fields to go in there. If that’s possible, it would solve everything.


#10

Sure thing - you can add whatever enrichments, self-describing events etc you want into your own fat table.