Best approach to incremental data modeling? (etl_tstamp vs. staging)


#1

I read a little bit about data modeling and there are two different approaches I can find. One is in the snowplow documents and the other in the example modeling playbook for the SQL runner.

There is one approach that you take the etl_tstamp and make sure to only process every event for one etl_tstamp once.

The other approach as you can see in the SQL runner playbook example, uses different schemas like landing, atomic, derived to do data modeling. So we first add all the events into a landing schema and only after we data modeled all the events we append all the data into the atomic schema.

the 2nd approach seems really heavy weight and we would have to create every table twice and do all the copying stuff. I’m not 100% sure of how I would do the 1st approach with writing the processed etl_tstamps in a seperate table and counting all the events for a particiular etl_tstamp and I’m not sure what could be the advantages or disadvantages for both approaches.


#2

Hi @tclass,

The examples on GitHub are outdated and will be removed soon (to be replaced with improved documentation & tutorials). You are correct about the second approach - we no longer recommend creating a landing schema for those exact reasons.

Our current approach is more like what I outlined here: Making SQL data models incremental to improve performance [tutorial]

Hope this helps!

Christophe