events_staged in Data Modeling for a New Business Requirement

Hi Snowplowers,
Merry Christmas, folks!

Our team has had great fun playing around with the data modeling project using the open-source version of SQL-Runner.
A problem we recently ran into is that when a new custom model needs to be derived, we want it to aggregate from the beginning of the historical data (the atomic.events table) rather than from the events_staged data only.
After the initial run, the new model should then switch over to the events_staged source table.

What we are doing right now is maintaining two versions of the code: one for the initial run (executed manually), and the job's version, which is scheduled automatically as part of the standard flow.
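
For concreteness, here is a stripped-down sketch of the duplication this creates (the model name, schemas and columns are made up for illustration; only the FROM clause really differs between the two versions):

```sql
-- Version 1: initial run, executed manually, aggregating over the full history.
CREATE TABLE IF NOT EXISTS derived.my_custom_model AS
SELECT
    domain_userid,
    COUNT(*) AS event_count
FROM atomic.events           -- full historical data
GROUP BY 1;

-- Version 2: scheduled run in the standard flow, consuming the staged batch only.
INSERT INTO derived.my_custom_model
SELECT
    domain_userid,
    COUNT(*) AS event_count
FROM scratch.events_staged   -- current batch only
GROUP BY 1;
```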

Is there a smarter way to adapt the workflow automatically?

Thank you for sharing your thoughts, and I wish you all a good vacation!

Hi @kuangmichael07,

Glad you have enjoyed the SQL-Runner data models. Backfilling newly introduced custom modules is an interesting problem. I have just created an issue outlining a potential path for us to make the workflow a little easier. This solution would allow all previously processed events to be served to events_staged without tearing everything down. In addition, it would remove the need to maintain two versions of the model and avoid potential duplication of tables.
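
Very roughly, that could look something like the sketch below (the scratch.events_staged name and the collector_tstamp window are just for illustration, not the exact approach in the issue):

```sql
-- Sketch only: replay previously processed events into events_staged so that a
-- newly added custom module can aggregate over the full history on its next run.
-- Table names and the window condition are illustrative assumptions.
INSERT INTO scratch.events_staged
SELECT e.*
FROM atomic.events AS e
WHERE e.collector_tstamp < (
    -- everything older than the batch currently sitting in events_staged
    SELECT MIN(collector_tstamp) FROM scratch.events_staged
);
```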

The harder part is automating this. In an ideal world, when backfilling you want to run only the modules that are strictly required. Given that the custom module could consume from any number of the _staged tables (e.g. it might require data from both page_view_staged and events_staged), it is tricky to create a generalised workflow that handles this automatically.

You could either:

  • Manually create a bespoke job outlining the required playbook for each backfill. More manual work but less costly.
  • Add the new module to your standard job, switch the backfill toggle and replay all events through all modules. Less work but more costly.

There are several fixes/improvements I want to make to these models in the new year, so I will add this functionality to the list of things to explore.

As an aside, we have an equivalent dbt version of the web model, snowplow-web. This handles the backfilling of new modules automatically by making use of dbt’s graph object.
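
Very roughly, the graph object lets a macro walk every model in a package, which is what makes it possible to work out which models are newly added and need a backfill. A simplified sketch (the macro name is made up, and the real snowplow-web logic is more involved):

```sql
-- Simplified sketch of a dbt macro that uses the graph object to list all models
-- in a given package. The macro name is hypothetical.
{% macro list_package_models(package_name) %}
    {% set model_names = [] %}
    {% if execute %}
        {% for node in graph.nodes.values()
            if node.resource_type == 'model' and node.package_name == package_name %}
            {% do model_names.append(node.name) %}
        {% endfor %}
    {% endif %}
    {{ return(model_names) }}
{% endmacro %}
```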

Hope that helps & merry Christmas

Will


@Will,
Thank you so much for the reply.
I am also curious how you run the data modeling job on GCP.
Currently we are using Dataproc to host the SQL-Runner data modeling code, and Cloud Composer to schedule an hourly job that executes the bash command.
It feels like a very naive combination, and we need to copy the code from GCS to Dataproc for every code merge.

Thanks again and Happy New Year!