Data Modeling on GCP

Hi my favorite Snowplowers,
Our team is trying to figure out the best way to run data modeling on GCP.
Currently we are using Dataproc as the server to host the data-models repo code and sql-runner, and Cloud Composer (Airflow) to schedule the jobs.

But we don’t think this is a smart way of using GCP, since Dataproc can’t call the data-models scripts/SQL in GCS directly. We have to copy the code from GCS and use Apache Pig to execute the modeling process.

Would you please shed some light on the best combination of GCP tools for running data modeling? Thank you

You have a few options as to which models to run, which might influence how you run them:

- dbt: GitHub - snowplow/dbt-snowplow-web: A fully incremental model that transforms raw web event data generated by the Snowplow JavaScript tracker into a series of derived tables of varying levels of aggregation.
- dataform: GitHub - snowplow-incubator/dataform-data-models: Snowplow Incubator project for Dataform SQL data models for working with Snowplow data. Supports BigQuery only.
- sql-runner: GitHub - snowplow/data-models: Snowplow officially-supported SQL data models for working with Snowplow web and mobile behavioral data. Supports Redshift, Snowflake and BigQuery.

Of these, dbt is probably the most flexible and best-supported option to go with.
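
Once the dbt-snowplow-web package is installed in a dbt project, a run can be triggered from the CLI or, on recent dbt-core versions (1.5+), directly from Python. A minimal sketch, assuming a BigQuery target is configured in profiles.yml (the `snowplow_web` selector is the package's convention; double-check the dbt-snowplow-web README for the exact command):

```python
# Minimal sketch: invoke dbt programmatically (requires dbt-core >= 1.5).
# Assumes the snowplow_web package is installed in the current dbt project
# and a BigQuery target is configured in profiles.yml.
from dbt.cli.main import dbtRunner

runner = dbtRunner()

# Equivalent to running: dbt run --select snowplow_web
result = runner.invoke(["run", "--select", "snowplow_web"])

if not result.success:
    raise RuntimeError(f"dbt run failed: {result.exception}")
```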

Obviously with dbt or dataform you have the option to use their hosted services to orchestrate the models themselves. I know that some people use Airflow to orchestrate their models, but I don’t know much about Airflow so I’m not of much use there.
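
That said, since you’re already on Cloud Composer, a DAG that simply shells out to dbt is a common pattern. A rough, untested sketch (the dbt project path and schedule are placeholders for whatever matches your setup):

```python
# Rough sketch of a Composer/Airflow DAG that shells out to dbt.
# DBT_PROJECT_DIR is a placeholder; point it wherever your dbt project
# is synced on the Composer workers.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_PROJECT_DIR = "/home/airflow/gcs/dags/dbt"  # hypothetical path

with DAG(
    dag_id="snowplow_web_model",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",  # match your loader cadence
    catchup=False,
) as dag:
    run_models = BashOperator(
        task_id="dbt_run",
        bash_command=f"cd {DBT_PROJECT_DIR} && dbt run --select snowplow_web",
    )
```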

If you need to host orchestration yourself on GCP, one relatively simple option is to build an App Engine app which calls one of the above tools. It should be relatively simple to trigger executions of the app on a schedule (for example with Cloud Scheduler), and the configuration of dbt, sql-runner or dataform takes care of the rest.
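
As a sketch of that shape (all names and paths here are made up, and you’d pair this with a Cloud Scheduler job hitting the endpoint, plus dbt bundled into the service’s dependencies):

```python
# Sketch of a tiny App Engine service that runs dbt when hit by
# Cloud Scheduler. Paths and route names are placeholders.
import subprocess

from flask import Flask, abort

app = Flask(__name__)

@app.route("/run-models", methods=["POST"])
def run_models():
    # Cloud Scheduler can POST here on whatever cadence you need.
    proc = subprocess.run(
        ["dbt", "run", "--select", "snowplow_web"],
        cwd="/srv/dbt",  # hypothetical project location
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        abort(500, proc.stderr)
    return "dbt run complete\n"
```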

Otherwise, it’s perfectly fine to run one of those tools from any other job-scheduling environment you might use. Our internal orchestration has moved to HashiCorp Nomad, but we previously used our own self-built cron scheduler, which simply executed jobs in a bash environment. All of those worked just fine!

Long story short, the most hassle-free setup IMO will be to use a SQL-running tool like dbt to run the SQL itself; that just leaves you needing something to schedule and trigger the jobs, which can be solved in a few ways. Hope that’s helpful 🙂
