Snowplow_web dbt package not producing rows in database

We are using dbt Cloud to host our installation, and we have set up the snowplow_web package and can run it successfully. The builds pass and the supporting tables (derived, manifest, scratch) are all created with their respective schemas, but none of the tables are being populated. The package continues to produce zero rows.

We are running BigQuery.

Here is our project yml entry for the package:

Snowplow

vars:
  snowplow_web:
    snowplow__atomic_schema: datalake
    snowplow__database: challenger-dev-231320
    snowplow__events: datalake.snowplow_events
    snowplow__enable_iab: false
    snowplow__enable_ua: false
    snowplow__enable_yauaa: false
    snowplow__derived_tstamp_partitioned: false

I realized I should add more information. Here are the package versions we are using:

packages:
  - package: snowplow/snowplow_web
    version: 0.6.2
  - package: snowplow/snowplow_utils
    version: 0.9.0

Hey @CSlovak, welcome to the Snowplow community!

I think what might be happening is that you didn’t set the snowplow__start_date variable, so the web package is trying to process data from the default start date, which is 2020-01-01. It also looks like you have the default value for snowplow__backfill_limit_days, which is 30, so the web package will be looking in the date range 2020-01-01 to 2020-01-31 for data to process. Since there is (presumably) no data for this date range, the web package creates empty tables but does not update its manifest to say that it has processed this range, because no actual data was processed. As a result, on the next run the web package searches the same date range again, and again finds nothing.

To resolve this immediately (if I understand the problem correctly), you’ll need to update snowplow__start_date to something more recent: the date when you first started generating data in your events table.
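As a rough sketch of that change, the vars block in your project yml might look something like the following. The date shown is only a placeholder (I don’t know when your data actually starts), so swap in the date of the earliest event in your events table. The backfill limit is shown at its default of 30 purely for illustration; you don’t need to set it explicitly.

```yaml
vars:
  snowplow_web:
    # Placeholder value: replace with the date your events table first received data.
    snowplow__start_date: '2022-01-01'
    # Default value, shown only to make the processing window explicit.
    # Each run looks at most this many days past the last processed timestamp.
    snowplow__backfill_limit_days: 30
```

Your existing variables (snowplow__atomic_schema, snowplow__database, and so on) would stay alongside these under the same snowplow_web key.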

If you’re interested, here’s a high-level explanation of how our package works and why it runs into this problem. We use a series of macros to generate and maintain a manifest table, which keeps track of each “actual” table that the web package generates (in the scratch and derived schemas) and the latest timestamp of data processed for that table. This allows us to “catch up” easily if part of a dbt run fails, and it ensures that, without changing any parameters in the dbt project, the web tables stay as up to date as possible with every run (assuming you run it frequently enough, which with the defaults means more often than every 30 days). However, when there is no data in the source tables to process, the manifest is not updated, so that late-arriving data can still be loaded into the events table and picked up by a later run.

I hope this clarifies things and helps resolve your issue, but if it doesn’t or if you have any more questions don’t hesitate to let me know!

Have a great day,
Emiel
