Rebuilding custom model in dbt not working

Hi,
I made some changes to a custom model in the snowplow web dbt package, so I wanted to replay the events from the beginning to backfill the newly added columns.

According to the documentation on GitHub, this can be done with:

dbt run --select +snowplow_web_custom_incremental_model --full-refresh --vars 'models_to_remove: snowplow_web_custom_incremental_model'

I did that, and the model was removed from the manifest table, but I get this error log in dbt Cloud:

10:12:53  Running with dbt=1.1.1
10:12:53  Partial parse save file not found. Starting full parse.
10:12:57  Found 52 models, 137 tests, 8 snapshots, 0 analyses, 551 macros, 4 operations, 1 seed file, 34 sources, 0 exposures, 0 metrics
10:12:57  
10:13:09  
10:13:09  Running 2 on-run-start hooks
10:13:09  1 of 2 START hook: xxx.on-run-start.0 ................................. [RUN]
10:13:09  1 of 2 OK hook: xxx.on-run-start.0 .................................... [OK in 0.00s]
10:13:09  2 of 2 START hook: snowplow_web.on-run-start.0 ................................. [RUN]
10:13:09  2 of 2 OK hook: snowplow_web.on-run-start.0 .................................... [OK in 0.00s]
10:13:09  
10:13:09  Concurrency: 1 threads (target='prod')
10:13:09  
10:13:09  1 of 7 START incremental model dbt_snowplow_manifest.snowplow_web_base_quarantined_sessions  [RUN]
10:13:12  1 of 7 OK created incremental model dbt_snowplow_manifest.snowplow_web_base_quarantined_sessions  [INSERT 0 0 in 3.29s]
10:13:12  2 of 7 START incremental model dbt_snowplow_manifest.snowplow_web_incremental_manifest  [RUN]
10:13:16  2 of 7 OK created incremental model dbt_snowplow_manifest.snowplow_web_incremental_manifest  [INSERT 0 0 in 3.19s]
10:13:16  3 of 7 START table model dbt_scratch.snowplow_web_base_new_event_limits ........ [RUN]
10:13:17  10:13:17 + Snowplow: New Snowplow incremental model. Backfilling
10:13:17  10:13:17 + Snowplow: Processing data between '2020-01-27 00:00:00' and '2020-02-26 00:00:00'
10:13:19  3 of 7 OK created table model dbt_scratch.snowplow_web_base_new_event_limits ... [SELECT in 3.90s]
10:13:19  4 of 7 START snowplow_incremental model dbt_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest  [RUN]
10:13:26  4 of 7 ERROR creating snowplow_incremental model dbt_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest  [ERROR in 6.11s]
10:13:26  5 of 7 SKIP relation dbt_scratch.snowplow_web_base_sessions_this_run ........... [SKIP]
10:13:26  6 of 7 SKIP relation dbt_scratch.snowplow_web_base_events_this_run ............. [SKIP]
10:13:26  7 of 7 SKIP relation dbt_staging.xxx ......... [SKIP]
10:13:27  
10:13:27  Running 2 on-run-end hooks
10:13:27  1 of 2 START hook: xxx.on-run-end.0 ................................... [RUN]
10:13:27  1 of 2 OK hook: xxx.on-run-end.0 ...................................... [OK in 0.00s]
10:13:27  2 of 2 START hook: snowplow_web.on-run-end.0 ................................... [RUN]
10:13:27  2 of 2 OK hook: snowplow_web.on-run-end.0 ...................................... [OK in 0.00s]
10:13:27  
10:13:27  
10:13:27  Finished running 2 incremental models, 3 table models, 2 snowplow_incremental models, 4 hooks in 29.57s.
10:13:27  
10:13:27  Completed with 1 error and 0 warnings:
10:13:27  
10:13:27  Database Error in model snowplow_web_base_sessions_lifecycle_manifest (models/base/manifest/snowplow_web_base_sessions_lifecycle_manifest.sql)
10:13:27    relation "dbt_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest" does not exist
10:13:27    compiled SQL at target/run/snowplow_web/models/base/manifest/snowplow_web_base_sessions_lifecycle_manifest.sql
10:13:27  
10:13:27  Done. PASS=3 WARN=0 ERROR=1 SKIP=3 TOTAL=7

I tried again with the command

dbt run --select +snowplow_web_custom_incremental_model --full-refresh

but no luck either. What am I missing? Why does it say the relation is missing even though it is there, and normal dbt runs work just fine?

I’d be grateful for any help!

Hey Henry, welcome to the Snowplow discourse community!

I’m not sure why this is happening, but can you confirm that the dbt_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest table exists in your database? And what happens if you run the compiled SQL in target/run/snowplow_web/models/base/manifest/snowplow_web_base_sessions_lifecycle_manifest.sql directly against your database?

Hopefully the answers to these two questions will help us get to the bottom of the issue you’re facing!
Let me know,
Emiel


Hi Emiel,
thanks for the reply! Yes, the table exists:

[2022-07-20 12:37:10] completed in 95 ms
> SELECT t.*
  FROM dbt_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest t
  LIMIT 501
[2022-07-20 12:37:19] 500 rows retrieved starting from 1 in 9 s 202 ms (execution: 9 s 141 ms, fetching: 61 ms)

I can’t seem to access the target folder in dbt Cloud in the prod environment, but I can see the compiled SQL in the debug log of the run:

2022-07-19 10:13:24.136924 (Thread-12): 10:13:24  On model.snowplow_web.snowplow_web_base_sessions_lifecycle_manifest: /* {"app": "dbt", "dbt_version": "1.1.1", "profile_name": "user", "target_name": "prod", "node_id": "model.snowplow_web.snowplow_web_base_sessions_lifecycle_manifest"} */

  create table
    "XXXXX"."dbt_snowplow_manifest"."snowplow_web_base_sessions_lifecycle_manifest"
    diststyle key distkey (session_id)
    compound sortkey(start_tstamp)
  as (

  -- Known edge cases:
  -- 1: Rare case with multiple domain_userid per session.

  with new_events_session_ids as (
    select
      e.domain_sessionid as session_id,
      max(e.domain_userid) as domain_userid, -- Edge case 1: Arbitrary selection to avoid window function like first_value.
      min(e.collector_tstamp) as start_tstamp,
      max(e.collector_tstamp) as end_tstamp

    from "XXXXX"."snowplow_atomic"."events" e

    where
      e.domain_sessionid is not null
      and not exists (select 1 from "XXXXX"."dbt_snowplow_manifest"."snowplow_web_base_quarantined_sessions" as a where a.session_id = e.domain_sessionid) -- don't continue processing v.long sessions
      and e.dvce_sent_tstamp <= dateadd(day, 3, dvce_created_tstamp) -- don't process data that's too late
      and e.collector_tstamp >= cast('2020-01-27 00:00:00' as timestamp without time zone)
      and e.collector_tstamp <= cast('2020-02-26 00:00:00' as timestamp without time zone)
      and app_id in ('com.camparound','com.snowplow') -- filter on app_id if provided
      and cast(False as boolean) -- don't reprocess sessions that have already been processed.

    group by 1
  )

  , previous_sessions as (
    select *
    from "XXXXX"."dbt_snowplow_manifest"."snowplow_web_base_sessions_lifecycle_manifest"
    where start_tstamp >= cast('2018-01-27 00:00:00' as timestamp without time zone)
      and cast(False as boolean) -- don't reprocess sessions that have already been processed.
  )

  , session_lifecycle as (
    select
      ns.session_id,
      coalesce(self.domain_userid, ns.domain_userid) as domain_userid, -- Edge case 1: Take previous value to keep domain_userid consistent. Not deterministic but performant
      least(ns.start_tstamp, coalesce(self.start_tstamp, ns.start_tstamp)) as start_tstamp,
      greatest(ns.end_tstamp, coalesce(self.end_tstamp, ns.end_tstamp)) as end_tstamp -- BQ 1 NULL will return null hence coalesce

    from new_events_session_ids ns
    left join previous_sessions as self
      on ns.session_id = self.session_id

    where
      self.session_id is null -- process all new sessions
      or self.end_tstamp < dateadd(day, 3, self.start_tstamp) -- stop updating sessions exceeding 3 days
  )

  select
    sl.session_id,
    sl.domain_userid,
    sl.start_tstamp,
    least(dateadd(day, 3, sl.start_tstamp), sl.end_tstamp) as end_tstamp -- limit session length to max_session_days

  from session_lifecycle sl
  );
  
2022-07-19 10:13:25.935680 (Thread-12): 10:13:25  Postgres adapter: Postgres error: relation "dbt_snowplow_manifest.snowplow_web_base_sessions_lifecycle_manifest" does not exist

Best,
Henry

Hi Henry,

Apologies for the delay in getting back to you. I hope the issue has been resolved by now, but let me try to help if not.

The reason you are getting the error message is most likely the --full-refresh part of the dbt command. For safety reasons, the manifest tables are not allowed to be dropped unless the variable snowplow__allow_refresh is set to true. If you look at the compiled code you will see a create table statement (generated because of --full-refresh), but since the manifest table already exists in your warehouse, the query execution fails.
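
(For reference: if you ever do want dbt to drop and rebuild the manifest tables, you have to opt in explicitly with that variable, roughly along these lines; please check the exact invocation against the docs for your package version:

dbt run --select +snowplow_web_custom_incremental_model --full-refresh --vars 'snowplow__allow_refresh: true'

In your case, though, rebuilding the whole manifest is not what you want.)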

Now that your custom model has been removed from the incremental manifest table, and given this behaviour, I would suggest manually dropping the custom table, which you would need to reprocess anyway, and then re-running the model.
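
A minimal sketch of that manual drop, assuming your custom model is materialised as dbt_scratch.snowplow_web_custom_incremental_model (adjust the schema and relation name to your project):

drop table if exists dbt_scratch.snowplow_web_custom_incremental_model;

With the table dropped, run the following instead of the --full-refresh command: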

dbt run --select +snowplow_web_custom_incremental_model --vars 'snowplow__start_date: your_backfill_start_date'

As you run this you should see the new model reintroduced to the incremental manifest table and the backfill start (Snowplow: New Snowplow incremental model. Backfilling). After that you can run the usual dbt run --selector snowplow_web command, and from then on syncing should happen until every model is aligned (Snowplow incremental models out of sync. Syncing).

I hope this helps. From our side we will have to update the documentation; thank you for letting us know about this issue! Apologies for the inconvenience.

Have a nice weekend,

Agnes
