Using a manifest to keep track of what data has been modeled (incremental modeling)


#1

A common pattern amongst our users is to use this as a manifest to record what data has been modeled, and what hasn’t. So, at the start of the data modeling process, you would run a query like this to identify all the runs that have occurred in the last week, and identify any runs that have not yet been processed:

INSERT INTO scratch.etl_tstamps (

  WITH recent_etl_tstamps AS ( -- return all recent ETL timestamps

    SELECT DISTINCT etl_tstamp FROM atomic.events
    WHERE collector_tstamp > DATEADD(week, -1, CURRENT_DATE) -- restrict table scan to the last week (collector_tstamp is SORTKEY) for performance reasons
    ORDER BY 1

  )

  -- store all ETL timestamps that are not in the manifest (i.e. that have NOT been processed)

  SELECT DISTINCT etl_tstamp FROM recent_etl_tstamps
  WHERE etl_tstamp NOT IN (SELECT etl_tstamp FROM datamodeling.etl_tstamps ORDER BY 1)
  ORDER BY 1

);

The values in scratch.etl_tstamps are then used to subset the data that subsequent aggregations are computed on. (So that only the new data is aggregated, and then these aggregates are combined with aggregates that were computed on previous runs.)

Once the data modeling step has been completed successfully, the etl_tstamp values that have now been processed are added to the manifest as one of the final steps in the data model:

ALTER TABLE datamodeling.etl_tstamps APPEND FROM scratch.etl_tstamps;