Porting web data model to big query & duplicate event handling

I’ve adjusted the query examples within redshift / sql from https://github.com/snowplow/snowplow-web-data-model to work with big query and I had a question regarding de-duplication of events. The web-page-context example is performing this de-duplication here: https://github.com/snowplow/snowplow-web-data-model/blob/master/redshift/sql/01-page-views/00-web-page-context.sql

To replicate this in Big Query I’ve created a view that looks as follows:

 WITH prep AS (
    SELECT
      event_id, 
      dvce_created_tstamp
    FROM `project-id.dataset.events`
    GROUP BY event_id, dvce_created_tstamp
  )
SELECT
  *
FROM (
  SELECT
    *,
    ROW_NUMBER () OVER (PARTITION BY event_id ORDER BY dvce_created_tstamp) AS n
  FROM
    prep)
WHERE
  n = 1

However, I’ve noticed that the event data table includes duplicates when partitioning on the device created timestamp:

Is there a recommendation, on what to partition the event data on? The etl timestamp and device sent timestamp are unique for these 4 examples. Should I use one of those?

Missed this in my earlier search. Going to use the collector tstamp as it feels like the best fit for when the event hit us: Duplicate event ids

collector_tstamp is good to partition the table by, as for partitioning / removing duplicates keeping the first event_id is probably fine as it looks like you have an instance there of a natural duplicate.

Are you using the event fingerprint enrichment as well? This is often quite helpful in eliminating duplicates (synthetic) and is used for deduplication within the AWS pipeline to determine whether the contents of a payload is different between two events with the same event_id (in the instance where event_ids collide).

Yes, we do have the enrichment enabled, but as of yet I’m not using the fingerprint for any form of de-duplication within the views themselves.

That said, to your point after updating to partitioning by the collector_tstamp I’m in a much better shape where now what appears to be left are real duplicates. Here is another example:

Updating the web page context do remove these natural duplicates looks like this now, which I’m joining too within the other web data model views:

 WITH prep AS (
    SELECT
      event_id, 
      collector_tstamp,
      dvce_sent_tstamp
    FROM `repcore-prod.snowplow_analytics_prod.events`
    GROUP BY event_id, collector_tstamp, dvce_sent_tstamp
  )
SELECT
  *
FROM (
  SELECT
    *,
    ROW_NUMBER () OVER (PARTITION BY event_id ORDER BY collector_tstamp, dvce_sent_tstamp) AS n
  FROM
    prep)
WHERE
  n = 1