I’ve adjusted the query examples from the redshift/sql directory of https://github.com/snowplow/snowplow-web-data-model to work with BigQuery, and I have a question about de-duplication of events. The web-page-context example performs this de-duplication here: https://github.com/snowplow/snowplow-web-data-model/blob/master/redshift/sql/01-page-views/00-web-page-context.sql
To replicate this in BigQuery, I’ve created a view that looks as follows:
```sql
WITH prep AS (
  SELECT
    event_id,
    dvce_created_tstamp
  FROM `project-id.dataset.events`
  GROUP BY event_id, dvce_created_tstamp
)

SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY dvce_created_tstamp) AS n
  FROM prep
)
WHERE n = 1
```
However, I’ve noticed that the events table still includes duplicates when partitioning on the device created timestamp:
Is there a recommendation on which field to partition the event data on? The ETL timestamp and device sent timestamp are unique for these 4 examples. Should I use one of those instead?
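For concreteness, here is a sketch of the variant I’m considering, swapping in `etl_tstamp` as the de-duplication column. The column choice here is just my guess, not a documented recommendation:

```sql
-- Hypothetical variant: de-duplicate on etl_tstamp instead of
-- dvce_created_tstamp (my assumption, not an official recommendation)
WITH prep AS (
  SELECT
    event_id,
    etl_tstamp
  FROM `project-id.dataset.events`
  GROUP BY event_id, etl_tstamp
)

SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY etl_tstamp) AS n
  FROM prep
)
WHERE n = 1  -- keep exactly one row per event_id
```

The same structure would work with `dvce_sent_tstamp`; in both cases `ROW_NUMBER()` guarantees one row per `event_id`, so the question is really which timestamp gives the most meaningful ordering among duplicates.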