I’ve adjusted the query examples from the redshift/sql directory of https://github.com/snowplow/snowplow-web-data-model to work with BigQuery, and I have a question about de-duplication of events. The web-page-context example performs this de-duplication here: https://github.com/snowplow/snowplow-web-data-model/blob/master/redshift/sql/01-page-views/00-web-page-context.sql
To replicate this in BigQuery, I’ve created a view that looks as follows:
```sql
WITH prep AS (
  SELECT
    event_id,
    dvce_created_tstamp
  FROM `project-id.dataset.events`
  GROUP BY event_id, dvce_created_tstamp
)

SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY dvce_created_tstamp) AS n
  FROM prep
)
WHERE n = 1
```
However, I’ve noticed that the events table still includes duplicates when partitioning on the device created timestamp:
Is there a recommendation on which field to partition the event data on? The ETL timestamp and device sent timestamp are unique for these 4 examples. Should I use one of those instead?
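For concreteness, here is a sketch of the variant I’m considering, swapping in `etl_tstamp` as the de-duplication column. The column choice here is just my guess, not a documented recommendation:

```sql
-- Hypothetical variant: de-duplicate on etl_tstamp instead of
-- dvce_created_tstamp (my assumption, not an official recommendation)
WITH prep AS (
  SELECT
    event_id,
    etl_tstamp
  FROM `project-id.dataset.events`
  GROUP BY event_id, etl_tstamp
)

SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY etl_tstamp) AS n
  FROM prep
)
WHERE n = 1  -- keep exactly one row per event_id
```

The same structure would work with `dvce_sent_tstamp`; in both cases `ROW_NUMBER()` guarantees one row per `event_id`, so the question is really which timestamp gives the most meaningful ordering among duplicates.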