Duplicate event ids

Colm · April 12, 2018, 10:30am

It is normal to see a small number of duplicates - the Snowplow pipeline is designed to avoid losing any data, so it errs on the side of including a duplicate event rather than risk excluding a valid one. When using the data, it is generally easier to deal with duplicates than with missing data.

This article, which explains duplicates, might be helpful in understanding this. More recent versions of the pipeline will deduplicate events, but enabling deduplication across batches involves setting up a DynamoDB instance, which will incur additional cost (there aren’t normally many cross-batch duplicates). If the below doesn’t lead to a solution, you should post which version of the pipeline and which collector you’re using, as well as which trackers you’re seeing this data come from.

As @jrpeck1989 referenced, a rule of thumb is to always join on event_id = root_id AND collector_tstamp = root_tstamp to avoid a Cartesian product in your join (i.e. creating further duplicated rows).
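
To illustrate why the second join condition matters, here is a minimal sketch using Python's built-in sqlite3 module. The table names (events, page_context), the page_id column and the sample rows are made up for the demo; only event_id, collector_tstamp, root_id and root_tstamp come from the rule above. Your warehouse and table names will differ.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one duplicated event_id (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (event_id TEXT, collector_tstamp TEXT);
        CREATE TABLE page_context (root_id TEXT, root_tstamp TEXT, page_id TEXT);

        -- The same event_id arrives twice with different collector timestamps.
        INSERT INTO events VALUES
            ('e1', '2018-04-12 10:00:00'),
            ('e1', '2018-04-12 10:00:05');
        INSERT INTO page_context VALUES
            ('e1', '2018-04-12 10:00:00', 'p1'),
            ('e1', '2018-04-12 10:00:05', 'p1');
        """
    )
    return conn


def join_on_id_only(conn):
    """Join on event_id alone: every duplicate matches every context row."""
    return conn.execute(
        """
        SELECT e.event_id, e.collector_tstamp, c.page_id
        FROM events AS e
        JOIN page_context AS c
          ON e.event_id = c.root_id
        """
    ).fetchall()


def join_on_id_and_tstamp(conn):
    """Join on event_id and collector_tstamp: each row matches only its own context."""
    return conn.execute(
        """
        SELECT e.event_id, e.collector_tstamp, c.page_id
        FROM events AS e
        JOIN page_context AS c
          ON e.event_id = c.root_id
         AND e.collector_tstamp = c.root_tstamp
        """
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(len(join_on_id_only(conn)))        # 4 rows: 2 events x 2 context rows
    print(len(join_on_id_and_tstamp(conn)))  # 2 rows: one per original event row
```

Note that the timestamp condition only stops the join from multiplying rows. The two duplicated events are still both present and need to be handled separately.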

My first port of call in finding duplicates is to count them - how many duplicated event_ids do you have as a proportion of the total number of events? If this is relatively low, you can just take measures to exclude them from your queries. If it’s a lot, it’s worth both investigating the cause and deduplicating the data (see the above article).
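
As a rough sketch of that counting step, again using sqlite3 and a hypothetical events table with an event_id column, something like the following reports how many event_ids appear more than once and what share of all events that represents. The function name and return shape are my own choices for illustration; the same two queries should translate to most SQL warehouses with little change.

```python
import sqlite3


def duplicate_summary(conn):
    """Return (duplicated_ids, total_events, ratio) for a table named events."""
    total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]

    # Count the event_ids that occur on more than one row.
    duplicated_ids = conn.execute(
        """
        SELECT COUNT(*) FROM (
            SELECT event_id
            FROM events
            GROUP BY event_id
            HAVING COUNT(*) > 1
        )
        """
    ).fetchone()[0]

    # Guard against an empty table to avoid dividing by zero.
    ratio = duplicated_ids / total if total else 0.0
    return duplicated_ids, total, ratio
```

If the ratio is small, filtering the affected event_ids out of downstream queries is probably enough. If it is large, that points towards investigating the source.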

It’s unusual to see data that’s identical except for the IP address. Are the domain_sessionid and domain_userid always the same? If you’re using the JavaScript tracker on web, it might be helpful to use the in-browser inspector developed by Snowflake Analytics to see if this is an issue in tracking.

I hope this helps. Please do keep us updated on progress.

Best,
