Duplicate event ids

Hi @Geetha,

It is normal to see a small number of duplicates - the Snowplow pipeline is designed to avoid losing any data, so it errs on the side of including a duplicate event rather than risk excluding a valid one. When working with the data, it is generally easier to deal with duplicates than with missing data.

This article, which explains duplicates, might help in understanding this. More recent versions of the pipeline will deduplicate events, but enabling deduplication across batches involves setting up a DynamoDB table, which incurs additional cost (there aren’t normally many cross-batch duplicates). If the suggestions below don’t lead to a solution, please post which version of the pipeline and which collector you’re using, as well as which trackers you’re seeing this data come from.

As @jrpeck1989 referenced, a rule of thumb is to always join on event_id = root_id AND collector_tstamp = root_tstamp to avoid a Cartesian product in your join (i.e. creating further duplicated rows).
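To illustrate why the second join condition matters, here is a minimal sketch using Python's built-in sqlite3 module in place of your warehouse. The table name my_context and the sample rows are made up for illustration; events stands in for atomic.events. It compares the row count from joining on event_id alone with the count from joining on both keys.

```python
import sqlite3

# In-memory database standing in for the warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id TEXT, collector_tstamp TEXT);
CREATE TABLE my_context (root_id TEXT, root_tstamp TEXT, value TEXT);

-- 'e1' arrived twice with different collector timestamps.
INSERT INTO events VALUES
  ('e1', '2020-01-01 00:00:00'),
  ('e1', '2020-01-01 00:00:05'),
  ('e2', '2020-01-01 00:01:00');

-- Each event row has one matching row in the (hypothetical) context table.
INSERT INTO my_context VALUES
  ('e1', '2020-01-01 00:00:00', 'a'),
  ('e1', '2020-01-01 00:00:05', 'a'),
  ('e2', '2020-01-01 00:01:00', 'b');
""")


def count_joined_rows(join_condition: str) -> int:
    """Count the rows produced by joining events to my_context on the given condition."""
    sql = f"SELECT COUNT(*) FROM events e JOIN my_context c ON {join_condition}"
    return conn.execute(sql).fetchone()[0]


# Joining on the id alone pairs every 'e1' event row with every 'e1' context row.
id_only = count_joined_rows("e.event_id = c.root_id")

# Adding the timestamp pairs each event row with its own context row only.
id_and_tstamp = count_joined_rows(
    "e.event_id = c.root_id AND e.collector_tstamp = c.root_tstamp"
)

print("join on id only:", id_only)
print("join on id and timestamp:", id_and_tstamp)
```

Note that if two duplicates share the same collector_tstamp as well, the join will still fan out for those rows, so this reduces the problem rather than removing the need to deduplicate.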

My first port of call when investigating duplicates is to count them - how many duplicated event_ids do you have as a proportion of the total number of events? If this is relatively low, you can just take measures to exclude them from your queries. If it’s a lot, it’s worth both investigating the cause and deduplicating the data (see the article above).
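As a rough sketch of that counting step, the queries below compute the total number of rows, the number of distinct event_ids, and how many event_ids occur more than once. It again uses sqlite3 with made-up sample rows; against your own data you would run the same SQL on your events table, which is assumed here to be called events.

```python
import sqlite3

# In-memory database with sample rows (assumption for illustration).
dup_conn = sqlite3.connect(":memory:")
dup_conn.executescript("""
CREATE TABLE events (event_id TEXT, collector_tstamp TEXT);
INSERT INTO events VALUES
  ('e1', '2020-01-01 00:00:00'),
  ('e1', '2020-01-01 00:00:05'),
  ('e2', '2020-01-01 00:01:00'),
  ('e3', '2020-01-01 00:02:00'),
  ('e4', '2020-01-01 00:03:00');
""")

# Total rows and distinct event_ids in one pass.
total, distinct_ids = dup_conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT event_id) FROM events"
).fetchone()

# Number of event_ids that appear on more than one row.
duplicated_ids = dup_conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT event_id FROM events GROUP BY event_id HAVING COUNT(*) > 1
    ) AS dupes
""").fetchone()[0]

# Share of rows that are surplus copies of an event_id already seen.
surplus_share = (total - distinct_ids) / total

print("total rows:", total)
print("distinct event_ids:", distinct_ids)
print("event_ids seen more than once:", duplicated_ids)
print("share of surplus rows:", surplus_share)
```

What counts as "relatively low" is a judgment call for your data; the sketch only gives you the numbers to make it.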

It’s unusual to see data that’s identical except for the IP address. Are the domain_sessionid and domain_userid always the same? If you’re using the JavaScript tracker on web, it might be helpful to use the in-browser inspector developed by Snowflake Analytics to check whether this is a tracking issue.
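One way to answer that question is to count, for each duplicated event_id, how many distinct values each of those columns takes. The sketch below does this with sqlite3 and made-up rows; the column names user_ipaddress, domain_userid and domain_sessionid are the standard Snowplow ones, while the table contents are purely illustrative.

```python
import sqlite3

# In-memory database with sample rows (assumption for illustration).
check_conn = sqlite3.connect(":memory:")
check_conn.executescript("""
CREATE TABLE events (
    event_id TEXT,
    user_ipaddress TEXT,
    domain_userid TEXT,
    domain_sessionid TEXT
);
INSERT INTO events VALUES
  ('e1', '10.0.0.1', 'u1', 's1'),
  ('e1', '10.0.0.2', 'u1', 's1'),  -- same user and session, different IP
  ('e2', '10.0.0.3', 'u2', 's2'),
  ('e2', '10.0.0.4', 'u3', 's2'),  -- different user as well as different IP
  ('e3', '10.0.0.5', 'u4', 's3');  -- not duplicated, so not reported
""")

# For each duplicated event_id, count the distinct values of each column.
# A count of 1 means the column is identical across the duplicates.
rows = check_conn.execute("""
    SELECT event_id,
           COUNT(DISTINCT user_ipaddress)   AS ips,
           COUNT(DISTINCT domain_userid)    AS users,
           COUNT(DISTINCT domain_sessionid) AS sessions
    FROM events
    GROUP BY event_id
    HAVING COUNT(*) > 1
    ORDER BY event_id
""").fetchall()

for event_id, ips, users, sessions in rows:
    print(event_id, "ips:", ips, "users:", users, "sessions:", sessions)
```

If users and sessions are always 1 while ips is greater than 1, the duplicates come from the same browser session and only the IP address differs, which is the pattern worth reporting back here.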

I hope this helps. Please do keep us updated on progress.

Best,
