Duplicate event ids

Thinking about this a little further, I should add some detail.

Sounds like a tracking issue.

This isn’t necessarily the case. As the “dealing with duplicates” post details, there are several potential causes of duplicates, and it can be very hard to pin down which one applies here, as it could be something external to Snowplow. Case in point: there’s an open issue with the UUID generation in some browsers.

It is also possible that this is caused by bot traffic: bots and crawlers operating on your website can cause behaviour like this. Checking the useragent might give you a clue if this is the case, but beware that you’ll only see something obvious if it’s a ‘friendly’ bot (like a Google bot or something). Many bots will fake the useragent.
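As a rough illustration of the useragent check, here is a minimal Python sketch. The function name and the pattern list are my own assumptions, not anything shipped with Snowplow, and it will only catch bots that identify themselves honestly.

```python
import re

# Illustrative substrings only (an assumption, not an authoritative bot list).
# Bots that fake a browser useragent will not match any of these.
FRIENDLY_BOT_PATTERN = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)


def looks_like_friendly_bot(useragent):
    """Return True if the useragent string self-identifies as a bot."""
    if not useragent:
        # Missing or empty useragent: nothing to match against.
        return False
    return FRIENDLY_BOT_PATTERN.search(useragent) is not None
```

Running this over the useragent values of the events that share an event id should show quickly whether a self-declared crawler is behind them. A negative result doesn’t rule out bot traffic.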

In terms of dealing with them, various releases of the pipeline have in-built deduplication: release 76 introduced in-batch deduplication of natural duplicates, release 86 added in-batch deduplication of synthetic duplicates, and release 88 added cross-batch deduplication of natural duplicates.

(In the blog post above, natural duplicates are referred to as endogenous, and synthetic duplicates as exogenous.)
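To see which kind of duplicate you are dealing with, a sketch like the one below can help. It assumes each event is a dict with an `event_id` and an `event_fingerprint` key, and that natural duplicates share both values while synthetic duplicates share only the event id. The function name is hypothetical, and this is not the pipeline’s own deduplication code.

```python
from collections import defaultdict


def classify_duplicates(events):
    """Split duplicated event ids into (natural, synthetic) lists.

    Assumption: natural duplicates share event_id and event_fingerprint;
    synthetic duplicates share event_id but have different fingerprints.
    """
    by_id = defaultdict(list)
    for event in events:
        by_id[event["event_id"]].append(event)

    natural, synthetic = [], []
    for event_id, group in by_id.items():
        if len(group) < 2:
            continue  # this event id is unique in the batch
        fingerprints = {e["event_fingerprint"] for e in group}
        if len(fingerprints) == 1:
            natural.append(event_id)
        else:
            # Any mix of fingerprints under one id is treated as synthetic.
            synthetic.append(event_id)
    return natural, synthetic
```

If most of the duplicated ids come back as synthetic, that points more towards bots or the browser UUID issue than towards the same event being sent twice.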