Duplicate event ids


#1

Hi,

We use the Snowplow tracker library for event tracking. We are seeing records with the same event_id and network_userid combination. All fields, including these two, are identical across the duplicate records except for user_ipaddress. How can this be possible? Please help.

Thanks!


#2

Have you run a query with collector_tstamp as well?


#3

Hi @Geetha,

It is normal to see a small number of duplicates - the Snowplow pipeline is designed to avoid losing any data, so it errs on the side of including a duplicate event rather than risk excluding a valid event. When using the data, it is generally easier to deal with duplicates than with missing data.

This article, which explains duplicates, might be helpful in understanding this. More recent versions of the pipeline will deduplicate events, but enabling deduplication of events across batches involves setting up a DynamoDB table, which will incur additional cost (there aren’t normally many cross-batch duplicates). If the suggestions below don’t lead to a solution, please post which version of the pipeline and which collector you’re using, as well as which trackers you’re seeing this data come from.

As @jrpeck1989 mentioned, a rule of thumb is to always join on event_id = root_id AND collector_tstamp = root_tstamp to avoid a cartesian product in your join (i.e. creating further duplicated rows).
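To illustrate the join rule, here is a minimal sketch using Python's built-in sqlite3 module. The table name example_context and the sample rows are made up for the illustration; only event_id, collector_tstamp, root_id and root_tstamp come from the advice above. Your warehouse and table names will differ.

```python
import sqlite3

# In-memory database with one duplicated event_id (two rows each side).
# "example_context" is a hypothetical context table, not a real Snowplow table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id TEXT, collector_tstamp TEXT);
    CREATE TABLE example_context (root_id TEXT, root_tstamp TEXT);
    INSERT INTO events VALUES
        ('e1', '2017-01-01 00:00:01'),
        ('e1', '2017-01-01 00:00:05');
    INSERT INTO example_context VALUES
        ('e1', '2017-01-01 00:00:01'),
        ('e1', '2017-01-01 00:00:05');
""")

def count_rows(join_condition):
    """Count the rows produced by joining events to the context table."""
    sql = (
        "SELECT COUNT(*) FROM events e "
        "JOIN example_context c ON " + join_condition
    )
    return conn.execute(sql).fetchone()[0]

# Joining on the id alone pairs every duplicate with every duplicate.
id_only = count_rows("e.event_id = c.root_id")

# Adding the timestamp keeps each event row matched to its own context row.
id_and_tstamp = count_rows(
    "e.event_id = c.root_id AND e.collector_tstamp = c.root_tstamp"
)

print(id_only, id_and_tstamp)
```

With two duplicated rows on each side, the id-only join returns four rows while the id-plus-timestamp join returns two. Note that this only helps when the duplicates have different collector_tstamp values.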

My first port of call in finding duplicates is to count them - how many duplicated event_ids do you have as a proportion of the total number of events? If this is relatively low, you can just take measures to exclude them from your queries. If it’s a lot, it’s worth both investigating the cause and deduplicating the data (see the above article).
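A minimal sketch of that count, again using Python's sqlite3 with made-up sample rows. It assumes a table called events with an event_id column; the same two queries should carry over to most SQL warehouses with little change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT, collector_tstamp TEXT)")

# Invented sample data: 'a' appears three times, 'c' twice, 'b' once.
sample_rows = [
    ("a", "t1"), ("a", "t2"), ("a", "t3"),
    ("b", "t1"),
    ("c", "t1"), ("c", "t4"),
]
conn.executemany("INSERT INTO events VALUES (?, ?)", sample_rows)

# Total rows versus distinct event_ids.
total, distinct_ids = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT event_id) FROM events"
).fetchone()

# Share of rows that are surplus copies of an already-seen event_id.
duplicate_share = (total - distinct_ids) / total

# The most heavily duplicated event_ids, worst first.
worst_offenders = conn.execute("""
    SELECT event_id, COUNT(*) AS n
    FROM events
    GROUP BY event_id
    HAVING COUNT(*) > 1
    ORDER BY n DESC, event_id
""").fetchall()

print(total, distinct_ids, duplicate_share, worst_offenders)
```

Looking at the worst offenders first is often the quickest way to spot a pattern, for example a single event_id repeated many times.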

It’s unusual to see data that’s identical except for the IP address. Are the domain_sessionid and domain_userid always the same? If you’re using the JavaScript tracker on web, it might be helpful to use the in-browser inspector developed by Snowflake Analytics to see if this is an issue in tracking.
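One way to check which fields actually differ within each group of duplicates is sketched below. The function name varying_fields and the sample rows are hypothetical; it assumes the duplicated rows have been exported as a list of dictionaries keyed by column name.

```python
from collections import defaultdict

def varying_fields(rows, key="event_id"):
    """For each duplicated key, list the fields whose values differ."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = {}
    for event_id, group in groups.items():
        if len(group) < 2:
            continue  # not a duplicate
        # A field "varies" if the group holds more than one distinct value.
        result[event_id] = sorted(
            field for field in group[0]
            if len({r.get(field) for r in group}) > 1
        )
    return result

# Invented sample rows for illustration only.
sample_rows = [
    {"event_id": "e1", "domain_userid": "u1",
     "user_ipaddress": "10.0.0.1", "collector_tstamp": "t1"},
    {"event_id": "e1", "domain_userid": "u1",
     "user_ipaddress": "10.0.0.2", "collector_tstamp": "t2"},
    {"event_id": "e2", "domain_userid": "u2",
     "user_ipaddress": "10.0.0.3", "collector_tstamp": "t3"},
]

differences = varying_fields(sample_rows)
print(differences)
```

If domain_userid or domain_sessionid show up in the output, the rows are probably different events sharing an id rather than copies of the same event.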

I hope this helps, please do keep us updated on progress.

Best,


#4

Thanks for all the responses. Collisions could happen, but for some events we see as many as 110 duplicates, all with the same dtm value! dvce_sent_tstamp (stm) is different though. collector_tstamp is different too for these events. Is that possible?


#5

Yes, we are seeing a really high number of duplicates this time around. Sometimes the IP address is different and sometimes it is the same too, but stm is different and so is the collector_tstamp.

In our case, count(*) on events is double count(distinct event_id).


#6

dvce_sent_tstamp (stm) is different though. collector_tstamp is different too for these events. Is that possible?

Sounds like a tracking issue. I’d take a look at your tracking setup and figure out where events are being sent twice. If it’s web tracking, the inspector I mentioned in my last response is the best option; otherwise you could set up a Snowplow Mini instance and debug from there.

I would also check the actual tracking code - are you calling the track method twice? What tracker(s) are you using? Is the duplicated data coming from one particular tracker?


#7

Thinking about this a little further, I should add some detail.

Sounds like a tracking issue.

This isn’t necessarily the case, actually. As the post on dealing with duplicates details, there are several potential causes of duplicates. In this case, it can be very hard to pin down the cause, as it could be something external to Snowplow. Case in point - there’s an open issue with the UUID generation in some browsers.

It is also possible that this is caused by bot traffic - bots and crawlers operating on your website can cause behaviour like this. Checking the useragent might give you a clue if this is the case, but beware that you’ll only see something obvious if it’s a ‘friendly’ bot (like a Google bot or something). Many bots will fake the useragent.
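A rough sketch of that useragent check follows. The marker list and function names are my own assumptions and are far from exhaustive; as noted above, this only catches bots that identify themselves.

```python
# Assumed, non-exhaustive substrings that self-declared bots tend to include.
BOT_MARKERS = ("bot", "crawler", "spider")

def looks_like_declared_bot(useragent):
    """True if the useragent string openly identifies itself as a bot."""
    ua = (useragent or "").lower()
    return any(marker in ua for marker in BOT_MARKERS)

def declared_bot_share(rows):
    """Fraction of rows whose useragent looks like a self-declared bot."""
    if not rows:
        return 0.0
    flagged = sum(
        1 for row in rows if looks_like_declared_bot(row.get("useragent"))
    )
    return flagged / len(rows)

# Invented sample of duplicated rows for illustration.
duplicate_rows = [
    {"useragent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                  "+http://www.google.com/bot.html)"},
    {"useragent": "Mozilla/5.0 (compatible; bingbot/2.0; "
                  "+http://www.bing.com/bingbot.htm)"},
    {"useragent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36"},
    {"useragent": None},
]

share = declared_bot_share(duplicate_rows)
print(share)
```

A high share among the duplicated rows compared with the rest of your traffic would point towards bots; a low share does not rule them out.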

In terms of dealing with them, various releases of the pipeline have in-built deduplication - release 76 introduced in-batch deduplication of natural duplicates, 86 added in-batch deduplication of synthetic duplicates, and 88 added cross-batch deduplication of natural duplicates.

(In the blog post above, natural duplicates are referred to as endogenous, and synthetic duplicates as exogenous.)
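To make the natural/synthetic distinction concrete, here is a simplified sketch of in-batch deduplication. This is not the pipeline's actual implementation: the fingerprint function, the list of ignored fields and the function names are all assumptions for illustration. It treats rows with the same event_id and the same fingerprint as natural duplicates (keeping one), and flags event_ids that appear with more than one fingerprint as synthetic.

```python
import hashlib
import json

# Assumption: fields expected to differ between resends of the same event.
IGNORED_FIELDS = {"collector_tstamp", "dvce_sent_tstamp", "user_ipaddress"}

def fingerprint(event):
    """Hash of the event with the transport-level fields left out."""
    payload = {k: v for k, v in event.items() if k not in IGNORED_FIELDS}
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedupe_batch(events):
    """Drop natural duplicates; report event_ids with synthetic duplicates."""
    seen = set()
    fingerprints_by_id = {}
    kept = []
    for event in events:
        key = (event["event_id"], fingerprint(event))
        if key in seen:
            continue  # natural duplicate: same id, same content
        seen.add(key)
        fingerprints_by_id.setdefault(event["event_id"], set()).add(key[1])
        kept.append(event)
    synthetic_ids = {
        eid for eid, fps in fingerprints_by_id.items() if len(fps) > 1
    }
    return kept, synthetic_ids

# Invented batch: e1 is resent unchanged, e2 is reused by a different event.
batch = [
    {"event_id": "e1", "page_url": "/home", "collector_tstamp": "t1"},
    {"event_id": "e1", "page_url": "/home", "collector_tstamp": "t2"},
    {"event_id": "e2", "page_url": "/a", "collector_tstamp": "t3"},
    {"event_id": "e2", "page_url": "/b", "collector_tstamp": "t4"},
]

kept, synthetic_ids = dedupe_batch(batch)
print(len(kept), synthetic_ids)
```

Here the repeated e1 row is dropped, while both e2 rows are kept and e2 is reported as a synthetic duplicate that needs separate handling.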