Duplicate "event_id" fields?


#1

Hi all,

I’m using PhantomJS to ping the Snowplow pipeline we’ve set up (JS tracker on test website -> Scala collector -> Kinesis -> Scala Enricher -> Kinesis -> Proprietary data loader using the SDK to parse Kinesis records).

The “pinger” works in a loop: it sends a new request once the collector has responded to the previous one.

Lately, I’ve been seeing inconsistencies in the number of results sent and received, and on closer inspection it was because of duplicate "event_id"s.

I’ve read the doc here: http://snowplowanalytics.com/blog/2015/08/19/dealing-with-duplicate-event-ids/#deduplicating-the-event-id - however, I’d still like to ask how that is possible, given the IDs are in essence UUIDs.

I’d like to hear your ideas.

Thank you!
Victor.


#2

Hi @vivricanopy - can you clarify exactly how you are doing the reconciliation?

It sounds like you are sending events with known event IDs via PhantomJS and then finding duplicates for those event IDs downstream - post-Kinesis, presumably in your ultimate storage target?

Is that correct?


#3

Hi @alex,

We’re de facto implementing “first through the gate” using a primary key in a SQL database. We don’t set the event IDs in the tracker ourselves, but let the default implementation set them. We also haven’t tried probing them yet to diagnose the matter - I think I will now.
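For readers unfamiliar with the “first through the gate” pattern: a primary key on `event_id` means the first insert wins and any later row with the same ID is rejected. A minimal sketch using SQLite (illustrative only - the actual loader and schema here are not shown in the thread):

```python
import sqlite3

# "First through the gate" dedup: event_id is the primary key, so the
# first insert succeeds and later duplicates are silently ignored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, payload TEXT)")

def load_event(event_id, payload):
    """Insert an event; return True if it was new, False if a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO events (event_id, payload) VALUES (?, ?)",
        (event_id, payload),
    )
    return cur.rowcount == 1

print(load_event("e1", "page_view"))     # True  - first through the gate
print(load_event("e1", "struct_event"))  # False - duplicate event_id dropped
```

Note that this silently discards the *second* event even when it is a genuinely different event that happened to get the same ID - which is exactly why duplicate UUIDs are worth diagnosing.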


#4

Right, thanks for clarifying.

In terms of duplicates - there are various sources, but a likely culprit in your case is the fact that Kinesis provides at-least-once processing, meaning that any Kinesis worker (e.g. Stream Enrich) can introduce duplicates when it retries or restarts.
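To make the at-least-once point concrete: when a worker checkpoint lags behind the records it already emitted, a restart replays those records downstream. A consumer can tolerate this by tracking IDs it has already seen. A toy sketch (not Snowplow code - the record shape and in-memory set are assumptions; a real consumer would use durable state):

```python
# At-least-once delivery means a record can arrive more than once.
# An idempotent consumer skips event_ids it has already processed.
seen = set()
processed = []

def handle(record):
    if record["event_id"] in seen:
        return  # replayed duplicate - drop it
    seen.add(record["event_id"])
    processed.append(record)

# Simulated stream where "a" is replayed after a worker restart.
stream = [{"event_id": "a"}, {"event_id": "b"}, {"event_id": "a"}]
for r in stream:
    handle(r)

print(len(processed))  # 2
```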


#5

@alex on further investigation, I’m getting one event_id per session - not per request. Is this the correct behaviour?


#6

No, that sounds like something strange is going on…


#7

Ok, thanks! I’ll investigate further and let you know. I also tried sending pings from a real browser, to rule out the possibility of poorly written PhantomJS code.


#8

@alex I think we’re seeing a similar problem with the batch pipeline when running cross-run event deduplication. We see collisions on the event_id between different Snowplow runs, but at the moment we’re not seeing any within a run.

For example, we have an event_id that is present in 14 events from 14 runs over a one-month period, which is the range available to the query. The fingerprints are all different, and in some cases the event is a struct event, in others a page_view.

Have I understood correctly that the event_id in the enriched events output is set to a random UUID? If this is the case, aren’t we seeing a surprising number of collisions? We have 128,550,173 event_ids, of which 128,544,903 are distinct - that’s 0.004% of IDs colliding.
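A back-of-the-envelope check (my own arithmetic, not from the thread) shows why this rate is indeed far too high for random UUIDs: a version-4 UUID carries 122 random bits, so the birthday approximation puts the expected number of collisions among n IDs at roughly n(n-1)/2 divided by 2^122:

```python
# Birthday-bound estimate of expected UUIDv4 collisions among n IDs.
n = 128_550_173
expected_collisions = n * (n - 1) / 2 / 2**122

# Observed: total event_ids minus distinct event_ids.
observed_collisions = 128_550_173 - 128_544_903

print(expected_collisions)  # on the order of 1e-21 - effectively zero
print(observed_collisions)  # 5270
```

So for truly random UUIDs the expected collision count is vanishingly small; seeing thousands strongly suggests a non-random source (e.g. bots or a weak UUID generator) rather than chance.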

The v_etl is spark-1.9.0-common-0.25.0.

Thanks
Gareth


#9

Hi @gareth,

It sounds like what you’re seeing are what we call synthetic duplicates: events with the same event ID but different fingerprints. These are often totally different events that, for some reason, got the same event ID. Bots are known to cause these kinds of duplicates, but we are also aware of some issues with UUID generation in JavaScript that can cause this to happen.

If a duplicate has the same event ID and fingerprint, we call them natural duplicates (in some sense, these are “true” duplicates).

Currently, the in-batch deduplication process removes both natural and synthetic duplicates, whilst the cross-batch deduplication process removes only natural duplicates.

We deduplicate natural duplicates by keeping the first and removing all others. We deduplicate synthetic duplicates by re-assigning a new event ID to each affected event.
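The two strategies above can be sketched as follows (illustrative only - this is not the actual Snowplow deduplication code, and the event/fingerprint dicts are assumed shapes):

```python
import uuid
from collections import defaultdict

def deduplicate(events):
    """Natural dups (same event_id AND fingerprint): keep the first.
    Synthetic dups (same event_id, different fingerprints): keep all,
    but re-assign each survivor a fresh event_id."""
    by_id = defaultdict(list)
    for e in events:
        by_id[e["event_id"]].append(e)

    out = []
    for eid, group in by_id.items():
        # Natural dedup: first event per fingerprint wins.
        first_per_fp = {}
        for e in group:
            first_per_fp.setdefault(e["fingerprint"], e)
        survivors = list(first_per_fp.values())
        if len(survivors) > 1:
            # Synthetic dedup: distinct events shared an ID, so mint new ones.
            for e in survivors:
                e["event_id"] = str(uuid.uuid4())
        out.extend(survivors)
    return out

events = [
    {"event_id": "e1", "fingerprint": "fpA"},  # natural dup of the next
    {"event_id": "e1", "fingerprint": "fpA"},
    {"event_id": "e2", "fingerprint": "fpB"},  # synthetic dup pair
    {"event_id": "e2", "fingerprint": "fpC"},
]
result = deduplicate(events)
print(len(result))  # 3 - one e1 survives; both e2 events kept with new IDs
```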

Hope this answers your question!


#10

Yes it does thanks. I’d not properly taken that in.