Event ID or event fingerprint?


#1

I’m streaming events into BigQuery, which accepts an insertId per row to de-duplicate on insert. Should I use event_fingerprint or event_id as the insertId?

Is event_id still around for legacy reasons?

Thanks


#2

Hi @Shin,

There are 2 classes of duplicates.

Either the same event was sent in twice or duplicated somewhere in the pipeline, or a different event was sent in with an event ID that had been used before (these are often - but not always - sent in by bots).

We generate an event fingerprint during enrichment to be able to distinguish between the two. In the former case, 2 or more rows will have the same event ID and event fingerprint. In the latter, the rows will have the same event ID but a different fingerprint.
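To make the distinction concrete, here is a minimal Python sketch that labels duplicated event IDs by comparing fingerprints. It assumes each row is a dict with `event_id` and `event_fingerprint` keys; the function name and the labels `same_event` and `id_collision` are made up for illustration and are not part of the pipeline.

```python
from collections import defaultdict


def classify_duplicates(rows):
    """Label every event_id that appears on more than one row.

    Hypothetical helper: rows are dicts with "event_id" and
    "event_fingerprint" keys (an assumption for this sketch).
    """
    fingerprints_by_id = defaultdict(list)
    for row in rows:
        fingerprints_by_id[row["event_id"]].append(row["event_fingerprint"])

    labels = {}
    for event_id, fingerprints in fingerprints_by_id.items():
        if len(fingerprints) < 2:
            continue  # this event ID is unique, nothing to classify
        if len(set(fingerprints)) == 1:
            # Same ID, same fingerprint: the same event arrived more than once.
            labels[event_id] = "same_event"
        else:
            # Same ID, differing fingerprints: different events reused the ID.
            labels[event_id] = "id_collision"
    return labels
```

Note that an event ID with a mix of identical and differing fingerprints is labelled `id_collision` in this sketch, since at least two distinct events share the ID.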

I recommend reading this guide for more background information on duplicates: De-deduplicating events in Hadoop and Redshift [tutorial]

As for streaming events into BigQuery: it’s a good question, and it depends. If you use the event ID as the insertId, you will remove both kinds of duplicates, including rows that share an event ID but are actually different events. If you use the event fingerprint, you’ll deduplicate the first kind but not the second. Does that make sense?
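The trade-off can be sketched in Python by simulating which rows survive under each choice of key. This is only an illustration under stated assumptions: the helper names and the sample rows are hypothetical, and the simulation treats de-duplication as exact, whereas BigQuery's insertId de-duplication is best-effort.

```python
def insert_id_for(row, strategy):
    """Pick the value that would be sent as the row's insertId (hypothetical helper)."""
    if strategy not in ("event_id", "event_fingerprint"):
        raise ValueError(f"unknown strategy: {strategy}")
    return row[strategy]


def simulate_dedup(rows, strategy):
    """Keep only the first row seen for each insertId (idealised, exact dedup)."""
    seen = set()
    kept = []
    for row in rows:
        key = insert_id_for(row, strategy)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Made-up sample data: "a" and "b" are the same event sent twice,
# "c" is a different event that reuses the event ID, "d" is unrelated.
example_rows = [
    {"name": "a", "event_id": "e1", "event_fingerprint": "f1"},
    {"name": "b", "event_id": "e1", "event_fingerprint": "f1"},
    {"name": "c", "event_id": "e1", "event_fingerprint": "f2"},
    {"name": "d", "event_id": "e2", "event_fingerprint": "f3"},
]

kept_by_id = [r["name"] for r in simulate_dedup(example_rows, "event_id")]
kept_by_fingerprint = [r["name"] for r in simulate_dedup(example_rows, "event_fingerprint")]
```

With the event ID as the key, row "c" is dropped even though it is a distinct event; with the fingerprint as the key, "c" is kept and only the true duplicate "b" is removed.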

Our plan to address duplicates is to fully deduplicate the first kind, but give new event IDs to the second (since these are different events).

Hope this helps,

Christophe