Event ID or event fingerprint?

christophe · August 17, 2016, 5:16pm

There are two classes of duplicates.

Either an event was sent in twice or duplicated somewhere in the pipeline, or a different event was sent in with an event ID that was used before (these are often - but not always - sent in by bots).

We generate an event fingerprint during enrichment to be able to distinguish between the two. In the former case, two or more rows will have the same event ID and the same event fingerprint. In the latter, the rows will have the same event ID but different fingerprints.
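To make the distinction concrete, here is a minimal sketch that labels each duplicated event ID by comparing fingerprints. It assumes rows are plain dicts with `event_id` and `event_fingerprint` keys; the function name `classify_duplicates` and the labels are made up for illustration and are not part of any pipeline component.

```python
from collections import defaultdict


def classify_duplicates(rows):
    """Label every event ID that appears more than once.

    Assumes each row is a dict with "event_id" and "event_fingerprint"
    keys (an assumption for this sketch).
    """
    fingerprints_by_id = defaultdict(list)
    for row in rows:
        fingerprints_by_id[row["event_id"]].append(row["event_fingerprint"])

    labels = {}
    for event_id, fingerprints in fingerprints_by_id.items():
        if len(fingerprints) < 2:
            continue  # event ID seen once, so not a duplicate
        distinct = len(set(fingerprints))
        if distinct == 1:
            # First kind: same event ID and same fingerprint.
            labels[event_id] = "same_event"
        elif distinct == len(fingerprints):
            # Second kind: same event ID, different fingerprints.
            labels[event_id] = "id_collision"
        else:
            # Both kinds occur under the same event ID.
            labels[event_id] = "mixed"
    return labels


rows = [
    {"event_id": "a", "event_fingerprint": "f1"},
    {"event_id": "a", "event_fingerprint": "f1"},
    {"event_id": "b", "event_fingerprint": "f2"},
    {"event_id": "b", "event_fingerprint": "f3"},
    {"event_id": "c", "event_fingerprint": "f4"},
]
labels = classify_duplicates(rows)
```

With the sample rows above, `a` falls in the first class, `b` in the second, and `c` is not reported because it is unique.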

I recommend reading this guide for more background information on duplicates: De-deduplicating events in Hadoop and Redshift [tutorial]

As for streaming events into BigQuery: it’s a good question, and it depends on which duplicates you want to remove. If you deduplicate on the event ID, you will remove both kinds of duplicates, including rows that share an event ID but are actually different events. If you deduplicate on the event fingerprint, you’ll remove the first kind but not the second. Does that make sense?
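The trade-off can be seen in a small sketch that keeps the first row for each key. This is plain Python for illustration only, not how BigQuery or the loader does it; the `dedupe` helper and the dict-shaped rows are assumptions.

```python
def dedupe(rows, key_field):
    """Keep the first row seen for each value of key_field."""
    seen = set()
    kept = []
    for row in rows:
        key = row[key_field]
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"event_id": "a", "event_fingerprint": "f1"},  # original event
    {"event_id": "a", "event_fingerprint": "f1"},  # first kind: true duplicate
    {"event_id": "b", "event_fingerprint": "f2"},  # original event
    {"event_id": "b", "event_fingerprint": "f3"},  # second kind: different event, reused ID
]

by_id = dedupe(rows, "event_id")
by_fingerprint = dedupe(rows, "event_fingerprint")
```

Keying on `event_id` leaves two rows and drops the distinct event that reused ID `b`. Keying on `event_fingerprint` leaves three rows: the true duplicate is gone, but two rows still share ID `b`.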

Our plan to address duplicates is to fully deduplicate the first kind, but give new event IDs to the second (since these are different events).
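A rough sketch of what that plan could look like, under several assumptions: rows are dicts, the first row seen with a given event ID keeps that ID, later rows with the same ID but a different fingerprint get a fresh UUID, and the old ID is kept in a hypothetical `original_event_id` field. This is only an illustration, not the actual implementation.

```python
import uuid


def resolve_duplicates(rows):
    """Drop true duplicates; give colliding events a new event ID."""
    seen_pairs = set()  # (event_id, event_fingerprint) pairs already emitted
    seen_ids = set()    # event IDs already claimed by an emitted row
    resolved = []
    for row in rows:
        pair = (row["event_id"], row["event_fingerprint"])
        if pair in seen_pairs:
            continue  # first kind: same ID and fingerprint, drop it
        seen_pairs.add(pair)

        new_row = dict(row)  # copy so the input rows are left untouched
        if row["event_id"] in seen_ids:
            # Second kind: a different event reusing an ID, so re-key it.
            # "original_event_id" is a made-up field name for this sketch.
            new_row["original_event_id"] = row["event_id"]
            new_row["event_id"] = str(uuid.uuid4())
        else:
            seen_ids.add(row["event_id"])
        resolved.append(new_row)
    return resolved


rows = [
    {"event_id": "a", "event_fingerprint": "f1"},
    {"event_id": "a", "event_fingerprint": "f1"},
    {"event_id": "a", "event_fingerprint": "f2"},
]
resolved = resolve_duplicates(rows)
```

Here the second row is dropped, and the third row survives with a new event ID while remembering that it originally arrived as `a`.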

Hope this helps,

Christophe
