Duplicate events, using event_id as partition_key

alex · October 20, 2017, 9:09pm

We have a relatively sophisticated event deduplication engine powered by Spark and DynamoDB which is built into our RDB Loader:

At the moment this only works for Postgres and Redshift, and we have not yet extracted this capability out into its own Scala library that can be called from elsewhere.
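
To make the idea concrete, here is a minimal sketch of cross-batch deduplication against a key-value store. It is not the RDB Loader's actual code. The `Event` fields, the `SeenEvents` class (an in-memory stand-in for a DynamoDB table) and the `deduplicate` function are all hypothetical names chosen for illustration. The sketch assumes an event counts as a duplicate when the same `event_id` and `event_fingerprint` pair was already recorded by a different ETL run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    event_fingerprint: str
    etl_tstamp: str  # identifies the ETL run that processed the event


class SeenEvents:
    """In-memory stand-in for a key-value table such as DynamoDB (hypothetical)."""

    def __init__(self):
        self._rows = {}

    def put_if_absent(self, event):
        """Record the event; return True to keep it, False if it is a duplicate."""
        key = (event.event_id, event.event_fingerprint)
        # setdefault stores the timestamp only when the key is new,
        # mimicking a conditional write.
        stored = self._rows.setdefault(key, event.etl_tstamp)
        # The same key from the same ETL run is accepted, so re-running
        # a batch does not drop its own events.
        return stored == event.etl_tstamp


def deduplicate(events, seen):
    """Drop duplicates within the batch, then duplicates seen in earlier runs."""
    kept = []
    in_batch = set()
    for event in events:
        key = (event.event_id, event.event_fingerprint)
        if key in in_batch:
            continue  # duplicate inside the current batch
        in_batch.add(key)
        if seen.put_if_absent(event):
            kept.append(event)
    return kept
```

In a real deployment the `put_if_absent` step would be a conditional write against the remote table, so that concurrent workers cannot both accept the same event.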

Separately, we have started work on our port of Snowplow to GCP, which will include loading of BigQuery:

Over time, our BigQuery loader should include an equivalent deduplication engine, although in this case most likely backed by Cloud Bigtable, not DynamoDB. However, this is still some way off.
