Duplicate events, using event_id as partition_key

nacivida · October 20, 2017, 4:55pm

Hi,

We use Snowplow to collect and enrich events and then store them in BigQuery. We noticed duplicates, mostly the endogenous ones mentioned in this blog. The blog mentions using event_id as the partition key. I just upgraded the collector and enricher to 0.10.0 and 0.11.1 respectively, and I can see that partition_key now matches event_id. I use Python with the MultiLangDaemon to read enriched events from AWS Kinesis. Do I need to do anything there as well? The blog says: “We plan to partition the enriched event stream on event ID, then build a minimal-state deduplication engine as a library that can be embedded in KCL apps.” How do we make use of this?

Thanks,
Naci

alex · October 20, 2017, 9:09pm

Hi @nacivida,

We have a relatively sophisticated event deduplication engine powered by Spark and DynamoDB which is built into our RDB Loader:

At the moment this only works for Postgres and Redshift, and we have not yet extracted this capability out into its own Scala library that can be called from elsewhere.

Separately, we have started work on our port of Snowplow to GCP, which will include loading of BigQuery:

Over time, our BigQuery loader should include an equivalent deduplication engine, although in this case most likely backed by Cloud Bigtable, not DynamoDB. However, this is still some way off.
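
In the meantime, for a Python consumer, here is a minimal sketch of what partitioning on event_id makes possible: because records with the same event_id land on the same shard, a record processor can keep a small in-memory set of recently seen IDs and drop repeats before loading. This is not the Snowplow deduplication engine, only an illustration. The names `RecentIdFilter`, `dedupe_batch` and `EVENT_ID_INDEX` are hypothetical, and the position of event_id in the enriched TSV (index 6) is an assumption you should check against your enriched event format.

```python
from collections import OrderedDict

# Assumption: event_id is the 7th tab-separated field of an enriched event.
EVENT_ID_INDEX = 6


class RecentIdFilter:
    """Bounded, in-memory record of recently seen event IDs (LRU eviction)."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self._seen = OrderedDict()

    def is_duplicate(self, event_id):
        """Return True if event_id was seen recently, else remember it."""
        if event_id in self._seen:
            # Refresh the entry so frequently repeated IDs are kept longer.
            self._seen.move_to_end(event_id)
            return True
        self._seen[event_id] = None
        if len(self._seen) > self.capacity:
            # Evict the least recently seen ID to keep memory bounded.
            self._seen.popitem(last=False)
        return False


def dedupe_batch(lines, id_filter):
    """Return the enriched TSV lines whose event_id has not been seen yet."""
    unique = []
    for line in lines:
        fields = line.split("\t")
        if len(fields) <= EVENT_ID_INDEX:
            # Too few fields to find an event_id: pass the line through.
            unique.append(line)
            continue
        if not id_filter.is_duplicate(fields[EVENT_ID_INDEX]):
            unique.append(line)
    return unique
```

You would keep one filter per shard (for example, one per record processor instance) and call `dedupe_batch` on the decoded records of each batch. The state is lost on restart or when a shard moves to another worker, and this drops anything that shares an event_id, so it only suits endogenous duplicates; comparing the event fingerprint as well would be needed to tell those apart from events that merely collide on ID.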
