Duplicate events, using event_id as partition_key


#1

Hi,

We use Snowplow to collect and enrich events, then store them in BigQuery. We have noticed duplicates, mostly the endogenous kind described in this blog. The blog mentions using event_id as the partition_key. I just upgraded the collector and enricher to versions 0.10.0 and 0.11.1 respectively, and I can see that partition_key now matches event_id. I use Python with the MultiLangDaemon to read enriched events from AWS Kinesis. Do I need to do anything there as well? The blog also says: “We plan to partition the enriched event stream on event ID, then build a minimal-state deduplication engine as a library that can be embedded in KCL apps.” How do we make use of this?

Thanks,
Naci


#2

Hi @nacivida,

We have a relatively sophisticated event deduplication engine, powered by Spark and DynamoDB, which is built into our RDB Loader.

At the moment this only works for Postgres and Redshift, and we have not yet extracted it into a standalone Scala library that can be called from elsewhere.

Separately, we have started work on our port of Snowplow to GCP, which will include loading into BigQuery.

Over time, our BigQuery loader should include an equivalent deduplication engine, although in this case most likely backed by Cloud Bigtable, not DynamoDB. However, this is still some way off.
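
In the meantime, because the enriched stream is partitioned on event_id, duplicates of the same event land on the same shard, so a Python KCL consumer could drop them itself with a small in-memory cache. What follows is only a minimal sketch of that idea, not the deduplication engine described above. `RecentEventIds` and `dedupe` are hypothetical names, and the position of event_id in the enriched TSV line (index 6) is an assumption you should check against your own enriched events.

```python
from collections import OrderedDict


class RecentEventIds:
    """Bounded in-memory record of recently seen event IDs (hypothetical helper)."""

    def __init__(self, max_size=100_000):
        self.max_size = max_size
        self._seen = OrderedDict()

    def is_duplicate(self, event_id):
        """Return True if event_id was seen recently; otherwise remember it."""
        if event_id in self._seen:
            # Refresh the entry so frequently repeated IDs stay in the cache.
            self._seen.move_to_end(event_id)
            return True
        self._seen[event_id] = None
        if len(self._seen) > self.max_size:
            self._seen.popitem(last=False)  # evict the oldest entry
        return False


def dedupe(lines, seen, event_id_index=6):
    """Yield enriched TSV lines whose event_id has not been seen recently.

    event_id_index=6 is an assumption about the enriched event layout.
    """
    for line in lines:
        event_id = line.split("\t")[event_id_index]
        if not seen.is_duplicate(event_id):
            yield line
```

You would keep one `RecentEventIds` per record processor and pass each decoded batch of records through `dedupe` before writing to BigQuery. This only catches duplicates that reach the same worker while the ID is still in the cache. It does not survive restarts or shard reassignment, and it treats any repeated event_id as a duplicate even when the payload differs.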