Bulk Postgres Loader Deduplication

I am reading the store data and sending events with different event IDs and different event fingerprints, but the ETL timestamp could be the same. Would such events be treated as duplicates?

EventId  EventFingerPrint  etl_tstamp
E1       EFP1              T1
E2       EFP2              T1
E3       EFP3              T2

EventId  EventFingerPrint  etl_tstamp
E3       EFP3              T1
E4       EFP4              T2

@ScalaEnthu, there is nothing wrong with having the same ETL timestamp - it simply indicates the events were processed in the pipeline at the same time (same batch). Different event IDs and payloads (event fingerprints) mean different events.

When talking about duplicates, it is important to distinguish between what we call natural and synthetic duplicates. They have different causes.

Natural duplicates are most frequently a byproduct of the tracker re-sending events when it has failed to receive confirmation that they reached the collector. This is done to minimise the risk of data loss. The result can be events with the same event_id and the same payload (event_fingerprint) but a different collector_tstamp.

A similar result can occur in the real-time pipeline itself due to at-least-once processing semantics. Again, at-least-once processing is used to avoid data loss, at the cost of occasionally delivering the same event more than once.

Synthetic duplicates are events that have the same event_id but different payloads. In other words, these are not duplicate events (the payload - event_fingerprint - is different), but rather collisions in the UUID for the event_id field.
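To make the distinction concrete, here is a minimal Python sketch that labels events by comparing event_id and event_fingerprint. The function name classify_duplicates and the tuple layout are assumptions made for illustration; this is not how the Snowplow pipeline or the Postgres loader implements deduplication. The sample rows are the ones from the question, read as two batches (also an assumption).

```python
from collections import defaultdict


def classify_duplicates(events):
    """Label each event_id as "unique", "natural" or "synthetic".

    events: iterable of (event_id, event_fingerprint) pairs.
    Hypothetical helper for illustration only.
    """
    # Collect every fingerprint seen for each event_id.
    fingerprints_by_id = defaultdict(list)
    for event_id, fingerprint in events:
        fingerprints_by_id[event_id].append(fingerprint)

    labels = {}
    for event_id, fingerprints in fingerprints_by_id.items():
        if len(fingerprints) == 1:
            # The ID was seen only once: not a duplicate.
            labels[event_id] = "unique"
        elif len(set(fingerprints)) == 1:
            # Same ID and same payload: natural duplicate.
            labels[event_id] = "natural"
        else:
            # Same ID but at least two different payloads: ID collision.
            labels[event_id] = "synthetic"
    return labels


# Rows from the question as (EventId, EventFingerPrint); etl_tstamp is
# deliberately ignored because it plays no part in the classification.
sample = [
    ("E1", "EFP1"),
    ("E2", "EFP2"),
    ("E3", "EFP3"),
    ("E3", "EFP3"),
    ("E4", "EFP4"),
]
labels = classify_duplicates(sample)
```

Under these assumptions E1, E2 and E4 come out as unique even though some share an etl_tstamp, while E3 is labelled a natural duplicate because both its event_id and its event_fingerprint repeat.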

In summary, duplicated events arise for the following reasons:

  1. Client-side environment causes events to be sent in with the same ID
  2. Events are sometimes duplicated within the Snowplow pipeline itself