@ScalaEnthu, there is nothing wrong with having the same ETL timestamp - it simply indicates the events were processed in the pipeline at the same time (same batch). Different event IDs and payloads (event fingerprints) mean different events.
When talking about duplicates, it is important to distinguish between what we call natural and synthetic duplicates. They have different causes.
Natural duplicates are most frequently a byproduct of the tracker re-sending events when it has failed to receive confirmation that they reached the collector. This is done to minimise the risk of data loss. The result could be events with the same `event_id` and the same payload (`event_fingerprint`) but with different collector timestamps.
A similar result could take place in the real-time pipeline itself due to its at-least-once processing semantics. Again, "at-least-once" processing is deployed to eliminate data loss.
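Because natural duplicates share both the `event_id` and the `event_fingerprint`, they can be collapsed by keeping one copy per (ID, fingerprint) pair. Below is a minimal Python sketch of that idea, assuming events are plain dicts carrying `event_id`, `event_fingerprint` and `collector_tstamp` fields; it is an illustration, not the pipeline's actual deduplication code.

```python
def dedupe_natural(events):
    """Collapse natural duplicates: events sharing both event_id and
    event_fingerprint are re-sends of the same event, so keep only the
    earliest copy (by collector_tstamp)."""
    seen = {}
    for ev in sorted(events, key=lambda e: e["collector_tstamp"]):
        key = (ev["event_id"], ev["event_fingerprint"])
        seen.setdefault(key, ev)  # first (earliest) copy wins
    return list(seen.values())

# Two copies of the same event (a tracker retry) plus one distinct event.
events = [
    {"event_id": "a1", "event_fingerprint": "f1", "collector_tstamp": 2},
    {"event_id": "a1", "event_fingerprint": "f1", "collector_tstamp": 1},
    {"event_id": "b2", "event_fingerprint": "f2", "collector_tstamp": 3},
]
deduped = dedupe_natural(events)
print(len(deduped))  # → 2
```

The key point is that the ID alone is not enough to deduplicate safely; the fingerprint confirms the payloads really are identical.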
Synthetic duplicates are events that have the same `event_id` but different payloads. In other words, these are not duplicate events (the payload, `event_fingerprint`, is different) but rather collisions in the UUID used as the `event_id`.
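Since synthetic duplicates are genuinely distinct events, the usual remedy is to keep all of them but give the colliding ones fresh UUIDs. Here is a hedged Python sketch of that approach, assuming natural duplicates have already been collapsed (otherwise re-sends would also get new IDs); again an illustration of the idea, not Snowplow's own implementation.

```python
import uuid

def remint_synthetic(events):
    """Resolve synthetic duplicates: events that share an event_id but
    carry different fingerprints are distinct events whose UUIDs
    collided. The first occurrence keeps its ID; each later event with
    an already-seen ID gets a fresh UUID."""
    seen_ids = set()
    out = []
    for ev in events:
        ev = dict(ev)  # don't mutate the caller's data
        if ev["event_id"] in seen_ids:
            ev["event_id"] = str(uuid.uuid4())  # mint a new, unique ID
        seen_ids.add(ev["event_id"])
        out.append(ev)
    return out

events = [
    {"event_id": "a1", "event_fingerprint": "f1"},
    {"event_id": "a1", "event_fingerprint": "f9"},  # same ID, different payload
]
fixed = remint_synthetic(events)
print(len({e["event_id"] for e in fixed}))  # → 2
```

Both events survive, but each now has a unique `event_id`, so downstream joins on the ID no longer conflate them.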
Thus, in summary, the following are the reasons for duplicated events:
- The client-side environment causes events to be sent with the same ID
- Events are sometimes duplicated within the Snowplow pipeline itself