Dealing with duplicate event IDs for joining contexts to atomic.events


#1

Hi guys,

We’re beginning to dabble in custom contexts more and we’re running into troubles when joining events to their shredded contexts. I’ve seen a few posts on how duplicate event_ids are caused and how to deal with them in Redshift/Enrichment but I haven’t seen any discussions about how to deal with them during analysis.

We’re reasonably comfortable that when joining shredded unstructured events to atomic.events, we’ll join accurately on event_id = root_id. Mostly because we can constrain unstructured events more tightly in a query:

SELECT event_name, count(*) AS rows, count(DISTINCT event_id) AS event_ids, count(DISTINCT event_fingerprint) AS event_fingerprints
FROM atomic.events e
INNER JOIN com_acme_custom_event_1 s
ON event_id = root_id
WHERE collector_tstamp BETWEEN CURRENT_DATE-31 AND CURRENT_DATE+1 AND root_tstamp BETWEEN CURRENT_DATE-31 AND CURRENT_DATE+1
AND event_name = 'custom_event'
GROUP BY 1
ORDER BY 2 DESC;

Which returns exactly what you would expect:

event_name	rows         	event_ids	event_fingerprints
custom_event	2663663872	5320837  	5367762

Yes, we get some wonky joins (see that row number), but when further constrained (excluding bots, tracking a certain app_id, filtering by hostname etc) nothing that impacts analysis too badly.

However when we take away the “event_name” constraint (e.g. as the case may be with custom contexts), Redshift joins the custom_event table to all the rows matching the event_id… including some that are completely different events:

SELECT event_name, count(*) AS rows, count(DISTINCT event_id) AS event_ids, count(DISTINCT event_fingerprint) AS event_fingerprints
FROM atomic.events e
INNER JOIN com_acme_custom_event_1 s
ON event_id = root_id
WHERE collector_tstamp BETWEEN CURRENT_DATE-31 AND CURRENT_DATE+1 AND root_tstamp BETWEEN CURRENT_DATE-31 AND CURRENT_DATE+1
GROUP BY 1
ORDER BY 2 DESC;

Out of the above query I’ll normally get something like this:

event_name	rows         	event_ids	event_fingerprints
custom_event	2663663872	5320837   	5367762
other_event	21337975	21            	14058
page_view	5544808  	25            	6365
page_error	52831      	5              	72
another_event	21855      	2             	64
last_event	1       	1            	1

Custom event rows jump a lot, and it starts to match “other_event”, “page_view” and “page_error” events. Obviously if I were to do a simple event_id = root_id join for custom contexts, they may be subject to joining across many events for which the context wasn’t attached.

How can we constrain contexts so they join back to the precise event they occurred on?

Should we be joining contexts on event_id AND collector_tstamp?

Would event fingerprints be a better key to join the tables in in future?

Cheers,
Rob


De-deduplicating events in Hadoop and Redshift [tutorial]
Is there a way to join page ping events to page views?
#2

Hi Rob,

Let me start with a bit of background on duplicates (for people who are new to the issue) and what we can and plan to do about them:

  1. A small percentage of Snowplow events does not have a unique event ID. This is a known issue. It can cause problems in Redshift because (a) some joins become cartesian products and (b) the relation between some unstructured events and contexts and their parent event is lost when events are shredded. For more background on duplicates, I’d recommend reading the first 2 sections of this blogpost: http://snowplowanalytics.com/blog/2015/08/19/dealing-with-duplicate-event-ids/

  2. We are working on a permanent solution which we plan to release later in 2016.

  3. If you are running Snowplow 76 or later, and have enabled the event fingerprint enrichment, almost all duplicates will be caught in hadoop-shred (i.e. prior to shredding the events and loading them into Redshift). It doesn’t stop all duplicates yet because (a) it only de-duplicates natural duplicates (i.e. same event ID and event fingerprint) and (b) it doesn’t deduplicate across ETL runs. Nevertheless, we have seen this reduce the number of duplicates in Redshift by 95 to 99%, with a few exceptions (clients with a lot of synthetic duplicates).

  4. To de-duplicate the events that are already in Redshift, I’d recommend looking into our deduplication queries: http://snowplowanalytics.com/blog/2015/10/15/snowplow-r72-great-spotted-kiwi-released/#deduplication

So, to answer your specific question. I’d still recommend upgrading to the latest version of Snowplow (if you haven’t done so already), to reduce the number of duplicates going forward.

That would indeed be my first recommendation. How does this affect the numbers you shared?

That’s not something we plan to do, as we are working on a solution that can handle all duplicates before they are shredded and loaded into Redshift.

Hope this helps!

Christophe


#3

Thanks @christophe - yep joining on both tstamp and id helps quite a bit. As suspected it removes all but a small handful of other events. Along with proper constraints, I bet this will be negligible in no time.

We’re on the latest batch flow of Snowplow and loving the event_fingerprint deduplication. It’s reduced duplicates for us substantially.

Going to have to read up on the Redshift deduping.


#4

That’s good to hear! Duplicate events almost never arrive at the exact same time, which is why joining on both collector_tstamp and event_did works so well for duplicates that were already loaded into Redshift.

Let me know if you have any questions about the deduplication queries.