De-duplicating events in Hadoop and Redshift [tutorial]

A question we often get is how to deal with duplicates. It’s possible – but rare – for 2 or more events to have the same event ID. This can, in some cases, cause problems in Redshift when an inner or left join becomes a cartesian product.

We have written about this topic a couple of times before, but below is an overview of our latest thinking on duplicates.

1. What is causing events to be duplicated?

Duplicates are possible because we set the event ID in the tracker rather than in the pipeline. We do it this way because it allows us to spot unusual events (like duplicates), but until R76, we didn’t prevent duplicates from being loaded into Redshift.

There are 2 main mechanisms responsible for creating duplicates (for a more detailed explanation, read the first 2 sections of this blog post).

The first reason is that we have designed the pipeline to minimise the risk of data loss. For example:

  • Trackers cache the event until confirmation has been received from the collector. However, if the collector logs the event but the tracker doesn’t receive the confirmation, it will send the event again, and the same event will appear twice in Redshift.

These events are true duplicates. The actual tracker payload that was sent to the collector is identical (fields like etl_tstamp that are set later on can still be different because events can arrive at different times).

The second reason, and this is limited to the JavaScript tracker, is that we don’t control the client-side environment in which the tracker runs. Some bots and spiders execute JavaScript but don’t return a real UUID. The result is that all events from such a spider or bot will have the same event ID. This issue is not limited to bots: we have come across rare cases where content blockers interfered with the JavaScript tracker and caused it to produce duplicates.

These synthetic duplicates (which are not duplicates in the true sense of the word) are less common, but because all events from such a user or bot share an event ID, it’s possible to see several thousand events (or more) with the same event ID.

2. De-duplication process in Hadoop

We released Snowplow R76 with a first version of the de-duplication process that runs on Hadoop. This process runs before events are shredded and stops between 95 and 99% of duplicates from being loaded into Redshift.

The current implementation partitions events on event ID and event fingerprint and deletes all but the first event in each partition. In other words, it de-duplicates all natural duplicates that are in the same batch.

It does not:

  • de-duplicate between batches
  • de-duplicate synthetic duplicates

We are working on 2 more releases that will cover these 2 cases – both will be released later in 2016.

3. How to deal with existing duplicates in Redshift?

The de-duplication process that runs on Hadoop stops most new duplicates from being loaded into Redshift, but it doesn’t remove existing duplicates.

Below are a couple of solutions to deal with existing duplicates.

3.1 Join on both event ID and collector timestamp

Most duplicates have the same event ID but a different collector timestamp (this is true for all kinds of duplicates). The easiest solution to get rid of cartesian products is therefore to join tables on both event_id = root_id and collector_tstamp = root_tstamp, as shown below.
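For example, joining atomic.events to a shredded context table would look like this (the context table name is only a placeholder for one of your own tables):

```sql
-- Join on both keys so that an event ID that occurs more than once
-- no longer produces a cartesian product.
-- atomic.com_acme_my_context_1 is an illustrative table name.
SELECT
  e.event_id,
  e.collector_tstamp,
  c.*
FROM atomic.events AS e
LEFT JOIN atomic.com_acme_my_context_1 AS c
  ON  e.event_id = c.root_id
  AND e.collector_tstamp = c.root_tstamp;
```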

To count the number of duplicates in each table: SELECT count, COUNT(*) FROM (SELECT event_id, COUNT(*) FROM ... GROUP BY 1) GROUP BY 1 ORDER BY 1
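Written out in full against atomic.events (for shredded tables, substitute the table name and group on root_id instead of event_id), that query looks like this:

```sql
-- For each "number of rows per event ID", count how many event IDs have it.
-- A result that only contains n_rows = 1 means the table has no duplicates.
SELECT n_rows, COUNT(*) AS n_event_ids
FROM (
  SELECT event_id, COUNT(*) AS n_rows
  FROM atomic.events
  GROUP BY 1
) AS counts
GROUP BY 1
ORDER BY 1;
```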

This can be used to assess which tables are affected (not all tables have duplicates). It’s also recommended to compare the results before and after updating the join to assess whether this is a good enough solution or not.

Related post: Dealing with duplicates when joining contexts to atomic.events

3.2 De-duplicate as part of the data modeling process

Loading atomic.events into Redshift is almost never the last step in the pipeline. Most Snowplow users have a data modeling process which transforms and aggregates events into a second set of tables, which are then made available to end users in the business.

If the number of events that are modelled with each run is small enough, it’s possible to deduplicate events before tables are joined. For example, one approach would be to partition on event ID (and/or event fingerprint) and use ROW_NUMBER (documentation) to pick the first within each partition.
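A minimal sketch of that approach, assuming the events for the current run have been staged in a table called scratch.events_this_run (the table name is an assumption made for this example):

```sql
-- Keep only the first row per event ID, ordered by collector timestamp.
-- scratch.events_this_run is a hypothetical staging table; drop or ignore
-- the row_n column in downstream steps.
SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY event_id
      ORDER BY collector_tstamp
    ) AS row_n
  FROM scratch.events_this_run
) AS deduped
WHERE row_n = 1;
```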

3.3 Remove duplicates using SQL

Another solution is to remove duplicates from atomic.events and related tables altogether. We released a set of SQL queries that do this. These can be run once or after each run (although doing so can be slow).

Here’s how to run them:

  • create a schema duplicates

  • create a table duplicates.events identical to atomic.events (table definition)
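For these first two steps, the SQL is roughly as follows (CREATE TABLE ... LIKE is shorthand; the linked table definition is the canonical DDL):

```sql
-- Schema and table that will receive the duplicate rows.
CREATE SCHEMA IF NOT EXISTS duplicates;
CREATE TABLE duplicates.events (LIKE atomic.events);
```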

The next couple of steps depend on whether all events were enriched with the event fingerprint enrichment or not. If the event fingerprint is never NULL, then it’s sufficient to use 01-events.sql.

If not, the following steps will need to be run twice: once using 01-events.sql, and a second time using 02-events-without-fingerprint.sql. The first will de-duplicate all events that have a fingerprint; the second will de-duplicate all events without using the fingerprint (which is a lot slower – consider resizing the Redshift cluster to speed things up).

The steps in both cases are the same:

  • steps (a) to (c) deal with natural duplicates: move all but the first to duplicates.events

  • step (d) is an optional step that moves all remaining duplicates to duplicates.events (this removes duplicates with a different fingerprint)

  • if running both 01-events.sql and 02-events-without-fingerprint.sql, don’t run step (d) until the second time

  • vacuum
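The final step is a standard Redshift maintenance command, run once rows have been moved out of atomic.events:

```sql
-- Reclaim disk space and re-sort the table after the deletes,
-- then refresh the planner statistics.
VACUUM atomic.events;
ANALYZE atomic.events;
```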

De-duplicating unstructured event and context tables requires a bit more work. We published an example to help with this process.

This will have to be repeated for all tables that have a significant number of duplicates (more than 1% of rows).

Let us know if you have any questions!


Thanks for posting this @christophe. Somewhat ironically, we spent some time cleaning up duplicate events in Redshift earlier today.

After some research, we ended up deciding on a hybrid approach to manage duplicates in our atomic.events table and our unstructured event / context tables that fits into our current pipeline quite well. I am curious if you have any thoughts on this approach.

  • To handle duplicates in atomic.events, we run the first two SQL queries provided immediately after data is loaded to Redshift (01-events.sql, and 02-events-without-fingerprint.sql)
  • Using Looker, we created PDTs doing a SELECT DISTINCT on each of our unstructured event / context tables and reference these only by joining from our deduped atomic.events table. This way, we only get a single row for each event / unique unstructured event / context in a given query
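For reference, the PDTs described in the second bullet are essentially just this (the context table name is illustrative):

```sql
-- One row per unique record in a context table, to be joined from the
-- de-duplicated atomic.events table.
SELECT DISTINCT *
FROM atomic.com_acme_my_context_1;
```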

We like this approach because it allows us to use the SQL provided by your team to manage the deduplication process for our events table (which will always be consistent with Snowplow’s definition), while not having to maintain scripts for each unstructured event / context table (which would be custom to us). Also, while we’re not tracking multiple entries in the same context table for a single event now, it gives us the flexibility to do so in the future.

Hi @tfinkel - that’s indeed another way to deal with duplicates in Redshift. Thanks for sharing!

Duplicate events are back on our radar, and we were hoping you could provide some guidance. Although we run both the 01-events.sql and 02-events-without-fingerprint.sql queries to de-dupe after each run of the batch pipeline, we’ve recently uncovered a swath of events with duplicate IDs in the events table.

Upon further investigation, these events differ only in one or more of the following fields: dvce_sent_tstamp, user_ipaddress, and useragent. It appears that most of these events were created before we upgraded to R76 (as mentioned above), however, we still need to figure out a way to deal with them.

One hypothesis is that these events are being sent multiple times by the tracker, both JavaScript + iOS. I’m unsure of the best solution to this issue; however, my current thinking is that it might make sense to loosen the restrictions on the event fingerprint to exclude these three fields and de-dupe that way.

Can you offer any guidance? Thanks!

Hi @tfinkel,

I think it indeed makes sense to loosen the restrictions on the event fingerprint (the one generated in SQL) a bit. This will let you keep the first, and move the others out of atomic.
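For illustration only, a loosened fingerprint generated in SQL might look something like this (the field list is an assumption and should mirror the one in 02-events-without-fingerprint.sql, minus the three fields that vary):

```sql
-- Illustrative only: hash a subset of fields that should be identical across
-- these duplicates, leaving out dvce_sent_tstamp, user_ipaddress and useragent.
-- The fields listed here are an example, not the canonical list.
SELECT
  event_id,
  MD5(
    COALESCE(domain_userid, '') ||
    COALESCE(domain_sessionidx::VARCHAR, '') ||
    COALESCE(page_urlpath, '') ||
    COALESCE(dvce_created_tstamp::VARCHAR, '')
  ) AS loosened_fingerprint
FROM atomic.events;
```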

It’d still be interesting to investigate what is generating these events. If it were just the dvce_sent_tstamp, then the most likely explanation would be that the tracker didn’t receive a response from the collector and sent the event again. However, a change in user_ipaddress and useragent is more surprising.

Does it look like these events are coming from the same device? Do the other fields look normal, or is there something that could suggest it’s bot traffic? Is it just the dvce_sent_tstamp that changes in events coming from the iOS tracker, or also these other fields?

Christophe

Thank you, glad that sounds like a good approach.

Agreed, I’m not sure why those fields would change. I misspoke regarding the iOS tracker: all of these events are coming from the JavaScript tracker, and any number of these fields (dvce_sent_tstamp, user_ipaddress, and useragent) may differ from event to event. I’m not seeing any patterns that would indicate a bot; however, if you have any suggested approaches for determining that, I can take a deeper look.

Another interesting note: the dvce_sent_tstamp lag between the first and second event is generally multiple days or longer…

Happy to provide any other detail that might help us get to the bottom of it. Thanks again!

Hi @tfinkel,

Thanks! I’ll have a think about other fields we could look at to further investigate this.

We received this question from one of our users:

How can I ensure that I kill the minimum number of real users when de-duplicating with the techniques mentioned in your post?

The answer might benefit other people as well, so I’m posting it here.

I would not start by running the de-duplication queries. Instead, I would first make sure all queries join on both the event ID and the collector timestamp (as outlined in the post). This is in almost all cases a good enough solution, because a lot of these events do have a different collector timestamp (e.g. a bot will still send events spread out over time).

I’d then assess whether this is still an issue by looking at the % impact before and after. If it’s much less than 1%, then it’s probably good enough (as it won’t skew the results).

If not, I’d run the deduplication queries once, but in 2 stages. First without step D. This will remove all duplicates that are identical. We now stop most of them in Hadoop, but there will be more of these in the data from 2015 (before we released this). Note that events are never deleted, only moved to a different schema (just in case).

I’d then assess the issue again, and if it’s still significant, I’d run it with step D. This could remove some events from real users, and I don’t know of a reliable way to separate them from other traffic – which is why I leave this as the last option (which hopefully won’t be necessary).

I’m a little confused by the unstructured event example. I can see where it would remove all duplicates from the source atomic table, preserving the originals there, but as written I can’t see how it would write all of the dupes to the corresponding duplicates table. Line 86 executes the exact same subquery to retrieve the dupes as Line 42, but only after the dupes have already been deleted from the source table, so I’d expect this result to be zero. If I run step (d) in my SQL Runner scripts, am I meant to skip steps a-c? Then I’d end up with all originals+dupes in the duplicates table and none in the source.

It almost seems as if the events example, with its window function partitioning by root_id, ordered by root_tstamp, and returning the event_number, would be more effective here. Then I could similarly write event_number = 1 back to the source table and event_number > 1 to the duplicates table. Or am I misunderstanding the intent of the duplicates table?

Thank you.

The links to the SQL above result in a 404.

Believe they are here now: