Self-describing events without "real" events?

We’ve been using Snowplow self-describing events for a long time, and only recently noticed that some of them have no “real” counterpart in the events table (joining on root_id = event_id returns nothing).
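A minimal sketch of the check that surfaces these rows (the self-describing event table name is a placeholder for one of our own tables):

```sql
-- Placeholder name for a shredded self-describing event table in Redshift.
-- Orphans are rows whose root_id has no matching parent row in atomic.events.
SELECT sde.root_id, sde.root_tstamp
FROM atomic.com_acme_my_event_1 AS sde
LEFT JOIN atomic.events AS ev
  ON ev.event_id = sde.root_id
WHERE ev.event_id IS NULL;
```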

These occurrences begin exactly on the date we upgraded to Snowplow R97 (and switched from the Clojure collector to the Scala collector + Kinesis streaming). The number of affected events is small compared to normal ones (below 1%), so nothing is visibly missing from our data.

How can this happen? Each event arrives as a single line and should therefore stay together until shredding and loading, so how can parts of it go missing?

Thanks!

I’d have a look at a few of the events you are missing and work backwards from there.

  • Do you have MAXERROR set for any of the Redshift loads?
  • What does the data look like when it’s been shredded on S3?
  • What does the data look like in enriched format on S3?
  • What does the data look like in raw format on S3?

That should hopefully yield some useful information about what to dig into next in order to determine why the data isn’t there.
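For the MAXERROR point, one quick way to see whether a Redshift COPY has silently skipped rows is to look at the stl_load_errors system table; a minimal sketch:

```sql
-- Rows rejected during recent COPY loads (Redshift system table).
-- If MAXERROR is allowing rows to be dropped, they show up here.
SELECT starttime, filename, line_number, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 100;
```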

Hey @mike - thanks for your reply.

Our MAXERROR is set to 1, so that shouldn’t be a problem, right?

Regarding the shredded and enriched data, how would you suggest looking into it conveniently?
If I open some of the archived files manually, the chances of randomly spotting a stray ID (if it even exists) are very slim.

@pranas You can use Athena to query the enriched events.

We also have a guide on using Athena to query the shredded events, but that comes with some limitations.

EDIT:

You can use Athena to query the good bucket, which contains the enriched (but not yet shredded!) events.

Find your ‘orphan’ self-describing events in Redshift and see if their root_id matches any event_id in the enriched_events table in Athena.
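A minimal sketch of that lookup, assuming an Athena table has already been defined over the enriched bucket as described above (the table name and the IDs below are placeholders):

```sql
-- Placeholder Athena table defined over the enriched (not yet shredded) events.
-- Check whether any orphan root_id shows up as an event_id upstream.
SELECT event_id, collector_tstamp
FROM snowplow_enriched_events
WHERE event_id IN ('orphan-root-id-1', 'orphan-root-id-2');
```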

MAXERROR shouldn’t be an issue. As @dilyan has mentioned above, you can use Athena (or the S3 Select API) to query for a single row. Make sure you first filter to the etl_tstamp associated with the event_id, as that will significantly reduce the amount of data you need to scan.
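As a sketch of that narrowing step (the table name, partition column and literal values below are assumptions about the Athena setup, not the actual schema):

```sql
-- Restrict the scan before matching on event_id. If the Athena table is
-- partitioned (e.g. by run folder), filtering on the partition column is what
-- actually limits the data scanned; the values here are hypothetical.
SELECT event_id, etl_tstamp, collector_tstamp
FROM snowplow_enriched_events
WHERE run = '2017-11-01-03-10-05'
  AND etl_tstamp = timestamp '2017-11-01 03:15:42.000'
  AND event_id = 'orphan-root-id-1';
```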

Hello @dilyan and @mike - thanks for your suggestions and help.

I was not able to locate any of the ‘orphan’ root_id values in either the shredded or the enriched event files.

I still have to look into the raw lines, but it now seems unlikely that I’ll find anything there.

I would like to understand the underlying mechanism that could alter root_id values. Some internal deduplication, maybe? But in that case they would still have to appear in the shredded files, right?

To clarify my findings:

  • The enriched entries do not contain any ‘orphan’ root_id values.
  • The shredded entries in the atomic-events folder do not contain any ‘orphan’ root_id values.
  • The shredded entries in the self-describing event folder do contain ‘orphan’ root_id values!

This, at least, brings some sanity to the situation. It appears that something goes astray when the enriched data is being shredded - namely, some new IDs are introduced out of the blue.
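For reference, a sketch of how the shredded self-describing files can be inspected for a given root_id, assuming a single-column Athena table defined over the shredded folder (the table and column names are placeholders, and the JSON path reflects my understanding of the shredded format):

```sql
-- Placeholder: a one-column Athena table (line STRING) over the shredded
-- self-describing event folder. In the shredded JSON, hierarchy.rootId is
-- the field that points back at the parent event_id.
SELECT json_extract_scalar(line, '$.hierarchy.rootId')     AS root_id,
       json_extract_scalar(line, '$.hierarchy.rootTstamp') AS root_tstamp
FROM shredded_my_event_raw
WHERE json_extract_scalar(line, '$.hierarchy.rootId') = 'orphan-root-id-1';
```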

As far as I know, the only way for an event_id to change after the fact is during the synthetic deduplication process, in which a new event_id is generated for an event (this happens when events share the same event_id but have differing event_fingerprints).

If this is the case, a duplicate context should have been attached to the event. The event_id of that event should be the newly generated event_id, and originalEventId should contain the event_id from before the regeneration. More on this here.
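In Redshift that context shreds into its own table, so the relationship can be checked with a join along these lines (a sketch; the duplicate context table name assumes the standard shredded layout):

```sql
-- For deduplicated events: event_id is the newly generated ID, while
-- original_event_id (originalEventId in the context) is the ID it replaced.
SELECT ev.event_id           AS new_event_id,
       dup.original_event_id AS old_event_id,
       ev.collector_tstamp
FROM atomic.events AS ev
JOIN atomic.com_snowplowanalytics_snowplow_duplicate_1 AS dup
  ON dup.root_id = ev.event_id
 AND dup.root_tstamp = ev.collector_tstamp;
```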


Thanks @mike - I looked into the duplicate context and here are the findings.

  • None of the orphan self-describing events match entries in the duplicate context ON custom_event.root_id = original_event_id.
  • Some of the orphan events match the duplicate context USING(root_id, root_tstamp).

I’d like to concentrate on the latter, and call those ‘orphan duplicate contexts’.

  • None of the orphan duplicate contexts match normal events ON original_event_id = event_id AND root_tstamp = collector_tstamp.
  • All of the orphan duplicate contexts match normal events ON original_event_id = event_id alone. However, these matches are old (older than our orphan problem) and often look like bots. (Both joins are sketched below.)

The conclusion is that at least some (over 40%, to be precise) of the orphan self-describing events have something to do with Snowplow’s synthetic deduplication. Unfortunately, their event counterparts (rows whose event_id matches either root_id or original_event_id together with the exact timestamp) are nowhere to be found.
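For reference, the two checks roughly correspond to joins like these (orphan_dups stands in for however the orphan duplicate contexts have been materialised; it is a placeholder):

```sql
-- 'orphan_dups' is a placeholder for the orphan duplicate contexts
-- (duplicate-context rows matched to orphan self-describing events
-- on root_id and root_tstamp).

-- Strict match on ID and collector timestamp: returns nothing.
SELECT dup.original_event_id
FROM orphan_dups AS dup
JOIN atomic.events AS ev
  ON ev.event_id = dup.original_event_id
 AND ev.collector_tstamp = dup.root_tstamp;

-- Match on the ID alone: returns only old, bot-like events.
SELECT dup.original_event_id, ev.collector_tstamp
FROM orphan_dups AS dup
JOIN atomic.events AS ev
  ON ev.event_id = dup.original_event_id;
```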

Any ideas?

P.S. I’m also attaching a hand-drawn diagram to illustrate the situation described above.

We are still struggling with this.

It definitely happens only with the newer version of the system (R97 and lambda architecture).

Hey @pranas! We’re sorry it took so long, but we have at last fixed this issue in RDB Loader R31 (see Snowplow RDB Loader R31 released). The problem was in synthetic deduplication; you can find more information in the corresponding blog post.