Kibana and Redshift mismatch

Setup: AWS infrastructure. After validation, every event goes to two places: one copy goes to Elasticsearch, where it is queried through Kibana, and another copy goes to S3, where a batch job processes the events and writes them into Redshift.

Application: Browser website

There are situations where an event with a given event_id exists in Kibana but not in Redshift. Redshift instead has another event with exactly the same attributes but a different event_id.

All the timestamps (dvce_created_tstamp, dvce_sent_tstamp, collector_tstamp) in Kibana are offset by +2 hours relative to what is seen in S3 and Redshift.

Redshift has a group of events, let’s say 7, all of which have the same attributes and the same dvce_created_tstamp but different dvce_sent_tstamps and therefore different collector_tstamps. In other words, an event that is created once is sent multiple times and collected multiple times, yet every copy has a different event_id. For these near-duplicates, Kibana has:

  • for some users the last duplicate with the same event_id
  • for others the last duplicate with a different event_id. This different event_id is nowhere to be found in Redshift
  • and for other users an event sent and collected later than the last duplicate seen in Redshift, of course having a different event_id, nowhere to be seen in Redshift.
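To make the pattern concrete, here is a rough sketch of how such re-sent groups could be picked out of an export of the events. The sample rows are made up, and domain_userid and event_name are only illustrative attribute columns; the three timestamp columns and event_id are the ones named above.

```python
from collections import defaultdict

# Hypothetical sample rows; the values are invented for illustration.
events = [
    {"event_id": "a1", "domain_userid": "u1", "event_name": "page_view",
     "dvce_created_tstamp": "2019-06-01 10:00:00.000",
     "dvce_sent_tstamp": "2019-06-01 10:00:00.100",
     "collector_tstamp": "2019-06-01 10:00:00.300"},
    {"event_id": "a2", "domain_userid": "u1", "event_name": "page_view",
     "dvce_created_tstamp": "2019-06-01 10:00:00.000",
     "dvce_sent_tstamp": "2019-06-01 10:00:05.100",
     "collector_tstamp": "2019-06-01 10:00:05.300"},
    {"event_id": "b1", "domain_userid": "u2", "event_name": "page_view",
     "dvce_created_tstamp": "2019-06-01 11:00:00.000",
     "dvce_sent_tstamp": "2019-06-01 11:00:00.100",
     "collector_tstamp": "2019-06-01 11:00:00.300"},
]

# Fields that differ between re-sends of the same event, so they are
# left out of the grouping key.
VOLATILE = {"event_id", "dvce_sent_tstamp", "collector_tstamp"}


def find_resend_groups(rows):
    """Return groups of rows that match on every non-volatile field."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items() if k not in VOLATILE))
        groups[key].append(row)
    # Keep only groups with more than one row, ordered by collection time.
    return [
        sorted(group, key=lambda r: r["collector_tstamp"])
        for group in groups.values()
        if len(group) > 1
    ]


resend_groups = find_resend_groups(events)
for group in resend_groups:
    print([row["event_id"] for row in group])
```

Running the same grouping over both the Redshift rows and the Elasticsearch documents would show which copy of each group ended up in which store.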

Although Kibana and Redshift (also S3) take events from the same validation pipeline, Redshift (and S3) has many near-duplicates while Kibana does not. Kibana also has some events that Redshift (and S3) does not.

Hi @Hasan_Shaukat,

Are you using the SP shredder?

It's entirely possible to get duplicates forgoing the shredder. The posted link outlines why this could occur. As for the timestamps, what does the derived timestamp say?

Kind regards,

Hi @kfitzpatrick,

RDB Shredder is being used. The jar in use is version 0.15.0, which could be updated, but event_fingerprint was never enabled; only the user_fingerprint is enabled. Could this be a valid cause of faulty deduplication?
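For reference, this is roughly what I understand fingerprint-based deduplication to mean. It is only a sketch of the general idea, not the RDB Shredder's actual implementation: the excluded fields, the SHA-256 hash, the sample rows and the domain_userid / event_name columns are all my own assumptions.

```python
import hashlib
import json

# Fields assumed to change between re-sends of the same event, so they
# are left out of the fingerprint (an assumption for this sketch).
EXCLUDED = {"event_id", "dvce_sent_tstamp", "collector_tstamp"}


def fingerprint(event):
    """Hash every field that should be identical across re-sends."""
    payload = {k: v for k, v in event.items() if k not in EXCLUDED}
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(events):
    """Keep the earliest-collected event for each fingerprint."""
    seen = set()
    kept = []
    for event in sorted(events, key=lambda e: e["collector_tstamp"]):
        fp = fingerprint(event)
        if fp not in seen:
            seen.add(fp)
            kept.append(event)
    return kept


# Hypothetical rows: a1 and a2 are the same event sent twice.
sample = [
    {"event_id": "a2", "domain_userid": "u1", "event_name": "page_view",
     "dvce_created_tstamp": "2019-06-01 10:00:00.000",
     "dvce_sent_tstamp": "2019-06-01 10:00:05.100",
     "collector_tstamp": "2019-06-01 10:00:05.300"},
    {"event_id": "a1", "domain_userid": "u1", "event_name": "page_view",
     "dvce_created_tstamp": "2019-06-01 10:00:00.000",
     "dvce_sent_tstamp": "2019-06-01 10:00:00.100",
     "collector_tstamp": "2019-06-01 10:00:00.300"},
    {"event_id": "b1", "domain_userid": "u2", "event_name": "page_view",
     "dvce_created_tstamp": "2019-06-01 11:00:00.000",
     "dvce_sent_tstamp": "2019-06-01 11:00:00.100",
     "collector_tstamp": "2019-06-01 11:00:00.300"},
]

deduped = deduplicate(sample)
print([event["event_id"] for event in deduped])
```

If no fingerprint is present, events that differ only in event_id and the sent/collected timestamps would have nothing in common to match on, which is what the rows in Redshift look like.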

The derived_tstamp also has an offset of 2 hours.
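One thing that might be worth ruling out for the offset (this is an assumption, nothing in the setup confirms it): Kibana can render dates in the browser's time zone, while Redshift and S3 show the stored value as is. If the viewer is at UTC+2, the same instant would look two hours later in Kibana. A small sketch with a made-up timestamp and an assumed UTC+2 viewer:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stored value, as it might appear in Redshift (UTC).
stored_utc = datetime(2019, 6, 1, 10, 0, 0, tzinfo=timezone.utc)

# Assumed viewer time zone of UTC+2, for illustration only.
viewer_zone = timezone(timedelta(hours=2))
displayed = stored_utc.astimezone(viewer_zone)

# The wall-clock readings differ, but both describe the same instant.
wall_clock_gap = displayed.replace(tzinfo=None) - stored_utc.replace(tzinfo=None)
offset_hours = wall_clock_gap / timedelta(hours=1)

print(stored_utc.isoformat())
print(displayed.isoformat())
print(offset_hours, stored_utc == displayed)
```

If that is what is happening, the raw document in Elasticsearch should carry the same value as Redshift, and only the display in Kibana differs.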