Setup: AWS infrastructure. After validation, every event goes to two places: one copy goes to Elasticsearch to be queried through Kibana, and another copy goes to S3, where a batch job processes the events and writes them into Redshift.
Application: Browser website
There are situations where there is an event with an event_id in Kibana, but not in Redshift. Redshift has another event that has exactly the same attributes but a different event_id.
All the timestamps (dvce_created_tstamp, dvce_sent_tstamp, collector_tstamp) in Kibana are offset by +2 hours relative to what is seen in S3 and Redshift.
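To check whether the +2hr difference is one constant shift across all three fields, the same event can be compared in both stores. The sketch below is only an illustration: the function name `offsets` and the input shape (dicts keyed by event_id, with timestamps as ISO strings) are assumptions, not part of the pipeline.

```python
from datetime import datetime, timedelta

def offsets(kibana_rows, redshift_rows, fields):
    """Collect every (Kibana - Redshift) timestamp difference.

    Both inputs are hypothetical exports: {event_id: {field: ISO string}}.
    Only event_ids present in both stores are compared.
    """
    seen = set()
    for event_id, k in kibana_rows.items():
        r = redshift_rows.get(event_id)
        if r is None:
            continue  # event_id missing from Redshift, nothing to compare
        for f in fields:
            seen.add(datetime.fromisoformat(k[f]) - datetime.fromisoformat(r[f]))
    return seen
```

If the returned set contains a single value such as `timedelta(hours=2)`, the offset is uniform, which would point towards a timezone or display effect rather than different stored values.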
Redshift has a group of events, let’s say 7, all of which have the same attributes and the same dvce_created_tstamp but different dvce_sent_tstamps and therefore different collector_tstamps. In other words, an event that is created once is sent multiple times and collected multiple times, yet every copy has a different event_id. For these near-duplicates, Kibana has:
- for some users the last duplicate with the same event_id
- for others the last duplicate with a different event_id. This different event_id is nowhere to be found in Redshift
- and for other users an event sent and collected later than the last duplicate seen in Redshift, of course having a different event_id, nowhere to be seen in Redshift.
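The groups of near-duplicates described above can be found by grouping events on the fields that stay constant and ignoring event_id. This is a minimal sketch, assuming the events have been exported as a list of dicts; `duplicate_groups` and the choice of key fields are hypothetical and would need to match the attributes that are actually identical.

```python
from collections import defaultdict

def duplicate_groups(events, key_fields):
    """Group events that share the same values for key_fields.

    Returns only groups with more than one event, each sorted by
    collector_tstamp so the "last duplicate" is the final element.
    """
    groups = defaultdict(list)
    for e in events:
        groups[tuple(e[f] for f in key_fields)].append(e)
    return {
        key: sorted(members, key=lambda e: e["collector_tstamp"])
        for key, members in groups.items()
        if len(members) > 1
    }
```

The last element of each group can then be compared with what Kibana shows for the same user and dvce_created_tstamp.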
Although Kibana and Redshift (and S3) take events from the same validation pipeline, Redshift (and S3) has many near-duplicates while Kibana does not, and Kibana has some events that Redshift (and S3) does not.
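The mismatch between the two stores can be quantified by comparing the event_ids exported from each. The sketch below assumes both lists have already been pulled for the same time window; `compare_ids` is a hypothetical helper, not an existing tool.

```python
def compare_ids(kibana_ids, redshift_ids):
    """Split event_ids into those in both stores and those in only one."""
    k, r = set(kibana_ids), set(redshift_ids)
    return {
        "both": k & r,
        "only_kibana": k - r,      # events Redshift never received
        "only_redshift": r - k,    # e.g. the extra near-duplicates
    }
```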