Hello everyone,
I have an issue with the ids of duplicate events after the shredding step: they do not match between 2 linked tables (“atomic.events” and “atomic.custom_event” respectively).
Let me show some sample events to illustrate the issue.
The following events were sent from the tracker:
event_id, collector_tstamp
4da86136-f1f7-41c2-9084-fc15f8d1c579,2021-06-17 23:59:59.000
fff0bfb5-b95f-4dfe-92f5-374080d71ac0,2021-06-17 23:59:59.000
05aee854-828d-4dab-8b42-069e9d80c7f0,2021-06-17 23:59:59.000
a9b5c2b2-0556-4218-9437-11238eeca64f,2021-06-17 23:59:59.000
2aaa096d-d9df-4cf5-b82c-4dc63d2a7963,2021-06-18 00:00:01.000
fb920ec4-ad3e-41c4-a966-484938c9e6e8,2021-06-18 00:00:01.000
05aee854-828d-4dab-8b42-069e9d80c7f0,2021-06-18 00:00:01.000
a9b5c2b2-0556-4218-9437-11238eeca64f,2021-06-18 00:00:01.000
The duplicates differ only in the “dvce_sent_tstamp” field; all other fields are the same. These events were successfully enriched and sent to the shredder.
By the way, we’re not using “Event fingerprint enrichment” in our enrichment step.
I expected the shredder to detect the duplicates and deduplicate them, so that there would be either 6 or 8 events (keep only the first occurrence and delete the others, or keep all events). Instead, I get a set of 12 different identifiers.
There are 8 records in the “atomic.events” table, 4 for the unique events and 4 for the duplicates:
event_id, collector_tstamp
4da86136-f1f7-41c2-9084-fc15f8d1c579,2021-06-17 23:59:59.000000
fff0bfb5-b95f-4dfe-92f5-374080d71ac0,2021-06-17 23:59:59.000000
33726166-4356-4cfe-a02c-baf049a676f5,2021-06-17 23:59:59.000000
7bc53840-9268-4ea4-8dcd-20b3458252a8,2021-06-17 23:59:59.000000
2aaa096d-d9df-4cf5-b82c-4dc63d2a7963,2021-06-18 00:00:01.000000
fb920ec4-ad3e-41c4-a966-484938c9e6e8,2021-06-18 00:00:01.000000
b0062a23-515c-4357-93c2-c7c51b866364,2021-06-18 00:00:01.000000
fd5f2337-e7b8-401e-8526-62194d32178e,2021-06-18 00:00:01.000000
There are 8 records in the “atomic.custom_event” table, 4 for the unique events and 4 for the duplicates:
root_id, root_tstamp
4da86136-f1f7-41c2-9084-fc15f8d1c579,2021-06-17 23:59:59.000000
fff0bfb5-b95f-4dfe-92f5-374080d71ac0,2021-06-17 23:59:59.000000
2a448b51-a45f-4b80-8953-575b7f37a816,2021-06-17 23:59:59.000000
6e861595-d4ec-4c0a-a3e4-6bcca8488a9f,2021-06-17 23:59:59.000000
2aaa096d-d9df-4cf5-b82c-4dc63d2a7963,2021-06-18 00:00:01.000000
fb920ec4-ad3e-41c4-a966-484938c9e6e8,2021-06-18 00:00:01.000000
88e6bd45-10b5-4b07-b1ac-6074d7e0929b,2021-06-18 00:00:01.000000
fb563b1e-015c-47b2-a69e-b503b9bd1764,2021-06-18 00:00:01.000000
Also, there are 4 records in the duplicates table, which point to the correct events from the input (original_event_id):
root_id, root_tstamp, original_event_id
16b72348-3414-4bbe-a900-b5ef41e9ec21,2021-06-17 23:59:59.000000,05aee854-828d-4dab-8b42-069e9d80c7f0
52257594-c68f-4da5-aef8-1bd1facfb438,2021-06-17 23:59:59.000000,a9b5c2b2-0556-4218-9437-11238eeca64f
d449aed8-9584-45e5-a1b7-8e583fd487c3,2021-06-18 00:00:01.000000,05aee854-828d-4dab-8b42-069e9d80c7f0
4145459f-112b-4672-bf82-3370e012195d,2021-06-18 00:00:01.000000,a9b5c2b2-0556-4218-9437-11238eeca64f
So both tables have new records, but their ids do not match, so there is no way for me to do a SQL join between them. What's more, those ids do not match anything in “atomic.duplicates” either.
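To make the mismatch concrete, here is a small Python sketch that only compares the id sets copied from the tables above. The variable names are mine and purely illustrative; nothing here is part of Snowplow itself.

```python
# Ids copied from the tables in this post; names are illustrative only.
events_ids = {
    "4da86136-f1f7-41c2-9084-fc15f8d1c579",
    "fff0bfb5-b95f-4dfe-92f5-374080d71ac0",
    "33726166-4356-4cfe-a02c-baf049a676f5",
    "7bc53840-9268-4ea4-8dcd-20b3458252a8",
    "2aaa096d-d9df-4cf5-b82c-4dc63d2a7963",
    "fb920ec4-ad3e-41c4-a966-484938c9e6e8",
    "b0062a23-515c-4357-93c2-c7c51b866364",
    "fd5f2337-e7b8-401e-8526-62194d32178e",
}
custom_event_ids = {
    "4da86136-f1f7-41c2-9084-fc15f8d1c579",
    "fff0bfb5-b95f-4dfe-92f5-374080d71ac0",
    "2a448b51-a45f-4b80-8953-575b7f37a816",
    "6e861595-d4ec-4c0a-a3e4-6bcca8488a9f",
    "2aaa096d-d9df-4cf5-b82c-4dc63d2a7963",
    "fb920ec4-ad3e-41c4-a966-484938c9e6e8",
    "88e6bd45-10b5-4b07-b1ac-6074d7e0929b",
    "fb563b1e-015c-47b2-a69e-b503b9bd1764",
}
duplicates_root_ids = {
    "16b72348-3414-4bbe-a900-b5ef41e9ec21",
    "52257594-c68f-4da5-aef8-1bd1facfb438",
    "d449aed8-9584-45e5-a1b7-8e583fd487c3",
    "4145459f-112b-4672-bf82-3370e012195d",
}

# Ids present in both tables, i.e. the rows a join on event_id = root_id would return.
joinable = events_ids & custom_event_ids
# Ids that exist in only one of the two tables (the regenerated ones).
only_events = events_ids - custom_event_ids
only_custom = custom_event_ids - events_ids
# All distinct ids across both tables.
all_ids = events_ids | custom_event_ids
# root_ids from the duplicates table that appear in either of the other tables.
dup_matches = duplicates_root_ids & all_ids

print(len(joinable), len(only_events), len(only_custom), len(all_ids), len(dup_matches))
# prints: 4 4 4 12 0
```

In other words, only the 4 unique events can be joined; each table has its own 4 ids for the duplicates, which gives the 12 identifiers in total, and none of the root_id values in the duplicates table appears in either table.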
Could someone please explain what happens during deduplication, and what I can do to actually remove the duplicates?