I am using Snowplow collectors writing to a Kinesis sink, with enrichers reading from the Kinesis pipeline. I see that, for various reasons (network issues, latencies, and to avoid losing data), the collectors/enrichers retry writing events to the shards, thereby producing exact duplicate events in the S3 folders and Redshift tables (same event transaction ID, collector timestamp, and record insert timestamp).
After various checks, we know it's from the collector end and not the client end. We also see the duplicate count is 3 most of the time, which is the number of retries the collectors are configured for by default.
Is there a place and a way to remove these duplicates in the stream, or in the enrichers, before writing to S3? We don't want to handle it at the table level, as that would be expensive on our end. Any suggestions would be really helpful.
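For context, this is roughly the kind of in-stream step we had in mind: a minimal sketch (not Snowplow's own API; the event shape and field names here are assumptions based on the fields mentioned above) that drops records whose key fields match a recently seen event, since collector retries produce byte-identical copies.

```python
# Hypothetical sketch: de-duplicate enriched events in-stream by keying on
# fields that are identical across collector retries. Assumes events arrive
# as dicts with "event_id" and "collector_tstamp" fields.
from collections import OrderedDict

class StreamDeduplicator:
    """Keeps a bounded LRU set of recently seen event keys."""

    def __init__(self, max_keys=1_000_000):
        self.max_keys = max_keys
        self._seen = OrderedDict()

    def is_duplicate(self, event):
        key = (event["event_id"], event["collector_tstamp"])
        if key in self._seen:
            self._seen.move_to_end(key)  # refresh recency
            return True
        self._seen[key] = None
        if len(self._seen) > self.max_keys:
            self._seen.popitem(last=False)  # evict the oldest key
        return False

dedup = StreamDeduplicator()
events = [
    {"event_id": "e1", "collector_tstamp": "2016-01-01 00:00:00"},
    {"event_id": "e1", "collector_tstamp": "2016-01-01 00:00:00"},  # retry
    {"event_id": "e2", "collector_tstamp": "2016-01-01 00:00:01"},
]
unique = [e for e in events if not dedup.is_duplicate(e)]
print(len(unique))  # 2
```

Note this in-memory approach only de-duplicates within a single worker; a fleet of enrichers would need a shared key store instead, which is part of why we're asking whether there's a supported place in the pipeline for this.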