In a demo pipeline I’m running, duplicate event IDs are consistently coming through into storage (BigQuery), with the duplicated rows completely identical. The duplicates don’t all appear immediately; rows just keep duplicating over time in the destination table. Since these records all share the same etl timestamp, collector timestamp, device timestamp, event ID, etc., I’m assuming I’ve misconfigured something in my pipeline.
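One way to confirm this pattern is to check whether the duplicated rows are identical in *every* column (which points to at-least-once redelivery somewhere in the pipeline) rather than sharing an event ID with different etl timestamps (which would suggest re-processing). A minimal sketch, assuming rows have been pulled out of BigQuery as a list of dicts (the field names here are just illustrative):

```python
from collections import defaultdict

def classify_duplicates(rows):
    """Split duplicated event_ids into two buckets:
    - exact: every duplicated row is identical in all columns
      (consistent with message redelivery, e.g. Pub/Sub at-least-once)
    - differing: same event_id but other fields differ
      (consistent with the event being re-processed in a later ETL run)
    """
    by_id = defaultdict(list)
    for row in rows:
        # Represent each row as a hashable, order-independent tuple
        by_id[row["event_id"]].append(tuple(sorted(row.items())))
    exact, differing = [], []
    for event_id, variants in by_id.items():
        if len(variants) < 2:
            continue  # not duplicated at all
        if len(set(variants)) == 1:
            exact.append(event_id)      # identical in every column
        else:
            differing.append(event_id)  # same id, different payloads
    return exact, differing

# Illustrative rows (hypothetical field values)
rows = [
    {"event_id": "a", "etl_tstamp": "t1"},
    {"event_id": "a", "etl_tstamp": "t1"},  # exact duplicate
    {"event_id": "b", "etl_tstamp": "t1"},
    {"event_id": "b", "etl_tstamp": "t2"},  # same id, different ETL run
    {"event_id": "c", "etl_tstamp": "t1"},  # unique
]
exact, differing = classify_duplicates(rows)
# exact == ["a"], differing == ["b"]
```

If everything lands in the `exact` bucket, the loader (or something upstream) is receiving the same message more than once.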
Has anyone come across this issue before? (Just checking whether this is a known issue before I dig into each step separately.)
My pipeline looks like:
js tracker->scala collector (single instance)->beam enrich->dockerized bigquery loader (single instance)
A possible cause is that the acknowledgement deadline on the subscription to the ‘good’ Pub/Sub topic was too low, so Pub/Sub redelivered messages before the loader acknowledged them. Testing this now.
Think this was it…