POSTed bad events, are they all dropped?


#1

We're looking to reprocess some bad events, and we're sending them from the collectors in batches of 50 via POST.

When there is 1 bad event in the batch, is the whole batch dropped?

I can see that the raw line in the bad events includes all of them, valid or not, but I'm just not sure whether the good ones were actually passed on.

Cheers


#2

It'll only be the bad events in a payload that get dropped. If the payload that arrives at the collector contains 49 good events and 1 bad one, the 49 good events should flow through the entire pipeline as normal.

If you're running event recovery, one thing to note is that because you're reprocessing the original 'raw' payloads that contain a mix of good and bad events, you may end up with duplicate events (there's a note on this in the caveats section at the end of the page).


#3

Thanks Mike, that’s what I suspected.

Currently the error information we see isn't really targeted at payloads with multiple events: there's no indication of which event in the payload is the bad one, or even whether the problem lies in the event body or in one of its contexts.

Is the current recommendation to run every bad event through JSON Schema validation of its payload, event body, and contexts to find where the issue is?
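For example, a first-pass check along these lines might locate the offending event in a batch. This is just a sketch: the field names (`e`, `eid`, `tv`, `co`) follow the Snowplow tracker protocol, but the validation rules here are a deliberately minimal stand-in for full JSON Schema validation.

```python
import json

# Illustrative required fields (event type, event id, tracker version) --
# a stand-in for real JSON Schema validation, not the full protocol.
REQUIRED_EVENT_FIELDS = {"e", "eid", "tv"}

def find_bad_events(payload_json):
    """Return (index, problems) for each event in a POST payload that fails
    validation, so a batch of 50 can be narrowed to the offending event(s)."""
    payload = json.loads(payload_json)
    bad_indices = []
    for i, event in enumerate(payload.get("data", [])):
        missing = REQUIRED_EVENT_FIELDS - event.keys()
        if missing:
            bad_indices.append((i, sorted(missing)))
            continue
        # Contexts arrive as a JSON string under "co"; check that it parses.
        co = event.get("co")
        if co is not None:
            try:
                json.loads(co)
            except json.JSONDecodeError:
                bad_indices.append((i, ["co (malformed JSON)"]))
    return bad_indices

batch = json.dumps({"data": [
    {"e": "pv", "eid": "a1", "tv": "js-2.5"},
    {"e": "se", "tv": "js-2.5"},                       # missing eid
    {"e": "ue", "eid": "a3", "tv": "js-2.5", "co": "{not json"},
]})
print(find_bad_events(batch))  # [(1, ['eid']), (2, ['co (malformed JSON)'])]
```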

Also what would happen if a collector payload had errors in multiple events? Would that end up as a single bad row in the bad folder, or would it have an entry for each bad event?

Cheers


#4

There would be an entry for each bad event.

That’s a fair point - we have some plans to evolve this in the future:

https://github.com/snowplow/snowplow/issues/2438


#5

Thanks Alex!

So for now, when reprocessing bad rows that came from a POST with multiple events, we need to be very careful.

Would it be a good idea to have a process where we look at the run identifier, pull out all event IDs that successfully made it through to the destination, and then exclude those from the bad-row reprocessing?

We would likely run multiple bad-row processes over the top fixing different issues, so we would need to be able to pass in multiple runs.
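The exclusion process described above might look something like this sketch. It assumes (hypothetically) that each bad row carries a `line` field containing the original JSON payload and that each event in it exposes an `eid`; `recovered_eids` would be built from whatever the destination reports per run.

```python
import json

def exclude_recovered(bad_rows, recovered_eids):
    """Filter out events whose id already reached the destination.
    Assumes each bad row's 'line' field is a JSON payload whose events
    carry an 'eid' (event id) -- an illustrative simplification."""
    keep = []
    for row in bad_rows:
        payload = json.loads(row["line"])
        remaining = [e for e in payload.get("data", [])
                     if e.get("eid") not in recovered_eids]
        if remaining:  # drop the row entirely if every event already landed
            keep.append({**row, "line": json.dumps({"data": remaining})})
    return keep

# recovered_eids could be the union of event ids loaded across several runs:
recovered = {"a1", "a2"} | {"b7"}  # e.g. one set per run identifier
bad = [
    {"line": json.dumps({"data": [{"eid": "a1"}, {"eid": "c9"}]})},
    {"line": json.dumps({"data": [{"eid": "a2"}]})},
]
print(len(exclude_recovered(bad, recovered)))  # 1 -- only the c9 row survives
```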

Or do you have any other suggestions for processing bad rows that are in this state? For example, we have a payload with 50 events where all 50 are bad, so we now have 50 identical bad lines with no easy way to tell them apart.
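Since those 50 entries are byte-identical copies for the same payload, one possibility (a sketch, not a recommendation from the thread) is to collapse exact duplicates before reprocessing so the payload is only recovered once:

```python
def dedupe_bad_lines(lines):
    """Collapse byte-identical bad rows, preserving first-seen order, so a
    payload whose 50 events all failed (emitting 50 identical entries) is
    only reprocessed once."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique

lines = ['{"same": "payload"}'] * 50 + ['{"other": "payload"}']
print(len(dedupe_bad_lines(lines)))  # 2
```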

Cheers,
Dean


#6

No worries Dean,

What's the ultimate target of the data - is it Redshift or Postgres? If so, you can potentially lean on the RDB Shredder to do the dedupe for you "for free", because with cross-batch dedupe enabled, it won't load events that it has already processed.

That’s exactly what the DynamoDB-powered cross-batch dedupe in RDB Shredder does :spiral_notepad:
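Conceptually, that cross-batch dedupe works like a conditional "claim" on each event's identity. The in-memory class below is only a toy stand-in for the DynamoDB manifest, to illustrate the idea; the real Shredder's key layout and API are not shown here.

```python
class DedupeManifest:
    """Toy in-memory stand-in for a cross-batch dedupe manifest: an event
    loads only if its (event_id, fingerprint) pair has not been claimed by
    an earlier batch. DynamoDB achieves this with a conditional put."""
    def __init__(self):
        self._seen = set()

    def try_claim(self, event_id, fingerprint):
        # Mimics a conditional write: succeeds only for an unseen pair.
        key = (event_id, fingerprint)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

manifest = DedupeManifest()
batch1 = [("e1", "fp-a"), ("e2", "fp-b")]
batch2 = [("e2", "fp-b"), ("e3", "fp-c")]  # e2 is a re-run duplicate
loaded = [eid for eid, fp in batch1 + batch2 if manifest.try_claim(eid, fp)]
print(loaded)  # ['e1', 'e2', 'e3'] -- the duplicate e2 is skipped
```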