Data during the validation step isn't decoded correctly

We’d like to be able to analyse the bad rows in some way - maybe by querying them with SQL, say in Athena.

That makes sense. In that case take a look at this blog (if you’re running a recent-version RT pipeline) and/or this one (otherwise).

If the goal is to fix the errors, I recommend counting rows per error message, manually decoding one or two rows for each message, then fixing the cause. For decoding I use the Atom text editor, which has handy packages for decoding base64 and prettifying JSON; alternatively, Snowflake Analytics’ Snowplow Inspector Chrome plugin has a bad row decoder too. If the error message is the same, then the same field is corrupt for the same reason, so you only need to find the issue in one bad row in order to fix all of them.
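If you'd rather script the counting and decoding than do it by hand, here is a minimal Python sketch. It assumes each bad row is one JSON object per line, with a base64-encoded `line` field and an `errors` list whose entries are either strings or objects with a `message` key. That shape is an assumption, so check it against your own bad rows first. The function names are made up for illustration.

```python
import base64
import json
from collections import Counter


def error_messages(bad_row):
    """Return the error messages of one bad row as a list of strings."""
    messages = []
    for err in bad_row.get("errors", []):
        # Assumption: an error is either a plain string or an object
        # carrying its text under a "message" key.
        if isinstance(err, dict):
            messages.append(str(err.get("message", "")))
        else:
            messages.append(str(err))
    return messages


def count_by_message(lines):
    """Count bad rows per error message; `lines` yields one JSON string per bad row."""
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        for message in error_messages(json.loads(raw)):
            counts[message] += 1
    return counts


def decode_line(bad_row):
    """Base64-decode the original payload of one bad row for manual inspection."""
    payload = base64.b64decode(bad_row["line"])
    # The payload may contain binary bytes, so replace anything that is
    # not valid UTF-8 instead of raising.
    return payload.decode("utf-8", errors="replace")
```

Usage would be something like `count_by_message(open("bad_rows.json")).most_common(10)` (the file name is hypothetical), followed by `decode_line(...)` on one example row per message.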

Btw, just for clarification: should there be no bad data at all? Is there a certain amount of ‘bad data rows’ in most production systems, or should we have zero?

There are going to be bad rows which can be ignored due to random traffic hitting your collector - mostly bots. These will be recognisable from errors like vendor with version... or Querystring is empty... etc.
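To keep this noise out of your error counts, you could filter on those message fragments. A rough Python sketch follows; the fragment list contains only the two examples mentioned above and the function names are hypothetical, so extend both to match what you actually see.

```python
# Fragments of error messages that typically come from random or bot traffic.
# Only the two examples from this thread are listed; add your own as needed.
IGNORABLE_FRAGMENTS = ("vendor with version", "Querystring is empty")


def is_ignorable(message):
    """True if the error message contains one of the known noise fragments."""
    return any(fragment in message for fragment in IGNORABLE_FRAGMENTS)


def split_messages(messages):
    """Split error messages into (noise, actionable) lists, preserving order."""
    noise, actionable = [], []
    for message in messages:
        if is_ignorable(message):
            noise.append(message)
        else:
            actionable.append(message)
    return noise, actionable
```

Whatever ends up in the actionable list is what is worth decoding and fixing.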

Aside from these, ideally you have 0 bad rows, but sometimes there are edge cases so low in number that people decide not to bother with them. I always err on the side of having a very strict schema and a very robust tracking implementation - the idea is that validation forces you to ensure your data is high quality when it lands in the database. A more permissive schema increases the chance of corrupt data.

It’s not always possible to have 0 bad rows, however - if you don’t have control over the environment you’re tracking in, or if your SP tags are on third-party sites, then you’re probably going to have some unavoidable ones.

In that case it’s a judgement call - generally about keeping the schema as strict as possible, but not so strict that enough data fails validation to impact your use of it.

Additionally it’s always a good idea to monitor bad rows - checking in regularly.
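One simple way to do that regular check is to track the share of bad rows per day and flag days that exceed a limit. A minimal Python sketch, assuming you can already get daily good and bad counts from somewhere; the 1% threshold and the function names are arbitrary illustrations, not a recommendation.

```python
def bad_row_rate(good_count, bad_count):
    """Share of bad rows among all rows; 0.0 when there is no traffic."""
    total = good_count + bad_count
    return bad_count / total if total else 0.0


def days_over_threshold(daily_counts, threshold=0.01):
    """Return the days whose bad row rate exceeds `threshold`.

    `daily_counts` maps a day label to a (good_count, bad_count) tuple.
    The default threshold is an arbitrary placeholder.
    """
    return [
        day
        for day, (good, bad) in sorted(daily_counts.items())
        if bad_row_rate(good, bad) > threshold
    ]
```

A sudden jump in the rate usually points at a tracking or schema change worth investigating.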

Hope that’s helpful!
