Data during the validation step isn't decoded correctly

Colm · January 14, 2019, 12:04pm

We’d like to be able to analyse the bad rows in some way - maybe by querying them with SQL, say in Athena.

That makes sense. In that case take a look at this blog (if you’re running a recent-version RT pipeline) and/or this one (otherwise).

If the goal is to fix the errors, I recommend counting rows per error message and manually decoding one or two rows for each message, then fixing the cause. For the decoding I use the Atom text editor, which has handy packages for decoding base64 and prettifying JSON; alternatively, Snowflake Analytics’ Snowplow Inspector Chrome plugin has a bad row decoder in there too. If the error message is the same, then the same field is corrupt for the same reason, so you only need to find the issue in one bad row in order to fix all of them.
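The counting-and-decoding approach above can be sketched in Python. This is only a rough sketch: it assumes the bad rows are stored as newline-delimited JSON, with each row carrying a base64-encoded `line` field and an `errors` list. Check that against your own bad rows first. The function names are made up for illustration.

```python
import base64
import json
from collections import Counter


def error_messages(bad_row):
    """Return the error messages of one bad row as a list of strings."""
    messages = []
    for err in bad_row.get("errors", []):
        # Assumption: errors are either plain strings or objects
        # with a "message" field.
        messages.append(err["message"] if isinstance(err, dict) else str(err))
    return messages


def count_by_error(lines):
    """Count bad rows per error message, given newline-delimited JSON."""
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        for message in error_messages(json.loads(raw)):
            counts[message] += 1
    return counts


def decode_line(bad_row):
    """Base64-decode the original payload for manual inspection."""
    decoded = base64.b64decode(bad_row["line"])
    # The payload may contain binary bytes, so replace anything that
    # is not valid UTF-8 instead of raising.
    return decoded.decode("utf-8", errors="replace")
```

Passing an open file of bad rows to `count_by_error` and printing `counts.most_common()` gives the messages ordered by frequency; `decode_line` can then be run on one or two rows per message.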

Btw, just for clarification, should there be no bad data at all? Is there a certain amount of ‘bad data rows’ in most production systems, or should we have zero?

There are going to be bad rows which can be ignored due to random traffic hitting your collector - mostly bots. These will be recognisable from errors like vendor with version... or Querystring is empty... etc.
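As a rough illustration, ignorable noise can be separated from the rest by matching on fragments of the error message. The fragments below are copied as-is from the truncated examples in this post and are only placeholders; replace them with the exact messages you see in your own bad rows. The function names are hypothetical.

```python
# Placeholder fragments taken from the truncated examples above.
# Assumption: adjust these to the real messages in your bad rows.
IGNORABLE_FRAGMENTS = (
    "vendor with version",
    "Querystring is empty",
)


def is_ignorable(message, fragments=IGNORABLE_FRAGMENTS):
    """True if the message contains any ignorable fragment (case-insensitive)."""
    lowered = message.lower()
    return any(fragment.lower() in lowered for fragment in fragments)


def split_counts(counts, fragments=IGNORABLE_FRAGMENTS):
    """Split a {message: count} mapping into (noise, actionable) dicts."""
    noise, actionable = {}, {}
    for message, n in counts.items():
        target = noise if is_ignorable(message, fragments) else actionable
        target[message] = n
    return noise, actionable
```

Whatever ends up in `actionable` is what is worth decoding and fixing; the `noise` side is still worth a glance now and then in case the fragments match more than intended.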

Aside from these, ideally you have 0 bad rows, but sometimes there are edge cases so low in number that people decide not to bother with them. I always err on the side of having a very strict schema and a very robust tracking implementation - the idea is that validation forces you to ensure your data is high quality when it lands in the DB. A more permissive schema increases the chance of corrupt data.

It’s not always possible to have 0 bad rows, however - if you don’t have control over the environment you’re tracking in, or if your SP tags are on third-party sites, then you’re probably going to have some unavoidable ones.

In that case it’s a judgement call - generally about finding the best balance: a schema that is as strict as possible, but not so strict that enough data fails validation to impact your usage of it.

Additionally it’s always a good idea to monitor bad rows - checking in regularly.

Hope that’s helpful!
