Wrongly formatted data is not going to the bad bucket


#1

We have been tracking some custom events. One of our custom events has a field with datatype date, so we added "format": "date", something like this:

"batch_creation_date": {
    "type": "string",
    "format": "date"
}

But when this field contains an arbitrary string, it doesn't throw a validation error, e.g.:

{"schema":{"vendor":"com.company","name":"some_custom_event","format":"jsonschema","version":"1-0-0"},"data":{"hash_id":"bFNINTE2MzcxNjR8QTE1OXwyMDE2LTEyLTA4fE58Nzg1ODQ5NTA3NDQ4Y2NkNC4zMjg4NzI2Nw==","receiver":"human","email_domain":"gmail.com","receiver_address":"7e2083b3d091ad7edbfeaa51a1302d3f","content":"Merry Christmas","sender":"system","event_action":"open","batch_creation_date":"<a class=","campaign_id":"A159"},"hierarchy":{"rootId":"b218d67a-df3c-46bb-a600-644ae1a6deb9","rootTstamp":"2016-12-08 18:42:30.000","refRoot":"events","refTree":["events","some_custom_event"],"refParent":"events"}}

Now the above JSON should fail validation, because the date field contains a non-date string. It does fail in online schema validators, but here it is not sent to the bad bucket, and consequently the whole StorageLoader run fails while loading the data into Redshift.

Can anyone point out the cause of this?


#3

Hello @jimy2004king,

In the JSON Schema v4 spec there is no date format, only date-time. The specification also defines the behavior for unknown formats: they are simply ignored, in case someone has extended the validator with a custom format.
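Roughly, the behavior the spec mandates looks like this (a plain-Python sketch for illustration only, not our actual validator):

```python
# Sketch of draft-4 "format" semantics: "type" is always enforced,
# but a format the validator does not know about is silently ignored.
KNOWN_FORMATS = {}  # e.g. {"date-time": some_checker} if the validator is extended

def validate(value, schema):
    errors = []
    if schema.get("type") == "string" and not isinstance(value, str):
        errors.append("expected a string")
    fmt = schema.get("format")
    if fmt in KNOWN_FORMATS and not KNOWN_FORMATS[fmt](value):
        errors.append("invalid %s" % fmt)
    # An unknown format (like "date") falls through with no error.
    return errors

schema = {"type": "string", "format": "date"}
print(validate("<a class=", schema))  # [] -- passes, "date" is ignored
print(validate(42, schema))           # ['expected a string']
```

This is why your instance sails through: the value is a string, and "date" is an unknown format, so there is nothing left to check.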

date exists in JSON Schema v3, which we don't use, but I guess the online validator you tried either used v3 or added date as one of its custom formats.
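If you do want this field checked, date-time is the one timestamp format draft-4 defines (values follow RFC 3339, e.g. "2016-12-08T18:42:30Z"), so a schema along these lines could work, with the caveat that whether date-time is actually enforced still depends on the validator:

```json
"batch_creation_date": {
    "type": "string",
    "format": "date-time"
}
```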


#4

@jimy2004king, we also have a tool called igluctl, which is supposed to catch this kind of mistake. It doesn't handle this particular one yet (I created a ticket to address that), but you may still want to use it to find other subtle errors in your schemas.


#5

@anton Thanks for pointing that out. But could it be that we actually use Schema v3 and not v4? Because in the example above, the JSON fails for v4 and passes for v3. Here is the gist - http://jsonschemalint.com/#/version/draft-04/markup/json?gist=6265d3160e6dff1c6cca5ea6c23f325b


#6

@jimy2004king, yes, that is a weird decision by jsonschemalint.com and a very unfortunate coincidence for us.

jsonschemalint.com uses different validators for different versions: AJV for v4 and JSV for v3.
JSV is slow and buggy. It validates your instance against v3 simply because it doesn't seem to care about format at all: notice, for example, format: uri in this snippet.
AJV is very flexible and has lots of features, but those features don't have much in common with the specification I linked before.

As I said, this is a very unfortunate coincidence, because we haven't had such problems with online validators before (I think jsonschemalint.com changed its validator core very recently). At Snowplow we use this validator (previously known as fge's) in enrichment, shredding, and all other places. You can use its online version to be sure which JSONs will pass validation and which will not. However, I found important differences between the online version and our apps: 1) it doesn't understand the self-describing JSON's self property; 2) it marks unknown formats as errors.
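That second difference boils down to a single policy choice, which can be sketched like this (a toy flag in plain Python, purely illustrative; the real validator is the Java library linked above):

```python
def check_format(fmt, known_formats, strict=False):
    """Return an error message for an unknown format, or None.

    strict=False mirrors our apps: unknown formats are ignored,
    as the draft-4 spec prescribes.
    strict=True mirrors the online version: an unknown format
    is reported as an error.
    """
    if fmt in known_formats:
        return None  # known format: assume it was checked elsewhere
    return "unknown format: %s" % fmt if strict else None

known = {"date-time", "uri", "email"}
print(check_format("date", known, strict=False))  # None -- silently ignored
print(check_format("date", known, strict=True))   # unknown format: date
```

So the same schema can look fine to our pipeline and broken to the online validator, which is exactly the confusion in this thread.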

Once again, we didn't have these problems before and used to rely on online validators, but it seems we need to reconsider their usage. So, thanks for raising this.
