Custom schema property types/sizes

Hi all, question on creating custom JSON schemas. I admit I am new to this and haven’t gotten into everything yet.

When setting up a custom schema and assigning a type to a property, what size is an integer? Are they 32-bit or 64-bit integers?

When reading the docs behind the link to JSON schema, it listed array, boolean, integer, number, null, object, and string as possible types. In other things I have read with Snowplow they talk about GUID and other types as well. Is there a complete list of supported types and what size they correspond to?

Thanks for any help!
Bruce

Hi @Bruce.Arp,

We follow the standard specification for JSON Schema. The main requirement, however, is that the schema has to be a self-describing one. Below is a list of articles you might find useful while getting your head around this topic.

The size of a value for a given type is platform-dependent. You can set limits on the values you expect, so that they fit in your preferred storage. Here are a couple of examples for clarity:

"LargeInt" : {
	"type" : ["integer", "null"],
	"maximum" : 9223372036854775807,
	"minimum" : 0
},
"String": {
	"type": "string",
	"maxLength": 500
}
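
To make the effect of these limits concrete, here is a minimal Python sketch that checks a value against the two property definitions above. The `fits` helper is hypothetical (it is not part of Snowplow or of any JSON Schema library) and only understands the keywords used in these two examples; a real pipeline would rely on a full JSON Schema validator.

```python
# Property definitions copied from the examples above.
LARGE_INT = {"type": ["integer", "null"], "maximum": 9223372036854775807, "minimum": 0}
STRING = {"type": "string", "maxLength": 500}


def fits(value, prop):
    """Hypothetical helper: True if value satisfies the type and limit keywords in prop."""
    types = prop["type"] if isinstance(prop["type"], list) else [prop["type"]]
    if value is None:
        return "null" in types
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        if "integer" not in types:
            return False
        if "minimum" in prop and value < prop["minimum"]:
            return False
        if "maximum" in prop and value > prop["maximum"]:
            return False
        return True
    if isinstance(value, str):
        if "string" not in types:
            return False
        # maxLength counts characters, as does len() on a Python str.
        return len(value) <= prop.get("maxLength", float("inf"))
    return False


print(fits(2**63 - 1, LARGE_INT))  # largest signed 64-bit value, within the maximum
print(fits(2**63, LARGE_INT))      # one past the maximum
print(fits("x" * 501, STRING))     # one character over maxLength
```

The maximum in "LargeInt" is the largest signed 64-bit integer, so with that limit in place any accepted value fits a 64-bit integer column.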

Once you get familiar with the topic, you are welcome to ask specific questions related to your application/scenario. We’ll be happy to help.

Regards,
Ihor

Thanks Ihor, time to start reading and learning!

Bruce

Hello @Bruce.Arp,

Just to follow up on @Ihor’s great summary of JSON Schema: I can recommend having a look at our project Schema Guru, which is dedicated to deriving a JSON Schema from a set of JSON instances.

Basically, Schema Guru can give you some sane defaults for things like integer size and string length, based only on your set of JSONs, while notably reducing the chance of mistakes and the manual labor involved. Also, when you’re ready to load your events into Redshift, you can use the Schema-to-DDL generator in Schema Guru (this functionality will be moved out into another project soon).

But overall, it is very important to understand the basics of JSON Schema, so the links above are still highly recommended.
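
As a rough illustration of the idea of deriving limits from instances, the Python sketch below scans a few flat JSON objects and records the observed integer range and longest string per key. It is not Schema Guru's actual algorithm; `infer_limits` and the sample events are made up for this example, and nested objects, arrays, and mixed types are deliberately ignored.

```python
import json


def infer_limits(instances):
    """Hypothetical helper: observed min/max for integers and max length for strings, per key."""
    limits = {}
    for instance in instances:
        for key, value in instance.items():
            if isinstance(value, bool):
                continue  # bool is a subclass of int in Python; skip it
            if isinstance(value, int):
                entry = limits.setdefault(
                    key, {"type": "integer", "minimum": value, "maximum": value}
                )
                if entry["type"] != "integer":
                    continue  # key was seen with another type; ignored in this sketch
                entry["minimum"] = min(entry["minimum"], value)
                entry["maximum"] = max(entry["maximum"], value)
            elif isinstance(value, str):
                entry = limits.setdefault(key, {"type": "string", "maxLength": len(value)})
                if entry["type"] != "string":
                    continue
                entry["maxLength"] = max(entry["maxLength"], len(value))
    return limits


# Made-up sample events, for illustration only.
events = [
    json.loads('{"count": 3, "name": "ab"}'),
    json.loads('{"count": 40, "name": "abcde"}'),
]
limits = infer_limits(events)
print(json.dumps(limits, indent=2))
```

Limits observed in a sample are only a starting point; you would normally widen them to cover values you expect to see later.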

Cheers,
Anton

Hi @ihor,
I have an additional question related to this topic:
Are those properties (minimum and maximum for numeric types and maxLength for strings) required, or can we omit them?

Technically they aren’t required but it’s generally a good idea to include them for validation and DDL purposes.

Specifying no maxLength will result in a VARCHAR(4096) (for Redshift / Snowflake), and I believe an integer without a maximum will default to BIGINT. If you’re using BigQuery, where you can’t define these constraints, it matters less. If you aren’t using BigQuery, however, these options will have a sizeable impact on speed and disk usage, as selecting an appropriate data type will quite often get you better compression and reduce the number of bytes on disk for a given block.
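
The Python sketch below shows one way such a mapping from schema limits to a column type could look. It is only an illustration, not the actual DDL generator: `column_type` is a hypothetical function, the fallbacks follow the defaults described above, and the SMALLINT / INTEGER / BIGINT boundaries are the standard 16-, 32- and 64-bit signed ranges.

```python
def column_type(prop):
    """Hypothetical mapping from a JSON Schema property to a Redshift-style column type."""
    types = prop["type"] if isinstance(prop["type"], list) else [prop["type"]]
    if "string" in types:
        # Fall back to 4096 when maxLength is omitted, as described above.
        # Note: maxLength counts characters, while Redshift VARCHAR sizes are in bytes.
        return "VARCHAR({})".format(prop.get("maxLength", 4096))
    if "integer" in types:
        lo, hi = prop.get("minimum"), prop.get("maximum")
        if lo is None or hi is None:
            return "BIGINT"  # assumption: an unbounded integer gets the widest type
        if -32768 <= lo and hi <= 32767:
            return "SMALLINT"
        if -2147483648 <= lo and hi <= 2147483647:
            return "INTEGER"
        return "BIGINT"
    raise ValueError("type not handled in this sketch: {}".format(types))


print(column_type({"type": "string"}))
print(column_type({"type": "integer", "minimum": 0, "maximum": 100}))
print(column_type({"type": ["integer", "null"], "minimum": 0, "maximum": 9223372036854775807}))
```

A tight minimum and maximum let a small counter land in a 2-byte column instead of an 8-byte one, which is where the compression and disk savings come from.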

Hi Mike,
Thanks for your reply, it is really helpful.