Schema creation


#1

Hi, I am a product manager new to Snowplow analytics. I am working on creating schema(s) for both web and mobile instances, and would like to know:

  1. What are the steps for creating a schema? Do I have to start with a list of unstructured events and custom contexts?
  2. Is it possible to just have one schema for web and one schema for mobile? I noticed versioning is mentioned in Alex’s self-describing JSON article, so I’m curious if one universal schema is even possible.

Thanks! Evelyn


#2

Good Evening,

One universal schema is possible, and I have implemented one at my company. It all depends on what you are going to use it for and who is going to use it. We need all the rows to be similar so that we can load them into a series of Redshift tables that connect to our BI platforms and are ready for analysis by anyone who wants them, while still containing all of the information we need.

If you want to track more than custom events, such as page/screen views, I advise using the out-of-the-box events, as they are validated in the front end instead of at the enrichment stage, which makes them much easier to test.

I hope this helps


#3

  1. It’s worth reading some of the documentation and blog posts around self-describing schemas. This will give you good background on how and why schemas are used, and on the tools involved, such as igluctl and Schema Guru.

  2. Although it’s possible to have a single schema for various events, it’s generally recommended that schemas are specific to certain events or classes of events. You can find a large number of the schemas Snowplow has set up to cover a wide range of use cases here, including many for generalised use cases, such as ad_impressions, as well as 3rd party events that may be triggered by webhooks, such as email clicks or opens.
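
To make the idea of an event-specific schema concrete, here is a minimal sketch in Python of a self-describing JSON schema and an event that references it. The vendor `com.acme`, the event name `button_click` and its fields are made up for illustration; only the `self` block and the `iglu:` URI layout follow the self-describing JSON format.

```python
import json

# Hypothetical schema for one specific event.
# Vendor, name and properties are illustrative, not a real Snowplow schema.
BUTTON_CLICK_SCHEMA = {
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "description": "Schema for a button click event (example only)",
    "self": {
        "vendor": "com.acme",
        "name": "button_click",
        "format": "jsonschema",
        "version": "1-0-0",  # SchemaVer: MODEL-REVISION-ADDITION
    },
    "type": "object",
    "properties": {
        "button_id": {"type": "string"},
        "platform": {"type": "string", "enum": ["web", "mobile"]},
    },
    "required": ["button_id"],
    "additionalProperties": False,
}


def iglu_uri(schema: dict) -> str:
    """Build the iglu: URI that events use to point at this schema."""
    s = schema["self"]
    return f"iglu:{s['vendor']}/{s['name']}/{s['format']}/{s['version']}"


def self_describing_event(schema: dict, data: dict) -> dict:
    """Wrap event data so that it names the schema it should be validated against."""
    return {"schema": iglu_uri(schema), "data": data}


event = self_describing_event(
    BUTTON_CLICK_SCHEMA, {"button_id": "signup", "platform": "web"}
)
print(json.dumps(event, indent=2))
```

The same schema can be sent from both web and mobile trackers; here the `platform` field is just one illustrative way to tell them apart.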

The other advantage of using specific events is that you’ll end up with more performant analysis come query time:

  1. Each event has its own table, so having one large event table reduces the efficiency of joins with atomic.events
  2. There’s a risk of ending up with a single large “fat” event table and incurring the storage/performance costs of having to query multiple columns at once
  3. Schemas are typed (and compressed) to the data you’re expecting - having only a single schema (with limited parameters) may feel like putting a square peg into a round hole. This is a common problem with other analytics systems such as Adobe and Google Analytics, where you have a fixed number of parameters in which you can send data - a trade-off between flexibility and structure.
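
The typing point (3) can be sketched in Python as follows. This is a deliberately simplified, hand-rolled type check, not how Snowplow validates events (that is done with full JSON Schema validation), and the `add_to_basket` fields and the generic `param_N` slots are hypothetical. It only shows that a specific schema can reject a badly typed value that a generic, all-strings schema lets through.

```python
# Map a few JSON Schema type names to Python types (simplified).
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def matches(value, json_type: str) -> bool:
    """Return True if value has the given JSON type (bool is not an integer)."""
    if isinstance(value, bool):
        return json_type == "boolean"
    return isinstance(value, JSON_TYPES[json_type])


def type_errors(properties: dict, data: dict) -> list:
    """List the fields in data that are unknown or have the wrong type."""
    errors = []
    for field, value in data.items():
        spec = properties.get(field)
        if spec is None:
            errors.append(f"{field}: not defined in schema")
        elif not matches(value, spec["type"]):
            errors.append(
                f"{field}: expected {spec['type']}, got {type(value).__name__}"
            )
    return errors


# Hypothetical event-specific schema: each field has a meaningful name and type.
specific_props = {
    "sku": {"type": "string"},
    "quantity": {"type": "integer"},
    "unit_price": {"type": "number"},
}

# Hypothetical universal schema: a fixed number of untyped string slots.
generic_props = {f"param_{i}": {"type": "string"} for i in range(1, 4)}

specific_errors = type_errors(
    specific_props, {"sku": "ABC-1", "quantity": "two", "unit_price": 9.99}
)
generic_errors = type_errors(
    generic_props, {"param_1": "ABC-1", "param_2": "two", "param_3": "9.99"}
)

print(specific_errors)  # the bad quantity is caught
print(generic_errors)   # nothing is caught; every slot is just a string
```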

#4

Thanks! Clickstream tracking (page or screen views, and actions on pages or screens) is going to be our main focus. We also have web and mobile applications that require custom contexts. Our goal is to share enterprise-wide schemas with everyone but keep the governance centralized.