Sending unstructured events + Schemas

Hi,

We are building a prototype for logging application errors with Snowplow.
The error data for this task is quite simple (about 5-8 fields).

I have set up the Scala Stream Collector and the Kinesis Stream Enrich process, and I am sending events
with the PHP tracker (unstructured events): $tracker->trackUnstructEvent($event_json)

My experience with all this is quite limited and I am still in the process of understanding the bigger picture.
I have several questions in that direction; maybe somebody can help me here.

1. QUESTION 1
I understand that I have to define a schema when tracking my data.
The schema is necessary for validating the data in the enrich process, and so that the storage process
knows where/how to save the event data?

2. QUESTION 2
Is there an easy way to set up a schema to use for unstructured events without using an Iglu server etc.? I was hoping I could reference a schema directly instead of uploading it somewhere?

3. QUESTION 3
Is there a good tutorial which shows how to set up a tracker using a custom schema and how to store the data in Elasticsearch or Redshift? Good documentation in that area is quite hard to find.

I would be very glad for some advice, thanks.

Hi @johnschmidt,

All good questions. I’ll try to quickly cover the important points:

1. QUESTION 1
I understand that I have to define a schema when tracking my data.
The schema is necessary for validating the data in the enrich process, and so that the storage process
knows where/how to save the event data?

Correct, the primary function of the schema is validation. Regarding storage, it depends on your warehouse. A future version of the Redshift loader is planned to auto-create everything it needs, but currently you'll also need to create a table manually and generate a jsonpaths file. Both can be done using igluctl's static generate method with the --with-json-paths flag.
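To make the validation role concrete: a tracker sends your error data inside a self-describing JSON envelope that references the schema by its Iglu URI, and Enrich validates the data part against that schema. Here's a minimal sketch, assuming a hypothetical com.acme/error_log schema with made-up field names (the toy validate function only stands in for the real JSON Schema validation Enrich performs):

```python
import json

# Hypothetical schema reference -- replace vendor/name/version with your own.
SCHEMA_URI = "iglu:com.acme/error_log/jsonschema/1-0-0"

def build_self_describing_event(data):
    """Wrap event data in the self-describing envelope that
    Snowplow trackers send for unstructured events."""
    return {"schema": SCHEMA_URI, "data": data}

# Toy stand-in for the checks Enrich derives from the JSON schema.
REQUIRED_FIELDS = {"message": str, "level": str, "code": int}

def validate(event):
    data = event["data"]
    return all(
        isinstance(data.get(field), typ)
        for field, typ in REQUIRED_FIELDS.items()
    )

event = build_self_describing_event(
    {"message": "DB timeout", "level": "error", "code": 504}
)
print(json.dumps(event))
print(validate(event))  # True
```

An event missing a required field (or with the wrong type) would fail validation and land in the bad stream rather than in storage.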

For Snowflake or BigQuery, all of this is handled under the hood so the schema is all that’s required.

2. QUESTION 2
Is there an easy way to set up a schema to use for unstructured events without using an Iglu server etc.? I was hoping I could reference a schema directly instead of uploading it somewhere?

The short answer is no. The long answer is that it's likely possible, but a lot more trouble than it sounds. The path of least resistance is to use Iglu Server. This also ensures that what you build is forward compatible with better functionality once it's released (e.g. auto-creation of tables).

If you’re dead set against an Iglu Server instance, the older (now deprecated) way is to host schemas in an S3 bucket. You’ll just need to adjust the Enrich component’s Iglu resolver file to suit.
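For reference, a static HTTP registry is added to the Enrich resolver configuration roughly like this (the bucket name, priority, and vendor prefix are placeholders for your own values):

```json
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": ["com.snowplowanalytics"],
        "connection": {"http": {"uri": "http://iglucentral.com"}}
      },
      {
        "name": "My static registry",
        "priority": 5,
        "vendorPrefixes": ["com.acme"],
        "connection": {"http": {"uri": "http://my-schema-bucket.s3.amazonaws.com"}}
      }
    ]
  }
}
```

Enrich tries registries in priority order (after checking vendor prefixes), so your own registry is consulted for your schemas while Iglu Central still serves the standard ones.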

3. QUESTION 3
Is there a good tutorial which shows how to set up a tracker using a custom schema and how to store the data in Elasticsearch or Redshift? Good documentation in that area is quite hard to find.

This is a fair observation - I don’t know if there’s a simple guide on this - we are working to improve the docs. Here’s a quick breakdown of the process - note that we recommend using Snowplow Mini as part of the testing process. It’s quick and easy to set up and provides a sandboxed testing environment (and actually, depending on what you need to do, your use case demo might be possible entirely within Mini).

  • Create a custom schema and validate it using igluctl’s lint method
  • Upload it to Snowplow Mini using igluctl
  • Send test events to Snowplow Mini (ideally via an instance of the tracker you’re using in a local environment) & iterate/fix issues
  • Generate jsonpaths and SQL files using igluctl’s static generate method with --with-json-paths
  • Create the Redshift table
  • Use igluctl’s static push method to upload schemas and jsonpaths files to Iglu for the main pipeline
  • Send some test events & verify that all is well
  • Put tracking live.
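The igluctl steps above look roughly like this on the command line (the schema folder, host, and API key are placeholders for your own values):

```
# Validate the schemas locally
igluctl lint schemas/

# Generate Redshift DDL and jsonpaths files
igluctl static generate schemas/ --with-json-paths

# Push schemas to Snowplow Mini first, then to the production Iglu Server
igluctl static push schemas/ http://<mini-or-iglu-host> <api-key>
```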

Hope that’s helpful.

JSON schema is used to validate data indeed. However, it is not correct to say “so that the storage process knows where/how to save the event data”. If you mean Redshift, JSONPaths are used to instruct Redshift on how to load the self-describing events and contexts (which are in JSON format after being shredded, as opposed to the TSV format of the atomic data).
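To illustrate, a jsonpaths file simply lists JSON paths in the same order as the target table's columns, so Redshift's COPY knows which field goes into which column. A sketch for a hypothetical shredded error_log type (the $.data.* field names are placeholders):

```json
{
  "jsonpaths": [
    "$.schema.vendor",
    "$.schema.name",
    "$.schema.format",
    "$.schema.version",
    "$.hierarchy.rootId",
    "$.hierarchy.rootTstamp",
    "$.hierarchy.refRoot",
    "$.hierarchy.refTree",
    "$.hierarchy.refParent",
    "$.data.message",
    "$.data.level",
    "$.data.code"
  ]
}
```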

The following post might clarify shredding and loading to Redshift further: https://github.com/snowplow/snowplow/wiki/StorageLoader#the-storageloader-role-in-etl-process.

There are quite a few tutorials in this forum that you can find. Here are some of them, as well as wiki posts, to start with.

Setting up an Iglu registry is very easy when it comes to the static server variant (there are different types to choose from). It’s just storage on the web where the files (JSON schemas) are accessed via standard HTTP. If we take AWS, just place the files in the appropriate folders in an S3 bucket and set the bucket to be accessible on the internet.
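Assuming you use the AWS CLI, publishing a local schema folder to such a static registry could look like this (the bucket name is a placeholder, and the folder must follow the registry layout, i.e. schemas/&lt;vendor&gt;/&lt;name&gt;/jsonschema/&lt;version&gt;):

```
# Sync the local schema registry layout to the bucket and make it readable
aws s3 sync schemas/ s3://my-schema-bucket/schemas/ --acl public-read
```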

The links I have already pointed to could be a good starting point. As Colm mentioned, Snowplow Mini is a good tool for testing your implementation of JSON schemas.

Thanks a lot for your input.
I managed to get a prototype up and running, saving the data to Elasticsearch
via the snowplow-es-http-loader.

One question I still have, though.
We are logging error and routing events right now which we don't really need to enrich/validate.
I understand that it is good practice to do that, but I was hoping I could skip the enrichment process
for now to reduce AWS cost and process complexity.
Is it possible to skip enrichment?
In my naivety I just changed the input stream of the ES loader from the enriched stream to the good stream from the collector; however, this doesn't seem to work - I then get exceptions in the ES loader.

@johnschmidt, raw and enriched events have different formats. You can skip enrichment and read just the raw data instead. However, you would need to build your own processing of such events. Raw events were never meant to be the final product. You also won't be able to load such data into Redshift without further transformation, which requires shredding, which in turn also relies on validation.

Ok understood, thanks!