All good questions. I’ll try to quickly cover the important points:
1. QUESTION 1
I understand that I have to define a schema when tracking my data. The schema is necessary for validating the data in the enrich process, and so that the storage process knows where/how to save the event data?
Correct, the primary function of the schema is validation. Regarding storage, it depends on your warehouse. A future version of the Redshift loader is planned to auto-create everything it needs, but currently you’ll also need to create a table manually and generate a jsonpaths file. Both can be done using igluctl’s `static generate` method.
For Snowflake or BigQuery, all of this is handled under the hood so the schema is all that’s required.
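For reference, a custom schema is a self-describing JSON Schema. A minimal sketch, using a hypothetical vendor (`com.acme`) and event name (`button_click`):

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for a hypothetical button_click event",
  "self": {
    "vendor": "com.acme",
    "name": "button_click",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "id": { "type": "string" },
    "elapsed_ms": { "type": "integer", "minimum": 0 }
  },
  "required": ["id"],
  "additionalProperties": false
}
```

The `self` block is what makes it resolvable: vendor, name, format, and version together form the schema’s Iglu URI (`iglu:com.acme/button_click/jsonschema/1-0-0`).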
2. QUESTION 2
Is there an easy way to set up a schema to use for unstructured events without using an Iglu server etc.? I was hoping I could reference a schema directly instead of uploading it somewhere?
The short answer is no. The long answer is that it’s likely possible but a lot more trouble than it sounds. The path of least resistance is to use Iglu Server. This also ensures that what you build is forward compatible with better functionality once it’s released (eg. auto-creation of tables).
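For context on why the schema needs to be hosted somewhere resolvable: the tracker doesn’t embed the schema in the event, it only references it by Iglu URI, and the pipeline looks the schema up at enrich time. A minimal stdlib-only sketch of what that payload envelope looks like (the `com.acme/button_click` vendor/name is a hypothetical example):

```python
import json

def make_self_describing_event(schema_uri: str, data: dict) -> str:
    """Wrap custom event data the way a Snowplow tracker does:
    an outer self-describing JSON pointing at the standard
    unstruct_event schema, with the custom event nested inside."""
    envelope = {
        "schema": "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0",
        "data": {
            "schema": schema_uri,  # your custom schema, referenced by URI only
            "data": data,          # the actual event properties
        },
    }
    return json.dumps(envelope)

payload = make_self_describing_event(
    "iglu:com.acme/button_click/jsonschema/1-0-0",
    {"id": "checkout", "elapsed_ms": 120},
)
```

Note that only the URI travels with the event, which is why Enrich must be able to resolve it from a registry.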
If you’re dead set against an Iglu Server instance, the outdated way is to host the schemas in an S3 bucket. You’ll just need to configure the Enrich component’s Iglu resolver file to suit.
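For the S3 route, the resolver entry would look roughly like this (bucket URL and vendor prefix are placeholders, and the exact shape may vary by resolver-config version):

```json
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "My S3 schema bucket",
        "priority": 0,
        "vendorPrefixes": ["com.acme"],
        "connection": {
          "http": {
            "uri": "http://my-schema-bucket.s3.amazonaws.com"
          }
        }
      }
    ]
  }
}
```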
3. QUESTION 3
Is there a good tutorial which shows how to set up a tracker using a custom schema and storing it to elasticsearch or Redshift? Good documentation in that area is quite hard to find.
This is a fair observation - I don’t know if there’s a simple guide on this - we are working to improve the docs. Here’s a quick breakdown of the process - note that we recommend using Snowplow Mini as part of the testing process. It’s quick and easy to set up and provides a sandboxed testing environment (and actually, depending on what you need to do, your use case demo might be possible entirely within Mini).
- Create a custom schema and validate it using igluctl’s `lint` method.
- Upload it to Snowplow Mini using igluctl.
- Send test events to Snowplow Mini (via an instance of the tracker you’re using in a local environment ideally) & iterate/fix issues.
- Generate jsonpaths and SQL files using igluctl’s `static generate` method.
- Create Redshift table
- Use igluctl’s `static push` method to upload schemas and jsonpaths files to Iglu for the main pipeline.
- Send some test events & verify that all is well
- Put tracking live.
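The igluctl steps above can be sketched as a few commands. Paths, the registry URI, and the API key are placeholders, so treat this as a shape of the workflow rather than something to paste verbatim:

```shell
# Validate the schema files before doing anything else
igluctl lint schemas/

# Generate Redshift DDL plus the matching jsonpaths files
igluctl static generate --with-json-paths schemas/

# Push schemas to the Iglu Server backing the main pipeline
# (registry URI and API key below are placeholders)
igluctl static push schemas/ http://your-iglu-server.example.com your-api-key
```

The generated SQL is what you run against Redshift to create the table; Snowflake and BigQuery users can skip the generate step entirely.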
Hope that’s helpful.