Writing Iglu clients


#1

I’m new to Snowplow and just exploring this stack. I’m mainly interested in the Iglu schema registry. I understand that I can set it up as a static site with a specific structure. However, my client stack is Python. Considering there is no Python Iglu client, I have a couple of questions:

  1. Is a Python Iglu client on your roadmap? If so, what is the timeline?
  2. If I were to write a client myself, is there any documentation I can follow on the features and design? For a static repo, REST calls are going to be the way to interact, but it seems like the client does more than just call a REST API for schema lookups. What about registering a new schema: what validation has to happen? Is there any additional metadata stored anywhere?

Any guidance will be appreciated.


#2

Hello @atharvai,

You’re exactly right about the static Iglu registry - there’s nothing more to it than a specific file structure.

  1. Unfortunately, we don’t have an Iglu Python Client yet, and it isn’t on our roadmap.
  2. Regarding documentation - there’s a lot of documentation about Iglu concepts on the wiki. Also, not long ago we wrote the Iglu Ruby Client, which I believe should be very illustrative for writing a client in another dynamically typed language. Besides, you can always ask here if you have any questions about implementation.

However, can I ask what exactly you need an Iglu Python Client for? Iglu clients are usually used independently of the user’s tech stack. If, for example, you use the Snowplow Python Tracker, you don’t need an Iglu Python Client: tracked events will be handled by our Spark job with its embedded Scala Iglu Client.


#3

Anton,

Thanks for the clear reply. Right now, I’m starting to look at the Snowplow Analytics pipeline as an alternative to a DIY solution. It looks like the Snowplow platform is very modular, and at least the schema registry (Iglu) can be deployed outside of the pipeline’s context. I’m evaluating whether it is feasible for me to deploy the registry without the rest of the pipeline, and what the available interfaces/tools are. The schema registry is a higher priority than the rest of the pipeline.

So, considering the standalone use of Iglu, I’d need some way of interacting with it from our stack. Python is the primary and currently only language we use for implementing our pipelines, including PySpark on EMR. I would like to understand what the level of effort is to write and maintain a homegrown client for something like Iglu, and, if we later choose to implement the Snowplow pipeline, what the effort is for integration.

I’m happy to chat offline if you wish.


#4

Depending on what features you are after from the Iglu repository, it may be easiest to use the static “codeless” repository that is just hosted on S3. You can very easily deploy this (and the other repositories, for that matter) without being tied to any other part of the pipeline, and use boto3 to interact with it.
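As a rough sketch of what that could look like, assuming the usual static registry layout of schemas/<vendor>/<name>/<format>/<version> inside the bucket. The function name, its parameters, and the bucket name in the usage line are hypothetical; the `client` argument exists only so that a stub can be passed in for testing.

```python
import json


def fetch_schema_from_s3(bucket, vendor, name, version, fmt="jsonschema", client=None):
    """Read one schema from a static Iglu registry stored in an S3 bucket.

    Assumes objects are stored under schemas/<vendor>/<name>/<format>/<version>.
    """
    if client is None:
        import boto3  # third-party; imported lazily so a stub client can be used instead

        client = boto3.client("s3")
    key = f"schemas/{vendor}/{name}/{fmt}/{version}"
    response = client.get_object(Bucket=bucket, Key=key)
    # The object body is the JSON Schema document itself.
    return json.loads(response["Body"].read())
```

Usage would be something like `fetch_schema_from_s3("my-iglu-bucket", "com.acme", "event", "1-0-1")`, with AWS credentials configured in the usual boto3 way.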

That said - I’d strongly recommend deploying the Iglu schema repository in tandem with the other Snowplow components for either batch or real time. They are highly complementary not just from a technical perspective but also a philosophical perspective (in terms of strictly typed, structured data).


#5

Hey @atharvai,

I’m actually glad to hear you’re considering using Iglu as a separate component. Although, as @mike noted, Snowplow and Iglu are coupled, we believe Iglu has a future as a very general-purpose technology, and we already have some evidence of people successfully using it even outside data/analytics applications, e.g. for documenting REST services. So it’s more that Snowplow is dependent on Iglu.

I’d estimate that a Python Iglu Client is not a very complex project. Most of the heavy lifting is done by third-party libraries: a JSON Schema validator and an HTTP client. With those in place, you’ll need to implement the following parts:

  1. Core parts like Iglu URIs (iglu:com.acme/event/jsonschema/1-0-1) and SchemaVer (1-0-2): parse, store and compare them. These should be small, self-explanatory classes; you can look for examples in the Ruby Client’s code.
  2. Parsing and validating the resolver configuration. This is the trickiest part, as you’ll need to implement an embedded registry; you can look for examples in the Ruby or Scala clients.
  3. a. Extracting the Iglu URI from a self-describing JSON
    b. Making an HTTP request for the JSON Schema
    c. Validating the data against the fetched schema
  4. Caching (optional)
  5. Authentication (optional, doesn’t work with a static registry)
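To give an idea of the size of item 1, here is a minimal sketch in Python. The class and method names are my own invention for illustration, not taken from any existing client, and the regular expressions are simplified compared to what the real clients accept. SchemaVer is treated as MODEL-REVISION-ADDITION and compared field by field.

```python
import re
from dataclasses import dataclass

# Simplified patterns (an assumption; the official clients may be stricter).
_SCHEMA_VER = re.compile(r"([0-9]+)-([0-9]+)-([0-9]+)")
_IGLU_URI = re.compile(
    r"iglu:(?P<vendor>[a-zA-Z0-9_.\-]+)/(?P<name>[a-zA-Z0-9_\-]+)/"
    r"(?P<format>[a-zA-Z0-9_\-]+)/(?P<version>[0-9]+-[0-9]+-[0-9]+)"
)


@dataclass(frozen=True, order=True)
class SchemaVer:
    """MODEL-REVISION-ADDITION; order=True compares the fields in that order."""
    model: int
    revision: int
    addition: int

    @classmethod
    def parse(cls, text):
        match = _SCHEMA_VER.fullmatch(text)
        if match is None:
            raise ValueError(f"not a valid SchemaVer: {text!r}")
        return cls(*(int(part) for part in match.groups()))

    def __str__(self):
        return f"{self.model}-{self.revision}-{self.addition}"


@dataclass(frozen=True)
class IgluUri:
    vendor: str
    name: str
    format: str
    version: SchemaVer

    @classmethod
    def parse(cls, uri):
        match = _IGLU_URI.fullmatch(uri)
        if match is None:
            raise ValueError(f"not a valid Iglu URI: {uri!r}")
        return cls(match["vendor"], match["name"], match["format"],
                   SchemaVer.parse(match["version"]))

    def path(self):
        """Relative path of the schema inside a static registry."""
        return f"schemas/{self.vendor}/{self.name}/{self.format}/{self.version}"
```

For example, `IgluUri.parse("iglu:com.acme/event/jsonschema/1-0-1").path()` gives the location to request from the registry, and two `SchemaVer` values can be compared with the ordinary `<` and `>` operators.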

These are the features of the simplest possible, but still entirely valid, Iglu client. I’ll be glad to help if you have any more questions.
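Items 3 and 4 could be sketched roughly as below, for a single static registry served over HTTP. This is only an illustration under assumptions: `SimpleResolver` and its methods are hypothetical names, the cache is an unbounded dict, there is no resolver configuration or embedded registry, and validation is delegated to the third-party `jsonschema` package (any JSON Schema validator could be swapped in). The `fetch` parameter is there so the HTTP call can be replaced in tests.

```python
import json
import urllib.request


class SimpleResolver:
    """Hypothetical minimal resolver for one static Iglu registry."""

    def __init__(self, registry_url, fetch=None):
        self.registry_url = registry_url.rstrip("/")
        self._fetch = fetch or self._http_get
        self._cache = {}  # Iglu URI -> parsed schema (item 4, unbounded)

    @staticmethod
    def _http_get(url):
        # Item 3b: plain GET against the static registry.
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8")

    def lookup(self, iglu_uri):
        """Return the schema for an Iglu URI, fetching it at most once."""
        if iglu_uri not in self._cache:
            if not iglu_uri.startswith("iglu:"):
                raise ValueError(f"not an Iglu URI: {iglu_uri!r}")
            url = f"{self.registry_url}/schemas/{iglu_uri[len('iglu:'):]}"
            self._cache[iglu_uri] = json.loads(self._fetch(url))
        return self._cache[iglu_uri]

    def validate(self, instance):
        """Validate a self-describing JSON, i.e. {"schema": ..., "data": ...}."""
        schema = self.lookup(instance["schema"])  # item 3a: extract the URI
        import jsonschema  # third-party; raises ValidationError on failure

        jsonschema.validate(instance["data"], schema)  # item 3c
```

A real client would also need to handle missing schemas and network errors, and would try several registries in priority order as described by the resolver configuration.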


#6

Mike, yup, I understand there are added benefits with the rest of the pipeline components, and I will be evaluating the rest of the components afterwards.


#7

Anton, thanks, this is a great help. I’ll let you know what I end up evaluating and any results.


#8

Hey @atharvai, did you end up writing a Python client for Iglu? I am at a point in my project where I need one…


#9

Hey, I did not. We haven’t started using Snowplow yet, hence the delay.


#10

thanks