Code Generation from Iglu Schemas


#1

Hi All,

Has anyone ever investigated generating client-side code/classes based on the Iglu schemas? We use our schemas across a variety of clients (Python, Java, Swift), and it would be very convenient to generate type-safe classes for our events and contexts instead of using raw JSON objects.

It looks like there are a few libraries out there that support code generation from JSON Schema (which Iglu uses), but each one targets a single language: http://json-schema.org/implementations.html#data-parsing

Thanks,
Russell


#2

It’s not code generation as such… But we’re looking at shortly releasing a Java SDK for Snowplow that will go with this nicely, so would definitely be interested in any developments in this area.


#3

@acgray any thoughts on the most useful form of this generator? A command line tool, a Gradle plugin, a Ruby gem, etc.? We are considering writing this generator ourselves and open-sourcing it, using templates for each supported language (much like swagger-codegen: https://github.com/swagger-api/swagger-codegen/tree/master/modules/swagger-codegen/src/main/resources). That would allow other users to contribute other languages.


#4

This is a really cool idea @rmelick-vida! Keep us posted on how you get on…


#5

Hey @rmelick-vida,

We would definitely consider something like this in the future, though we are not yet sure how to approach the problem. Here’s a very early draft: https://github.com/snowplow/iglu/issues/88

One particularly challenging problem in code generation is SchemaVer and keeping proper compatibility between classes.

  1. The name and model are part of the class name (or, alternatively, of the namespace), e.g. SomeEntity_1, to reflect that two distinct models of a single schema can be absolutely different and incompatible.
  2. How should we model revisions? E.g. some property had type string in 1-0-0, but became [string, integer] in 1-1-0. Unfortunately, these kinds of changes are always the most challenging in Iglu design decisions.
  3. Should we always generate classes only for the latest revision and addition?
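
To make point 1 concrete, here is a minimal sketch of treating the model as the boundary for a generated class. The SchemaVer class and its method names are hypothetical, written only for illustration; it assumes the MODEL-REVISION-ADDITION format shown above (e.g. 1-1-0) and does not answer questions 2 or 3.

```java
/** Hypothetical value type for a SchemaVer string such as "1-1-0" (MODEL-REVISION-ADDITION). */
final class SchemaVer {
    final int model;
    final int revision;
    final int addition;

    SchemaVer(int model, int revision, int addition) {
        this.model = model;
        this.revision = revision;
        this.addition = addition;
    }

    /** Parses "1-1-0"; rejects anything that is not three dash-separated integers. */
    static SchemaVer parse(String version) {
        String[] parts = version.split("-");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected MODEL-REVISION-ADDITION, got: " + version);
        }
        return new SchemaVer(
                Integer.parseInt(parts[0]),
                Integer.parseInt(parts[1]),
                Integer.parseInt(parts[2]));
    }

    /** Assumption of this sketch: versions map to the same generated class only if the model matches. */
    boolean sharesGeneratedClassWith(SchemaVer other) {
        return this.model == other.model;
    }
}
```

Under this assumption 1-0-0 and 1-1-0 would share one class while 2-0-0 would get its own, which is exactly why a type-widening revision like the one in point 2 is awkward: the shared class has to accommodate both shapes.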

#6

My Java brain says a Maven/Gradle/etc. plugin would be great, but the conventional pattern for these kinds of tools (Thrift, Avro, etc.) is to have the original schema in the same codebase (somewhere like src/main/thrift, for example), which obviously contradicts Iglu’s principle of a central, remote repository.

For me the ideal workflow, using Java as an example, would be: update Iglu schemas -> generate code -> package a library -> import library in dependent code. So a command line tool like protobuf’s protoc would make the most sense.

Could this eventually be part of igluctl? That seems like the most natural place for it to live.


#7

Hey @acgray,

"For me the ideal workflow, using Java as an example, would be: update Iglu schemas -> generate code -> package a library -> import library in dependent code."

That’s one approach - but remember that the schemas already exist in the registry, so the registry has enough information to build and host libraries that provide those schemas in a given language (think Maven per Anton’s link or PyPI). In other words, if the registry does the heavy lifting, then your consuming app just needs to pull in a library dependency…


#8

That’s exactly what I mean - I’d envisaged this code generation process happening as part of, for example, a CI job to validate schemas and push them to an S3 or Scala registry. All the more reason for this to reside in igluctl, as that’s where those processes already happen.


#9

@alex I feel like @acgray’s idea to host the packages outside of the Iglu registry is probably a good one. That way, we don’t need to implement all of the different repository protocols (Maven, PyPI, npm, etc.) inside the registry. Instead, users could use whatever tool they normally host their dependencies in (like Maven Central or Artifactory).

@anton I’ve also thought about those same questions. Here’s what I was thinking (using a Java example):

  • The vendor is used as the package name
  • The name is the first part of the class name
  • The major version number is the last part of the class name

Having the major version number in the class name would let us make backwards incompatible changes (like making a string field into a union of other types) without breaking the client code. I think that would have to be a major version increase (1-0-0 to 2-0-0) instead of a minor increase though. I suppose we could also put the entire version number into the class name, but that would require a lot of work on the client side as schemas evolved.
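
As a rough sketch of that naming rule, the helper below turns an Iglu URI into a fully qualified class name. The GeneratedClassName class and fromIgluUri method are made-up names for illustration, and the sketch assumes the usual iglu:vendor/name/format/version URI layout; it does no sanitising of vendor or schema names that are not valid Java identifiers.

```java
/** Hypothetical helper: vendor -> package, schema name + major version -> class name. */
final class GeneratedClassName {
    private static final String PREFIX = "iglu:";

    static String fromIgluUri(String igluUri) {
        if (!igluUri.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not an Iglu URI: " + igluUri);
        }
        // Expected layout after the prefix: vendor/name/format/version
        String[] parts = igluUri.substring(PREFIX.length()).split("/");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected vendor/name/format/version: " + igluUri);
        }
        String vendor = parts[0];
        String name = parts[1];
        // Only the major version (the first SchemaVer component) goes into the class name.
        String major = parts[3].split("-")[0];
        return vendor + "." + name + major;
    }
}
```

With this rule, 1-0-0 and 1-1-0 of the same schema resolve to the same class, so client code only has to change its imports when the major version changes.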

So, for an example schema (https://github.com/snowplow/iglu-example-schema-registry/blob/master/schemas/com.example_company/example_event/jsonschema/1-0-0), I would expect a class like the following:

package com.example_company;

import javax.annotation.Nonnull;
import javax.annotation.Nullable;
import java.time.OffsetDateTime;

public class example_event1 {
    private final String _exampleStringField;
    private final Long _exampleIntegerField;
    private final Double _exampleNumericField;
    private final OffsetDateTime _exampleTimestampField;

    public example_event1(@Nonnull String exampleStringField, 
                          @Nonnull Long exampleIntegerField, 
                          @Nullable Double exampleNumericField, 
                          @Nullable OffsetDateTime exampleTimestampField) {
        // TODO validation of property mins maxes, etc.
        _exampleStringField = exampleStringField;
        _exampleIntegerField = exampleIntegerField;
        _exampleNumericField = exampleNumericField;
        _exampleTimestampField = exampleTimestampField;
    }

    @Nonnull
    public String getExampleStringField() {
        return _exampleStringField;
    }

    @Nonnull
    public Long getExampleIntegerField() {
        return _exampleIntegerField;
    }

    @Nullable
    public Double getExampleNumericField() {
        return _exampleNumericField;
    }

    @Nullable
    public OffsetDateTime getExampleTimestampField() {
        return _exampleTimestampField;
    }
}


#10

FWIW our current approach in Java is to write immutable value classes and then rely on Immutables’ code generation to fill out the full class. We then wrap these value classes in a generic SelfDescribing<T> type which holds the Iglu schema reference. (This is part of the code we plan to release - stay tuned!) The neat thing about that is it can also be used with dynamic types like Gson’s JsonObject. That’s quite Java-specific, but the same approach could certainly be followed in other languages too.
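
For readers unfamiliar with the pattern, a wrapper of this kind might look roughly like the sketch below. This is not the API of the library mentioned above: the fields, accessors and constructor are assumptions, and a plain Map stands in for a dynamic type such as Gson’s JsonObject so the example has no dependencies.

```java
import java.util.Objects;

/** Hypothetical wrapper pairing a payload of any type with its Iglu schema reference. */
final class SelfDescribing<T> {
    private final String schema; // e.g. "iglu:com.example_company/example_event/jsonschema/1-0-0"
    private final T data;        // a generated value class, or a dynamic type

    SelfDescribing(String schema, T data) {
        this.schema = Objects.requireNonNull(schema, "schema");
        this.data = Objects.requireNonNull(data, "data");
    }

    String getSchema() {
        return schema;
    }

    T getData() {
        return data;
    }
}
```

Because T is unconstrained, the same wrapper can hold either a type-safe value class or something like a Map<String, Object>, and a serializer only needs to emit the schema and data fields side by side.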


#11

What about transpiling the JSON schemas into a protobuf and then using Protocol Buffers (protoc) to generate the code required?

The advantage of this approach is that you’d only need to write the transpiling logic once (probably in igluctl), the protobufs could live in version control as part of the CI/CD process and then you’d get Protobuf to generate the code for the required languages. You’d need to add a little bit of code in the trackers to serialize the object into JSON but I think Protobuf may already include some methods for doing this easily.
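
A very rough sketch of what the transpiling step could look like for flat schemas with primitive properties only. The ProtoSketch class, the type mapping and the field numbering scheme are all assumptions made for illustration; nothing like this exists in igluctl as far as this thread establishes, and constraints such as minimum, maxLength or required are simply dropped.

```java
import java.util.Map;

/** Hypothetical JSON Schema -> proto3 renderer for flat, primitive-only schemas. */
final class ProtoSketch {

    /** Assumed mapping from JSON Schema primitive types to proto3 scalar types. */
    static String protoType(String jsonSchemaType) {
        switch (jsonSchemaType) {
            case "string":  return "string";
            case "integer": return "int64";
            case "number":  return "double";
            case "boolean": return "bool";
            default:
                throw new IllegalArgumentException("No mapping in this sketch for: " + jsonSchemaType);
        }
    }

    /** Renders one proto3 message; fields are numbered from 1 in the map's iteration order. */
    static String toProtoMessage(String messageName, Map<String, String> propertyTypes) {
        StringBuilder out = new StringBuilder("message " + messageName + " {\n");
        int fieldNumber = 1;
        for (Map.Entry<String, String> property : propertyTypes.entrySet()) {
            out.append("  ")
               .append(protoType(property.getValue())).append(' ')
               .append(property.getKey()).append(" = ").append(fieldNumber++).append(";\n");
        }
        return out.append("}\n").toString();
    }
}
```

Passing an ordered map (such as a LinkedHashMap) keeps the output deterministic. A real transpiler would also need a way to keep field numbers stable as schemas evolve, since Protobuf identifies fields by number rather than by name.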


#12

Transpiling to Avro would work too - and might make more sense if Snowplow is moving to Avro anyway!

However, both Protobuf and Avro have far fewer features than JSON Schema. Working out a way to convert without losing information would take some thought.


#13

I imagine Protobuf and JSON Schema would be complementary rather than one replacing the other. You’re right that JSON Schema has quite a bit more functionality (particularly around validation). The trackers themselves only really need to know how to construct a valid payload, and then it’s up to the enricher to validate that (based on a JSON Schema) down the line.


#14

Following up on this, I’ve published an initial release of the aforementioned Java library!