We’re new to Snowplow and are very excited to dig into this project! Over the course of a week we have:
- gotten the Scala Stream Collector running within Kubernetes, publishing to a raw Google Cloud Pub/Sub topic
- gotten the BigQuery Mutator running within Kubernetes in listen mode against the types subscription
- gotten the Beam Enrich Dataflow job running in streaming mode, processing the raw events and publishing to an enriched topic
- gotten the BigQuery Loader Dataflow job running, processing the enriched events and inserting them into BigQuery
Great! We have our event data in BigQuery with relatively little pain. We would like to contribute back how to configure all of this to run within Kubernetes, but that’s a different post.
Our next step in our Snowplow adoption is to tap into the enriched event stream from within our applications, but I feel we are missing a core concept of the ETL process: how do we determine what the well-known fields of an event are?
The current SDKs (Scala, Python) just seem to have a hard-coded list of fields that are populated from the order of the tab-delimited enriched event?

Can anyone recommend how to inflate the enriched event back into a structured message, or point us in the right direction of where to start?
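To make the question concrete, here is a minimal sketch of the kind of transformation we mean, assuming the enriched event really is just an ordered, tab-delimited record: zip a hard-coded field list against the split line. The field names below are only an illustrative subset of the canonical enriched-event format, not the full list the SDKs use.

```python
# Illustrative subset of the canonical enriched-event field order;
# the real format has many more fields than shown here.
ENRICHED_FIELDS = [
    "app_id", "platform", "etl_tstamp", "collector_tstamp",
    "dvce_created_tstamp", "event", "event_id",
]

def inflate(tsv_line: str) -> dict:
    """Turn one enriched-event TSV line into a field-name -> value dict."""
    values = tsv_line.rstrip("\n").split("\t")
    # An empty string means the field was unset in the enriched event.
    return {name: (value or None) for name, value in zip(ENRICHED_FIELDS, values)}

# Example with a fabricated event line:
line = "my-app\tweb\t2023-01-01 00:00:00\t2023-01-01 00:00:01\t\tpage_view\tf81d4fae-7dec-11d0-a765-00a0c91e6bf6"
event = inflate(line)
```

This is obviously fragile: it only works if we know the authoritative field list and order, which is exactly what we’re unsure how to determine without hard-coding it ourselves.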