Hi, I’m trying to load snowplow data into Spark and build some analytical jobs.
The readme for the Scala SDK suggests loading data like this:
```scala
import com.snowplowanalytics.snowplow.analytics.scalasdk.json.EventTransformer

val events = input
  .map(line => EventTransformer.transform(line))
  .filter(_.isSuccess)
  .flatMap(_.toOption)

val dataframe = ctx.read.json(events)
```
At first sight this works great - the data is loaded with just a few lines of code, and the schema is known!
However, as far as I can see there are two drawbacks to this approach:
- All JSON needs to be scanned to infer the schema, which is not very performant
- The inferred schema depends on whatever attributes, contexts, etc. happen to be present in the dataset
Point 2 becomes a problem when you try to access data from one of the contexts: if the dataset happens to contain no events with that context, the field is never inferred into the schema, so the column does not exist and the Spark job fails.
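To illustrate what I mean (the context name here is just a made-up placeholder, not a real one from my pipeline):

```scala
// Works when at least one event in the batch carries the context,
// but throws an AnalysisException ("cannot resolve ...") when none do,
// because schema inference never saw the column.
val withContext = dataframe.select("contexts_com_acme_my_context_1")
```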
Has anyone run into the same issue, and how did you solve it? My feeling is that the contexts should be parsed using a known schema somehow.
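Something along these lines is what I have in mind - a hand-written StructType passed to the reader, so the schema is fixed up front (the field and context names below are placeholders, not the real enriched-event schema):

```scala
import org.apache.spark.sql.types._

// Hypothetical hand-written schema covering only the fields this job needs;
// the real enriched event has many more fields than shown here.
val eventSchema = StructType(Seq(
  StructField("event_id", StringType, nullable = false),
  StructField("collector_tstamp", StringType, nullable = true),
  StructField("contexts_com_acme_my_context_1",
    ArrayType(StructType(Seq(
      StructField("some_field", StringType, nullable = true)
    ))),
    nullable = true)
))

// Passing an explicit schema skips the inference scan over the data,
// and the context column is present (as null) even when no event has it.
val dataframe = ctx.read.schema(eventSchema).json(events)
```

That would also address point 1, since no inference pass over the JSON is needed - but maintaining such a schema by hand feels brittle, so I'd love to hear how others handle this.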