Loading enriched events into IBM dashDB


#1

As part of our drive to make the Snowplow community more collaborative and widen our network of open source contributors, we will be regularly posting our proposals for new features, apps and libraries under the Request for Comments section of our forum. You are welcome to post your own proposals too!

Currently Snowplow can load enriched events into the following storage targets:

  • Amazon S3
  • Amazon Redshift (batch pipeline only)
  • Elastic (real-time pipeline only)
  • Postgres (real-time pipeline only; no shredding)

This is an RFC for adding support for loading Snowplow enriched events into IBM Bluemix’s dashDB dashdb database. This work is a collaboration with Snowplow partner (and IBM partner) [Ironside Group] ironside.

This RFC should act as a template for developers considering extending Snowplow support to other relational databases such as Azure SQL Data Warehouse or Vertica.

IBM dashDB

dashDB is a hosted database available as part of IBM’s Bluemix platform. It is based on IBM DB2, and can be used for transactional or analytical workloads. Here is the overall architecture, courtesy of Ironside Group:

In terms of our existing storage targets, dashDB is most similar to Amazon Redshift. Here is a broad feature comparison with Redshift:

Proposed integration

There are two phases to this integration:

  1. Support loading just the enriched event (atomic.events in Redshift parlance) into dashDB
  2. Support shredding all self-describing events and contexts into dedicated tables in dashDB

The second phase, shredding, is significantly more complex than the first phase - we will break this down into sub-phases below. But first, let’s look at loading just the enriched events.

Phase 1. Loading enriched events

Today, we load the atomic.events table in Redshift from a TSV file format which is generated by our [Scala Hadoop Shred] scala-hadoop-shred job running on EMR.

We are in luck - testing has revealed that this TSV file format is compatible with dashDB, meaning that we can simply load the output of Scala Hadoop Shred into dashDB.

Therefore, the steps for loading enriched events into dashDB are as follows:

  1. Port our Redshift DDL for atomic.events to dashDB’s DDL syntax (#2404)
  2. Write a Hadoop job, Hadoop dashDB Loader, to take the output of Scala Hadoop Shred on HDFS and upload it to dashDB (#2952)
  3. Update [EmrEtlRunner] emr-etl-runner to add the Hadoop dashDB Loader to the EMR jobflow for each dashDB storage target in our config.yml (#2405)

In fact Ironside Group’s Steven Hu has already provided an initial PR (#2908) for this first phase of work.

Phase 2. Shredding all self-describing JSONs

The goal here is that all self-describing events and contexts in Snowplow enriched events can be “shredded” into dedicated tables in dashDB.

This phase is more complex, especially as it coincides with ongoing work to improve and restructure Iglu and our existing Redshift shredding technology.

However, the broad steps are as follows:

  1. Add support for dashDB DDL syntax into Iglu’s [Schema DDL] schema-ddl library
  2. Add the ability to generate dashDB artifacts for an Iglu schema registry via Igluctl igluctl
  3. Add dashDB artifacts for all schemas in [Iglu Central] iglu-central
  4. Extend Scala Hadoop Shred to support dashDB
  5. Extend the Hadoop dashDB Loader to support loading shredded entities into dedicated dashDB tables

We will cover each of these in the following sub-sections.

Please note that in the spirit of encouraging discussion and debate, we have not created tickets for the following work steps yet.

2.1 Adding dashDB support to Schema DDL

As part of our Iglu project we have a Scala library called [Schema DDL] schema-ddl which models Redshift DDL (CREATE TABLE, ALTER TABLE etc) as a Scala AST. This library is used to generate Redshift DDL for the tables corresponding to a Snowplow user’s self-describing events and custom contexts.

We need to extend Schema DDL to support the dashDB DDL syntax.

2.2 Adding dashDB support to Igluctl

Today, the Iglu CLI tool Igluctl is the primary way that Snowplow users prepare their Iglu schema registries. Provided with a set of JSON Schemas, Igluctl can generate corresponding Redshift DDL for these schemas, plus JSON Paths files to map the schemas to the tables.

We need to update Igluctl so that it can generate corresponding dashDB DDL and JSON Paths files for a set of JSON Schemas.

These new artifacts will be used to load Snowplow self-describing events and contexts into dashDB.

2.3 Adding dashDB artifacts to Iglu Central

In the same way that Igluctl will be used by Snowplow users to generate dashDB artifacts for their own JSON Schemas, we will pre-generate equivalent artifacts for all public JSON Schemas in Iglu Central.

This will allow any Snowplow user to load any of our standard events and contexts into dashDB.

2.4 Extending Scala Hadoop Shred to support dashDB

Scala Hadoop Shred is a Hadoop job (written in Scalding) which takes Snowplow enriched events and “shreds” out all of the self-describing JSONs (an event plus contexts) into a custom folder structure, ready for StorageLoader to load.

At the moment Scala Hadoop Shred is completely orientated around Redshift - no other database is supported. As part of this work step, we will update Scala Hadoop Shred so that it can output dashDB-compatible shredded events and contexts.

2.5 Extending Hadoop dashDB Loader to load shredded entities

After Phase 1, we will have a Hadoop dashDB Loader which can load only enriched events into dashDB.

As the final step in Phase 2, we will extend Hadoop dashDB Loader to also load the self-describing events and contexts into dashDB.

An important design point is that dashDB does not support the powerful COPY from JSON feature of Redshift, and so the use of JSON Paths files to map the shredded JSONs onto flat table structures will have to be moved upstream, into Scala Hadoop Shred. This is a move we are planning for loading Redshift anyway.

With this final work step complete, dashDB will be fully integrated into Snowplow - noting certain outstanding design questions, below.

Outstanding design questions

Column encodings

With Redshift we have iterated the column encodings of our tables over several years, with the aim of minimising storage and query costs. We have had fantastic assistance from the Snowplow community in this.

With dashDB, we will want to quickly come to a view on the best compression and column encoding options for Snowplow.

Indexes and keys

With Redshift we are careful to use the same dist key and sort keys between our parent enriched events table and our shredded child tables (contexts etc). This ensures the fastest possible performance when JOINing these back together.

We will need to investigate and find equivalently performant indexing and key options for dashDB. We will likely only start this during Phase 2 of the work.

Conclusion

This RFC sets out our joint proposal with Ironside Group for adding full support for IBM’s dashDB to the Snowplow batch pipeline.

Although the first phase of work is nearing completion, much of this integration is still to be designed, and we would welcome any and all feedback on any aspect of this plan.

Equally, if you have an interest in using dashDB with Snowplow, do please let us know in the comments! This will be helpful for our planning.

REQUEST FOR COMMENTS

As always, please do share comments, feedback, criticisms, alternative ideas in the thread below.


#2

@alex, I have a concern about the point 2.5. If we are to move the JSONpaths mapping upstream (into Scala Hadoop Shred) what would be the recovery steps in the case where the JSONPaths were incorrect (caused by, say, event/context update)?

I suppose that the shredding step will fail due to validation error (JSONPaths/JSON schema mismatch). Will that implies rerunning the pipeline job from EMR (enrichment step) or there will be some kind of easier way to rerun from shredding instead (post JSONPaths fix)? As per current workflow, this mismatch would result in failure at data load (not during EMR).


#3

Great question @ihor. As part of moving JSON Paths upstream, we will be auto-generating those JSON Paths files, instead of expecting them to be client-generated and client-maintained. So this process should become much less fragile.

But you are right - we need to make recovery from the shredding step much easier. You shouldn’t have to delete and re-run if Hadoop Shred fails for any reason (and it will inevitably fail more frequently when it has more complex work to do.) There is a ticket for this here: