How to enrich events with MySQL data using the SQL Query Enrichment [tutorial]


#1

The SQL Query Enrichment lets you perform dimension widening on a Snowplow event via a SQL query against a JDBC-capable relational database.

This is an introductory tutorial to help you get started with this enrichment. Our example use case is deliberately simple but it should be enough to showcase the usage of the enrichment and give you an idea on how you could utilize it in real-world scenarios.

Business goal

Let’s assume that we have a web form containing an email address field. Updates to the form, including form submissions, all result in an event being sent to Snowplow.

We would like to enrich these events with some details related to the owner of the email address; you could say that we want to dimension widen the event using the email address as a key. We are in luck - our company operates an in-house CRM system which stores data in its own MySQL database.

Let’s start by understanding this data we want to join.

Part 1: Understanding our data for joining

Our company’s database has a table called users which contains records for almost all relevant email addresses. This is the query that was used to create the users table in MySQL:

CREATE TABLE users (
    id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    firstname VARCHAR(30) NOT NULL,
    lastname VARCHAR(30) NOT NULL,
    email VARCHAR(50) UNIQUE,
    reg_date TIMESTAMP
);

The SQL Query Enrichment is quite powerful - it can support retrieving data from a join across multiple tables. In our case, however, all the data we are after is contained within this single table - here’s a row:

SELECT firstname, lastname, email, reg_date
FROM users
WHERE email = 'pupkin@mail.ru';

Returning:

+-----------+----------+----------------+---------------------+
| firstname | lastname | email          | reg_date            |
+-----------+----------+----------------+---------------------+
| Ivan      | Pupkin   | pupkin@mail.ru | 2003-05-12 04:34:11 |
+-----------+----------+----------------+---------------------+
1 row in set (0.00 sec)

So, if our web form provides us with the user’s email address, we can use this email address as the key (email) to retrieve data from this MySQL users table, and attach this data to our events.

Part 2: Generating our source events

In an ideal world, perhaps our form submissions would fire a custom self-describing (unstructured) event, containing the email address in a dedicated field.

To keep things simple for this tutorial, however, let’s take advantage of existing Snowplow JavaScript Tracker functionality, specifically Form Tracking.

Unfortunately, we can’t use Form Tracking’s submit_form event, because the data obtained from all of the form’s fields are populated into a single value field. This is not something we can work with yet.

Instead, we’ll use the change_form event, which is fired for every changed form field and populates the value of the changed field in its value field, per the JSON schema for this event.

Note that if we had lots of fields in our form, we could trigger many fruitless lookups of non-email fields in our users table; for the purposes of this tutorial let’s assume our field consists only of our email field.

We enable the form tracking on the web page like so:

<script type="text/javascript">
window.snowplow('newTracker', 'clj', 'pweb-clj.us-west-2.elasticbeanstalk.com', { // Initialise a tracker
  appId: 'tom-tk'
});
window.snowplow('enableFormTracking');
</script>

This enables both submit_form and change_form event tracking. The change_form event will end up in its dedicated table in Redshift post processing, called com_snowplowanalytics_snowplow_change_form_1.

Great, we now know the data that we are looking up, and where our source events are coming from; next let’s put together a schema to store the data we are going to look up.

Part 3: Preparing our schema

Snowplow will pass change_form events to the SQL Query Enrichment to look up the email address in each event against our users table in MySQL. If a match is found, the matching record from MySQL will be converted into JSON and added as an additional context to the derived_contexts field of the enriched event.

For this to go smoothly, we need to define the JSON Schema for our MySQL record. Basing this closely on the underlying users MySQL table, here is our JSON Schema:

{
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "description": "Schema for a Properweb user",
    "self": {
        "vendor": "ca.properweb",
        "name": "user",
        "format": "jsonschema",
        "version": "1-0-0"
    },
    "type": "object",
    "properties": {
        "firstname": {
          "type": "string"
        },
        "lastname": {
          "type": "string"
        },
        "email": {
          "type": "string"
        },
        "reg_date": {
          "type": "string”,
                  "format": "date-time"
        }
    },
    "minProperties":1,
    "required": ["email"],
    "additionalProperties": false
}

In our Iglu schema registry for Snowplow, this would be stored as the file:

 schemas/ca.properweb/user/jsonschema/1-0-0

Now, we are going to use igluctl to generate the corresponding Redshift and JSON Paths file to allow us to load these contexts into Redshift.

We run the following command:

$ igluctl static generate --with-json-paths schemas/

On successful run, the output is like the one below:

File [/Users/ihor/workshop/tutorial/sql-enrichment/./sql/ca.properweb/user_1.sql] was written successfully!
File [/Users/ihor/workshop/tutorial/sql-enrichment/./jsonpaths/ca.properweb/user_1.json] was written successfully!

Let’s take a look at the generated files. First, here is the Redshift DDL to create the table for our looked-up users:

$ cat sql/ca.properweb/user_1.sql
-- AUTO-GENERATED BY igluctl DO NOT EDIT
-- Generator: igluctl 0.1.0
-- Generated: 2016-07-06 15:28

CREATE SCHEMA IF NOT EXISTS atomic;

CREATE TABLE IF NOT EXISTS atomic.ca_properweb_user_1 (
    "schema_vendor"  VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "schema_name"    VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "schema_format"  VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "schema_version" VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "root_id"        CHAR(36)      ENCODE RAW       NOT NULL,
    "root_tstamp"    TIMESTAMP     ENCODE LZO       NOT NULL,
    "ref_root"       VARCHAR(255)  ENCODE RUNLENGTH NOT NULL,
    "ref_tree"       VARCHAR(1500) ENCODE RUNLENGTH NOT NULL,
    "ref_parent"     VARCHAR(255)  ENCODE RUNLENGTH NOT NULL,
    "email"          VARCHAR(4096) ENCODE LZO       NOT NULL,
    "firstname"      VARCHAR(4096) ENCODE LZO,
    "lastname"       VARCHAR(4096) ENCODE LZO,
    "reg_date"       TIMESTAMP     ENCODE LZO,
    FOREIGN KEY (root_id) REFERENCES atomic.events(event_id)
)
DISTSTYLE KEY
DISTKEY (root_id)
SORTKEY (root_tstamp);

COMMENT ON TABLE atomic.ca_properweb_user_1 IS 'iglu:ca.properweb/user/jsonschema/1-0-0';

Next, the JSON Paths file which we’ll use to load records into the above table:

$ cat jsonpaths/ca.properweb/user_1.json
{
    "jsonpaths": [
        "$.schema.vendor",
        "$.schema.name",
        "$.schema.format",
        "$.schema.version",
        "$.hierarchy.rootId",
        "$.hierarchy.rootTstamp",
        "$.hierarchy.refRoot",
        "$.hierarchy.refTree",
        "$.hierarchy.refParent",
        "$.data.email",
        "$.data.firstname",
        "$.data.lastname",
        "$.data.reg_date"
    ]
}

Okay great - we have prepared all the artefacts needed to join records from our MySQL users table into our event stream.

We’re now ready to write the JSON configuration for our SQL Query Enrichment.

Part 4: Configuring our enrichment

Let’s call the configuration file sql_query_enrichment_config.json, and make sure it is available in the folder of enrichments passed into Snowplow.

The wiki documentation for the SQL Query Enrichment gives a good overview of the various fields required - I’ll jump straight to the full configuration for this tutorial and then explain it below:

{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/sql_query_enrichment_config/jsonschema/1-0-0",
  "data": {
    "name": "sql_query_enrichment_config",
    "vendor": "com.snowplowanalytics.snowplow.enrichments",
    "enabled": true,
    "parameters": {
      "inputs": [
        {
          "placeholder": 1,
          "json": {
            "field": "unstruct_event",
            "schemaCriterion": "iglu:com.snowplowanalytics.snowplow/change_form/jsonschema/1-0-0",
            "jsonPath": "$.value"
          }
        }
      ],
      "database": {
        "mysql": {
          "host": "sqlenh.clk45ymgndgz.us-west-2.rds.amazonaws.com",
          "port": 3306,
          "sslMode": false,
          "username": "ihor",
          "password": "cherkas1",
          "database": "sqlenh"
        }
      },
      "query": {
        "sql": "SELECT firstname, lastname, email, reg_date FROM users WHERE email = ? LIMIT 1"
      },
      "output": {
        "expectedRows": "AT_MOST_ONE",
        "json": {
          "schema": "iglu:ca.properweb/user/jsonschema/1-0-0",
          "describes": "EVERY_ROW",
          "propertyNames": "AS_IS"
        }
      },
      "cache": {
        "size": 3000,
        "ttl": 60
      }
    }
  }
}

This file has to pass validation against the corresponding schema in Iglu Central. There are four sections to this configuration file:

  1. inputs specifying the datapoint from the Snowplow event to use as the key when performing the API lookup
  2. database defining how the enrichment can access our company’s MySQL database
  3. query defining how the enrichment should query our MySQL database
  4. output allowing us to tune how we convert the returned row(s) into one or more self-describing JSONs ready to be attached to our Snowplow event

Let’s look at each in turn.

input

Each field within a Snowplow enriched event is either a pojo if the datapoint comes from the Snowplow enriched event POJO, or json if the datapoint comes from a self-describing JSON inside one of the three JSON fields, namely contexts, unstruct_event or derived_contexts.

The SQL Query Enrichment can use any field from the enriched event can be used as a key to query the database, that is dimension widen with the SQL Query Enrichment.

In our scenario, the value field of the change_form event can be found in the unstruct_event field of the enriched event. We can express this in our configuration using the following coordinates:

"inputs": [
  {
    "placeholder": 1,
    "json": {
      "field": "unstruct_event",
      "schemaCriterion": "iglu:com.snowplowanalytics.snowplow/change_form/jsonschema/1-0-0",
      "jsonPath": "$.value"
    }
  }
]

What we are saying here is:

  • The type of the datapoint is json (as opposed to pojo)
  • The JSON is located in unstruct_event field of the Snowplow enriched event TSV file
  • The JSON is self-described by the JSON Schema located at iglu:com.snowplowanalytics.snowplow/change_form/jsonschema/1-0-0
  • The parameter we need can be accessed at the $.value JSON Path

The placeholder parameter will be explained later on.

database

This section is self-explanatory. Our database is MySQL; you could equally try this out with PostgreSQL or Microsoft SQL Server.

"database": {
      "mysql": {
        "host": "sqlenh.clk45ymgndgz.us-west-2.rds.amazonaws.com",
        "port": 3306,
        "sslMode": false,
        "username": {{USERNAME}},
        "password": {{PASSWORD}},
        "database": {{DATABASE}}
      }
    }

We strongly recommend that the username have minimal read-only permissions on just the entities required to execute the SQL query.

If your database server has additional authentication in place, such as IP whitelisting, you will need to configure this security to permit access from all of your servers running the Snowplow Enrichment process.

query

This is where the placeholder parameter from the last section comes into play. The ? symbol is a placeholder for the key(s) defined in the input section. The query is implemented as a prepared statement: a precompiled query used to execute the same or similar database statements repeatedly with high efficiency. It takes the form of a template into which certain constant values are substituted during each execution.

"query": {
      "sql": "SELECT firstname, lastname, email, reg_date FROM users WHERE email = ? LIMIT 1"
    }

In our example, we have only one placeholder, but there could be more than one; you would then have additional entries in the inputs array.

Some further notes on the behaviour of sql:

  • If a placeholder index required in the sql prepared statement is not found when searching the enriched event for the inputs, then the query will not proceed, but this will not be flagged as a failure
  • A final ; is optional
  • This enrichment makes no attempt to sanitize the SQL statement, nor to verify that the SQL statement does not have harmful side effects (such as SQL injection)

output

The enrichment adds the returned row(s) into the derived_contexts field within a Snowplow enriched event. Because all JSONs in the derived_contexts field must be self-describing JSONs, we use the schema field to specify the Iglu schema URI of the schema that describes the new contexts (which we showed you earlier in the section Phase 2: Schema preparation).

"output": {
      "expectedRows": "AT_MOST_ONE",
      "json": {
        "schema": "iglu:ca.properweb/user/jsonschema/1-0-0",
        "describes": "EVERY_ROW",
        "propertyNames": "AS_IS"
      }
    }

The expectedRows enum can take the following values:

  • EXACTLY_ONE - exactly one row is expected. 0 or 2 or more rows will throw an error, causing the entire event to fail to process
  • AT_MOST_ONE - either one or zero rows is expected. 2 or more rows will throw an error
  • AT_LEAST_ZERO - between 0 and N rows are expected - i.e. we are dealing with an array of results
  • AT_LEAST_ONE - between 1 and N rows are expected - i.e. a non-empty array of results. 0 rows will throw an error

For our purposes, we use AT_MOST_ONE because we are expecting to find either a single matching user in our MySQL users table or no match.

The describes property dictates whether the schema is the self-describing schema for all rows returned by the query, or whether the schema should be attached to each of the returned rows:

  • ALL_ROWS means that the schema should encapsulate all returned rows - i.e. one context will always be added to derived_contexts, regardless of how many rows that schema contains. Each returned row will be an array element within the self-describing JSON
  • EVERY_ROW means that the schema should be attached to each returned row - so e.g. if 3 rows are returned, 3 contexts with this same schema will be added to derived_contexts

In our case, the JSON Schema we created represents a single user, not an array of users, so we use EVERY_ROW.

The propertyNames property supports reformatting of the returned columns to fit the JSON Schema’s conventions better. Supported options are:

  • AS_IS - preserve the column names exactly as they are
  • CAMEL_CASE - so date_of_birth becomes dateOfBirth
  • PASCAL_CASE - so date_of_birth becomes DateOfBirth
  • SNAKE_CASE - so dateOfBirth becomes date_of_birth
  • LOWER_CASE - changes all characters to lowercase
  • UPPER_CASE - changes all characters to uppercase

If these options aren’t bespoke enough, remember that you can use column aliasing in your SQL statement to tweak individual column names.

For the purposes of our tutorial, the regular MySQL column names are fine.

##cache

A Snowplow enrichment can run many millions of time per hour, effectively launching a DoS attack on a data source if we are not careful! The cache configuration attempts to minimize the number of lookups performed.

The cache is an LRU (Least Recently Used) cache, where less frequently accessed values are evicted to make space for new values. The enrichment uses the results of the query as the cache keys. You can configure the cache as follows:

  • size is the maximum number of entries to hold in the cache at any one time
  • ttl is the number of seconds that an entry can stay in the cache before it is forcibly evicted. This is useful to prevent stale values from being retrieved in the case that your API can return different values for the same key over time

Part 5: Running and testing our enrichment

Thanks for sticking with this tutorial so far - we are now ready to put our enrichment into action.

To recap, we are dimension widening our Snowplow enriched events by using the value property in our change_form self-describing events to look up rows from a users table we are keeping in MySQL.

The email address expressed in the value property in our change_form will be used as the key to query the MySQL database during the Snowplow Enrichment process. The new data we want to obtain could be retrieved with the following SQL query:

SELECT firstname, lastname, reg_date 
FROM users WHERE email = {{ACTUAL_EMAIL_ADDRESS}} 
LIMIT 1

We do not expect more than one record to be returned because each email address should be unique within the users table. The new data will be added as a self-describing JSON to the derived_contexts array within our event.

We put our SQL Query Enrichment configuration live in our Snowplow installation and wait for a successful run (if batch) or for a few minutes to pass (if real-time).

Looking in Redshift, we are expecting to see the change_form event in the atomic.events table, the actual value for the captured email address in the atomic.com_snowplowanalytics_snowplow_change_form_1 table and finally the new (enriched) data obtained from our MySQL database in the atomic.ca_properweb_user_1 table.

We should be able to link the last 2 to the atomic_events table by means of the event_id = root_id relation:

SELECT e.collector_tstamp, f.value AS email, u.firstname, u.lastname, u.reg_date 
FROM atomic.ca_properweb_user_1 AS u
INNER JOIN atomic.com_snowplowanalytics_snowplow_change_form_1 AS f
ON u.root_id = f.root_id
INNER JOIN atomic.events AS e
ON e.event_id = f.root_id
LIMIT 2

Here are my results:

  collector_tstamp   |       email        | firstname | lastname  |      reg_date       
---------------------+--------------------+-----------+-----------+---------------------
 2016-06-27 15:07:04 | ihortom@eircom.net | Ihor      | Tomilenko | 2016-06-21 22:25:00
 2016-06-27 17:12:33 | pupkin@mail.ru     | Ivan      | Pupkin    | 2003-05-12 04:34:11
(2 rows)

And there we have it: the data from our MySQL database attached to our Snowplow enriched events.

We hope you enjoyed this tutorial on dimension widening your Snowplow events with your own proprietary data. Also, don’t forget to check out our companion tutorial:

Finally, as always, do please give this a go and share your experience and use cases with us in the comments.


Tutorial ideas for the upcoming SQL Query Enrichment?