Integrating Clearbit data into Snowplow using the API Request Enrichment [tutorial]


#1

Introduction

Snowplow’s new API Request Enrichment lets us perform dimension widening on an incoming Snowplow event using any internal or external HTTP-based API.

This tutorial walks you through a practical application of the enrichment, fetching data from the HTTP API of business data service Clearbit. This integration is running in production at Snowplow!

Ready? Let’s get started.

Business goal

The Snowplow website has various lead forms (one example) designed to capture interest in Snowplow services from prospective customers. When these forms are submitted, the lead information is posted to Snowplow as an event.

These events are defined by schemas stored in Snowplow’s internal Iglu schema registry. Here is an example lead event:

{
  "schema": "com.snowplowanalytics.snowplow-website/trial_form_submitted/jsonschema/2-0-0",
  "data": {
    "leadSource": "Trial Form Batch",
    "firstName": "Alexander",
    "lastName": "Dean",
    "email": "alex+discourse@snowplowanalytics.com",
    "company": "Snowplow Analytics Ltd",
    "eventsPerMonth": "< 1M",
    "suggestedCallDate": null
  }
}

The trial_form_submitted event is great as far as it goes, but it would be even more useful if it contained background information on the individual and their company. What is our prospect’s job title? What city is she in? How big is the company she works for?

Enter Clearbit: a business data startup offering a service called Enrichment:

Send us an email or website to instantly surface actionable data points like location, job title, company size and technology used.

With the API Request Enrichment, we can access Clearbit’s Enrichment data via HTTP API and augment our trial_form_submitted event with background information about our prospect and their company.

Solution design

This diagram sets out our planned integration with Clearbit:

Inputs

We will use the email field within our trial_form_submitted event as the field to lookup in Clearbit’s Enrichment API. The full schema key of our trial_form_submitted event is:

com.snowplowanalytics.snowplow-website/trial_form_submitted/jsonschema/2-0-0

There is also a 1-0-0 version of the schema - in fact both versions contain the email field.

The API request

Because our API Request Enrichment can only perform one HTTP request per event, we will use Clearbit’s Combined API:

A common use-case is looking up a person and company simultaneously based on a email address. To save you making two requests to do this, we offer a combined lookup API.

This endpoint expects an email address, and will return an object containing both the person and company (if found). A call to the combined lookup will only count as one API call.

Outputs

Clearbit will return a JSON containing sub-objects for both the Person and the Company.

This is a great fit for Snowplow’s support for contexts. We will augment our trial_form_submitted event with two additional contexts:

  1. A representation of Clearbit’s Person model
  2. A representation of Clearbit’s Company model

All clear? Let’s move onto the implementation.

Phase 1: Clearbit signup

We’ll need to signup to Clearbit to integrate their data - head over to their signup form and fill in your details (no need to provide a credit card).

Once you have logged in, click on API Keys in the left-hand menu to retrieve your API key:

Hold on to this as we will need this later.

Phase 2: Schema preparation

We are going to be augmenting our lead events with two new contexts, one each for Clearbit’s Person and Company models - this means creating two new schemas for Snowplow. This is the most complex part of the setup - let’s get started.

Two new schemas for Iglu Central

Snowplow requires all events and contexts to be defined as self-describing JSONs, referencing a schema available from an Iglu schema registry.

Therefore, we will need to define two JSON Schemas - let’s call these:

  1. com.clearbit.enrichment/person/jsonschema/1-0-0
  2. com.clearbit.enrichment/company/jsonschema/1-0-0

Because these two schemas could be useful to other Snowplow and Iglu users, it makes sense to submit them to Iglu Central for wider use. If the schemas related to a proprietary or in-house data source (as yours may do), you would keep them in you company’s private Iglu registry.

Deriving our schemas

I’m going to use Schema Guru to generate my schemas and associated Redshift tables and JSON Paths files.

First I saved locally the Clearbit example JSONs for both Person and Company models, found in their API documentation under the “Shell” tab in the right-hand pane:

$ head -4 examples/*.json
==> examples/company.json <==
{
  "id": "027b0d40-016c-40ea-8925-a076fa640992",
  "name": "Uber",
  "legalName": "Uber, Inc.",

==> examples/person.json <==
{
  "id": "d54c54ad-40be-4305-8a34-0ab44710b90d",
  "name": {
    "fullName": "Alex MacCaw",

First let’s download Schema Guru onto a machine with a JVM available:

$ wget https://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.6.1.zip
$ unzip schema_guru_0.6.1.zip

Now we’ll prepare some folders and then run Schema Guru on both sample JSONs:

$ mkdir -p schemas/com.clearbit.enrichment/person/jsonschema
$ mkdir -p schemas/com.clearbit.enrichment/company/jsonschema
$ ./schema-guru-0.6.1 schema examples/person.json --vendor com.clearbit.enrichment --name person --no-length --output schemas/com.clearbit/person/jsonschema/1-0-0
Possibly duplicated keys found:
avatar: avatars
$ ./schema-guru-0.6.1 schema examples/company.json --vendor com.clearbit.enrichment --name company --no-length --output schemas/com.clearbit/company/jsonschema/1-0-0

Making our schemas more tolerant

I then reviewed the JSON Schemas in an editor and made a few changes to make the schemas more permissive. This is important because schemas derived from just one JSON instance tend to be over-prescriptive, making it very likely that other instances in the future will fail validation against the schema.

The changes were:

  • Switching "additionalProperties" : false to true, to futureproof us for when Clearbit adds new properties
  • Taking almost all of the string properties and allowing them to take null as well
  • Where a schema property was set to null only, I added the other data type expected (either string or number)
  • For most of the numeric fields like Twitter followers I typically increased the size of integer used

You can see most of these schema widenings in this Iglu Central pull request.

Finally, I added the two new schemas to Iglu Central under the path schemas/com.clearbit.

Generating our JSON Paths and Redshift files

If we are loading our Snowplow events into Redshift, we will need table definitions and JSON Paths files to load the new Clearbit contexts into our database.

We can generate these like so:

$ ./schema-guru-0.6.1 ddl --with-json-paths schemas
File [./sql/com.clearbit.enrichment/person_1.sql] was written successfully!
File [./sql/com.clearbit.enrichment/company_1.sql] was written successfully!
File [./jsonpaths/com.clearbit.enrichment/person_1.json] was written successfully!
File [./jsonpaths/com.clearbit.enrichment/company_1.json] was written successfully!
Product type string,integer,null encountered in twitter.id
Product type string,integer,null encountered in twitter.id

I’ve now created Snowplow pull request #2635 to make these available in the main Snowplow repository.

Phase 3: Enrichment configuration

We’re now ready to write the JSON configuration for our API Request Enrichment.

We’ll call this file api_request_enrichment_config.json, and make sure it is available in the folder of enrichments passed in to Snowplow.

The wiki documentation for the API Request Enrichment gives a good overview of the various fields required - I’ll jump straight to the full configuration for Clearbit and then explain it below:

{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/api_request_enrichment_config/jsonschema/1-0-0",

  "data": {
    "enabled": true,
    "vendor": "com.snowplowanalytics.snowplow.enrichments",
    "name": "api_request_enrichment_config",
    "parameters": {
      "inputs": [
        {
          "key": "email",
          "json": {
            "field": "unstruct_event",
            "schemaCriterion": "iglu:com.snowplowanalytics.snowplow-website/trial_form_submitted/jsonschema/*-*-*",
            "jsonPath": "$.email"
          }
        }
      ],
      "api": {
        "http": {
          "method": "GET",
          "uri": "https://person-stream.clearbit.com/v2/combined/find?email={{"{{email}}"}}",
          "timeout": 15000,
          "authentication": {
            "httpBasic": {
              "username": "{{secret "clearbit-api-key"}}",
              "password": ""
            }
          }
        }
      },
      "outputs": [ 
        {
          "schema": "iglu:com.clearbit.enrichment/person/jsonschema/1-0-0",
          "json": {
            "jsonPath": "$.person"
          }
        },
        {
          "schema": "iglu:com.clearbit.enrichment/company/jsonschema/1-0-0",
          "json": {
            "jsonPath": "$.company"
          }
        }        
      ],
      "cache": {
        "size": 200,
        "ttl": 6000
      }
    }
  }
}

Let’s go through the four main sections in turn:

inputs

The first section lets us specify an array of inputs to use as keys when performing your API lookup. We have only one input: the email property in our trial_form_submitte event, so we configure this section as follows:

  • The key for the extracted value will be email. We can use this to access the value in the next section
  • We specify the field name as unstruct_event
  • Because both versions of our event contain the email property, we set the schemaCriterion to iglu:com.snowplowanalytics.snowplow-website/trial_form_submitted/jsonschema/*-*-*, meaning all versions of the schema
  • We use a JSON Path statement to access the top-level email property

api

The api section lets you configure how the enrichment should access your HTTP API.

For Clearbit we are configuring an http request via the GET method to Clearbit’s Combined API. We parameterize the uri with the email address to lookup, as extracted in the enrichment’s inputs section.

For authentication Clearbit uses Basic HTTP, where the HTTP username is set to the user’s API key and the HTTP password is deliberately left blank.

Although Clearbit promise to respond with the person and company (even if neither have been seen before) within 2-4 seconds, we really can’t afford to lose any completed lead forms due to slow responses, so we set the timeout (after which the enrichment will fail) to 15 seconds.

outputs

This enrichment assumes that your API returns a JSON, which will contain one or more entities that you want to add to your event as derived contexts.

Our outputs array contains two items reflecting the fact that Clearbit’s Combined API returns a JSON with two sub-objects:

  1. The Clearbit Person entity, accessible under the person property
  2. The Clearbit Company entity, accessible under the company property

For both entries in the outputsrray, the json.jsonPath property tells the enrichment where to find the entity in the JSON returned by the API, and the schema specifies the schema to attach when making the extracted JSON entity self-describing.

cache

A Snowplow enrichment can run many millions of time per hour, effectively launching a DoS attack on a data source if we are not careful. The cache configuration attempts to minimize the number of lookups performed.

In a given Snowplow run we are not expecting many trial_form_submitted events, so we can get away with a small cache size - just 200 elements.

Similarly, we don’t expect the Clearbit data to be updated very regularly, so we can happily keep the ttl (time-to-live) for cache entries to 100 minutes.

Deployment and testing

Now we’re ready to deploy our new enrichment and test it. Let’s get started.

Deploying the enrichment

To put this enrichment into production we need to do three things.

First, I ensured my configuration file api_request_enrichment_config.json is available in the folder of enrichments passed in to Snowplow.

Next, I deployed the two tables for the Clearbit person and company entities into your Redshift database, in the same database schema (e.g. atomic) as your events table:

(These links will require updating after the Snowplow PR is merged.)

Finally, because the Clearbit schemas are part of Iglu Central, I also deployed the JSON Paths files into Snowplow’s centrally hosted s3://snowplow-hosted-assets S3 bucket:

(Again, these links will require updating after the Snowplow PR is merged.)

If you were working with an internal or proprietary data source, instead of expecting Snowplow to host these JSON Paths files, you would add these JSON Paths files into the S3 bucket used for your jsonpath_assets path in your config.yml.

Sending in an event

Remember that we want to attach Clearbit data to the lead forms submitted to our website.

This is easy - I just fill in my details on the lead form here and click the button:

Please remember that this is Snowplow’s live website - please don’t submit a lead unless you genuinely want to be contacted about a Snowplow trial!

Good - the populated trial_form_submitted event has now been sent into Snowplow now. The next Snowplow run should read this event and apply the API Request Enrichment to it.

Verifying the enrichment’s output

Once I was confident that the completed lead form event had been rotated to our collector’s S3 bucket, I kicked off a run of the Snowplow batch pipeline and waited for this to complete.

Then I checked that my lead had made it through:

# SELECT * FROM atomic.com_snowplowanalytics_snowplow_website_trial_form_submitted_2 ORDER BY root_tstamp DESC LIMIT 1;
-[ RECORD 1 ]-------+---------------------------------------
schema_vendor       | com.snowplowanalytics.snowplow-website
schema_name         | trial_form_submitted
schema_format       | jsonschema
schema_version      | 2-0-0
root_id             | e5471530-2282-4e5a-8212-b56cd64553eb
root_tstamp         | 2016-05-08 13:10:11
ref_root            | events
ref_tree            | ["events","trial_form_submitted"]
ref_parent          | events
lead_source         | Trial Form Batch
first_name          | Alex
last_name           | Dean
email               | alex@snowplowanalytics.com
company             | Snowplow Analytics Ltd
events_per_month    |
suggested_call_date |

Great! Now to check out the contents of our Clearbit person table:

# SELECT * FROM atomic.com_clearbit_enrichment_person_1;
-[ RECORD 1 ]--------+---------------------------------------------------------------------------------------------
schema_vendor        | com.clearbit.enrichment
schema_name          | person
schema_format        | jsonschema
schema_version       | 1-0-0
root_id              | e5471530-2282-4e5a-8212-b56cd64553eb
root_tstamp          | 2016-05-08 13:10:11
ref_root             | events
ref_tree             | ["events","person"]
ref_parent           | events
...
avatar               | https://d1ts43dypk8bqh.cloudfront.net/v1/avatars/09b6da3c-9272-4952-baed-1430545def2f
bio                  | Co-founder of @SnowplowData
email                | alex@snowplowanalytics.com
employment.domain    | snowplowanalytics.com
employment.name      | Snowplow
employment.role      | engineering
employment.seniority | executive
employment.title     | Co-founder
facebook.handle      |
fuzzy                | f
gender               | male
geo.city             | London
geo.country          | United Kingdom
geo.country_code     | GB
geo.lat              | 51.5073509
geo.lng              | -0.1277583
geo.state            | England
geo.state_code       | England
github.avatar        | https://avatars.githubusercontent.com/u/200511?v=3
github.blog          | http://snowplowanalytics.com
github.company       | Snowplow Analytics
github.followers     | 112
github.following     | 100
github.handle        | alexanderdean
...

I’ve omitted about half the columns for brevity but you can see that the Clearbit lookup is bringing back a ton of interesting data about the sales prospect (in this case, me). And of course the root_id maps onto the event_id of the parent trial_form_submitted event, allowing the two tables to be linked.

Finally, let’s check out the contents of the Clearbit company table:

# select * from atomic.com_clearbit_enrichment_company_1;
-[ RECORD 1 ]-------------+--------------------------------------------------------------------------------------------
schema_vendor             | com.clearbit.enrichment
schema_name               | company
schema_format             | jsonschema
schema_version            | 1-0-0
root_id                   | e5471530-2282-4e5a-8212-b56cd64553eb
root_tstamp               | 2016-05-08 13:10:11
ref_root                  | events
ref_tree                  | ["events","company"]
ref_parent                | events
...
crunchbase.handle         | organization/snowplow-analytics
description               | Enterprise-strength marketing and product analytics. community@snowplowanalytics.com
domain                    | snowplowanalytics.com
domain_aliases            | []
email_provider            | f
facebook.handle           |
founded_date              |
geo.city                  | London
geo.country               | United Kingdom
geo.country_code          | GB
geo.lat                   | 51.523151
geo.lng                   | -0.0819515
geo.postal_code           | EC2A 4RQ
geo.state                 |
geo.state_code            |
geo.street_name           | Scrutton Street
geo.street_number         | 32-38
geo.sub_premise           |
id                        | 3a0b08e4-0b4b-4e35-becf-bc2f44a9a336
legal_name                |
linkedin.handle           | company/snowplow-analytics-ltd
location                  | 32-38 Scrutton St, London EC2A 4RQ, UK
logo                      | https://logo.clearbit.com/snowplowanalytics.com
...
time_zone                 | Europe/London
twitter.avatar            | https://pbs.twimg.com/profile_images/3751207409/6450a31078e1e6e258e7533e54fb4caa_normal.png
twitter.bio               | Enterprise-strength marketing and product analytics. community@snowplowanalytics.com
twitter.followers         | 1571
twitter.following         | 2144
twitter.handle            | SnowPlowData
twitter.id                | 587648379
twitter.location          |
twitter.site              | http://t.co/TZNVFVQXd3
type                      | private
url                       | http://snowplowanalytics.com
utc_offset                | 1

Fantastic - another wealth of data, this time about the prospect’s company (in this case Snowplow itself). And again, this table is joinable back to the individual trial_form_submitted table for easy analysis.

So we have configured our Snowplow pipeline to fetch relevant business data from Clearbit and attach it to our incoming leads - all without writing a single line of code!

Adapting this tutorial for another API

Hopefully this tutorial has given you an idea of how you can use the new API Request Enrichment to dimension widen your Snowplow events with additional datapoints.

To adapt this tutorial for another HTTP API, review the API’s documentation and then check out the wiki documentation for the enrichment.

If your HTTP API requires something that isn’t yet supported by the API Request Enrichment, please create a ticket!

Please share any success stories working with this enrichment in this thread, or create a new topic with your own tutorial!


How to enrich events with MySQL data using the SQL Query Enrichment [tutorial]
Tutorial ideas for the upcoming SQL Query Enrichment?