More robust JSON parsing in Redshift with Python UDFs


#1

Most Snowplow users querying their data in Redshift wont need to parse JSONs in Redshift, because Snowplow shreds self describing events and custom contexts into their own tables.

Occasionally however, it will be necessary or desirable to work with JSONs in Redshift. A couple of examples:

  1. Sometimes data is captured as arrays in varchar fields. A common example is the form_classes and elements fields in the submit_form_1 that is populated using the Javascript form tracking.
  2. Sometimes it can be useful to create complex data types like arrays when doing analysis like funnel or pathing analysis. (Because this gives you the flexibility to aggregate steps in a user journey into a single line of data, without being limited or knowing the number of steps ahead of time)

Unfortunately Redshift’s inbuilt JSON parsing functions are very brittle: they’ll break if just one input data point is not a valid JSON.

We therefore recommend using Redshift’s support for Python UDFs to write more robust functions for parsing JSON data. At minimum, it is straightforward to create a simple function that checks that a JSON is validated:

create or replace function is_json(j varchar(max))
  returns boolean
  stable as $$
    import json
    try:
      json_object = json.loads(j)
    except ValueError, e:
      return False
    return True
  $$ language plpythonu;

This can then be used to with a CASE statement to filter out invalid JSONs prior to applying one of Redshift’s inbuilt JSON parsing functions:

SELECT
CASE WHEN is_json(my_json_field) THEN my_json_field ELSE '{}' END AS filtered_jsons
...

#2

The folks at Periscope data have published a useful set of UDFs for parsing JSONs in Redshift here.

Please reply to the thread with any other useful resources for JSON parsing in Redshift!