More robust JSON parsing in Redshift with Python UDFs

yali · May 4, 2016, 10:13am

Most Snowplow users querying their data in Redshift won't need to parse JSONs in Redshift, because Snowplow shreds self-describing events and custom contexts into their own tables.

Occasionally, however, it will be necessary or desirable to work with JSONs in Redshift. A couple of examples:

Sometimes data is captured as arrays in varchar fields. A common example is the form_classes and elements fields in the submit_form_1 table, which is populated by the JavaScript form tracking.
Sometimes it can be useful to create complex data types like arrays when doing funnel or pathing analysis, because this lets you aggregate the steps in a user journey into a single line of data without needing to know the number of steps ahead of time.

Unfortunately Redshift’s inbuilt JSON parsing functions are very brittle: they’ll break if just one input data point is not a valid JSON.

We therefore recommend using Redshift’s support for Python UDFs to write more robust functions for parsing JSON data. At minimum, it is straightforward to create a simple function that checks whether a string is valid JSON:

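create or replace function is_json(j varchar(max))
  returns boolean
  stable as $$
    import json
    try:
      # json.loads raises ValueError on malformed input
      # and TypeError when j is NULL (passed to Python as None)
      json.loads(j)
    except (ValueError, TypeError):
      return False
    return True
  $$ language plpythonu;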

This can then be used with a CASE statement to filter out invalid JSONs before applying one of Redshift’s inbuilt JSON parsing functions:

SELECT
CASE WHEN is_json(my_json_field) THEN my_json_field ELSE '{}' END AS filtered_jsons
...
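Going one step further than validation, the parsing itself can be moved into Python so that bad input yields NULL instead of an error. The following is only a sketch in plain Python so it can be tested locally: the function names json_extract_safe and json_array_element_safe are made up for illustration, and the code has not been run inside Redshift. To use it as a UDF, the body of each function would need to be placed between the $$ markers of a create function statement, as with is_json above.

```python
import json

# Text type that works on both Python 2.7 (used by Redshift UDFs) and Python 3
TEXT_TYPE = type(u"")


def _as_text(value):
    """Render a parsed JSON value as text; a JSON null becomes None."""
    if value is None:
        return None
    if isinstance(value, TEXT_TYPE):
        return value
    # Numbers, booleans, objects and arrays are re-serialised as JSON
    return json.dumps(value)


def _load(j):
    """Parse j, returning None for NULL or malformed input instead of raising."""
    try:
        return json.loads(j)
    except (ValueError, TypeError):
        return None


def json_extract_safe(j, key):
    """Hypothetical helper: value stored under key, or None if j is not a JSON object."""
    obj = _load(j)
    if not isinstance(obj, dict):
        return None
    return _as_text(obj.get(key))


def json_array_element_safe(j, idx):
    """Hypothetical helper: element idx of a JSON array, or None if unavailable."""
    arr = _load(j)
    if not isinstance(arr, list) or not isinstance(idx, int):
        return None
    if not 0 <= idx < len(arr):
        return None
    return _as_text(arr[idx])
```

With something like this in place, the CASE wrapper is no longer needed for the fields it covers, because a row that fails to parse simply produces a NULL that can be filtered out or counted.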

yali · May 4, 2016, 10:14am

The folks at Periscope Data have published a useful set of UDFs for parsing JSONs in Redshift here.

Please reply to the thread with any other useful resources for JSON parsing in Redshift!
