Most Snowplow users querying their data in Redshift won't need to parse JSONs in Redshift, because Snowplow shreds self-describing events and custom contexts into their own tables.
Occasionally however, it will be necessary or desirable to work with JSONs in Redshift. A couple of examples:
- Sometimes data is captured as arrays in varchar fields. A common example is the form_classes and elements fields in the submit_form_1 table, which is populated by the JavaScript form tracking.
- Sometimes it can be useful to create complex data types like arrays when doing analysis such as funnel or pathing analysis, because this gives you the flexibility to aggregate the steps in a user journey into a single row of data, without being limited by, or needing to know, the number of steps ahead of time.
Unfortunately, Redshift's inbuilt JSON parsing functions are very brittle: if even a single input value is not valid JSON, the whole query fails.
We therefore recommend using Redshift's support for Python UDFs to write more robust functions for parsing JSON data. At a minimum, it is straightforward to create a simple function that checks whether a string is valid JSON:
create or replace function is_json(j varchar(max))
returns boolean
stable as $$
import json
try:
    json.loads(j)
except (ValueError, TypeError):
    # ValueError: the string is not valid JSON
    # TypeError: the input is NULL (passed to Python as None)
    return False
return True
$$ language plpythonu;
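The same approach extends from validation to extraction. Below is a minimal sketch of the Python body such a UDF could run (the name json_extract_safe and its key parameter are our own illustration, not part of Redshift): it returns None (which Redshift maps to SQL NULL) instead of raising an error when the input is invalid JSON, is NULL, or lacks the requested key.

```python
import json

def json_extract_safe(j, key):
    # Parse defensively: ValueError covers malformed JSON,
    # TypeError covers NULL inputs (passed in as None).
    try:
        parsed = json.loads(j)
    except (ValueError, TypeError):
        return None
    # Only JSON objects have keys; anything else yields NULL.
    if not isinstance(parsed, dict) or key not in parsed:
        return None
    value = parsed[key]
    # Re-serialise nested structures; return scalars as strings.
    if isinstance(value, (dict, list)):
        return json.dumps(value)
    return str(value)
```

Wrapped in a create or replace function statement like the one above, this gives a drop-in, NULL-safe alternative to chaining is_json with Redshift's inbuilt extraction functions.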
This can then be used with a CASE
statement to filter out invalid JSONs prior to applying one of Redshift's inbuilt JSON parsing functions:
SELECT
CASE WHEN is_json(my_json_field) THEN my_json_field ELSE '{}' END AS filtered_jsons
...
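Putting the pieces together, a query along the following lines (my_json_field and my_table are placeholder names) substitutes an empty object for invalid rows, so Redshift's inbuilt json_extract_path_text can run over the whole column without failing:

```sql
SELECT
  json_extract_path_text(
    CASE WHEN is_json(my_json_field) THEN my_json_field ELSE '{}' END,
    'some_key'
  ) AS some_key_value
FROM my_table;
```

Rows with invalid JSON come back as empty strings rather than aborting the query, which is usually the preferable failure mode for analytical work.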