Strange behavior by Redshift StorageLoader - mapping "" to "f" and "0" to "t"


#1

On a recent batch process run, in my shredded / good folder, I see these records:

{"schema":{"vendor":"com.managedbyq","name":"oa_globals","format":"jsonschema","version":"1-0-0"},"data":{"version_content":"2.20.3","version_app":"1.7.1","extended_data":"","location_authorized":false,"location_auth_status":"0","location_device_enabled":true,"location_status":"0","wifi_enabled":true},"hierarchy":{"rootId":"c6468566-b1f8-4885-83d3-809c6569cfa5","rootTstamp":"2016-04-19 15:20:19.000","refRoot":"events","refTree":["events","oa_globals"],"refParent":"events"}}
{"schema":{"vendor":"com.managedbyq","name":"oa_globals","format":"jsonschema","version":"1-0-0"},"data":{"version_content":"2.20.3","version_app":"1.7.1","extended_data":"","location_authorized":false,"location_auth_status":"0","location_device_enabled":true,"location_status":"0","wifi_enabled":true},"hierarchy":{"rootId":"cabf4718-5cd0-4816-a976-7725733a1772","rootTstamp":"2016-04-19 15:20:19.000","refRoot":"events","refTree":["events","oa_globals"],"refParent":"events"}}```

Inexplicably, these are the records in my redshift table:<img src="/uploads/snowplowanalytics/original/1X/75e2185d0eebce94173489c04f4467a4ffcd2c31.png" width="690" height="23">

1. Why is extended_data coming through as "f", when it was a blank string in records?
2.  Why is location_status coming through as "t", when it was a "0" in the records?

Here's the table  - I created it using the schema guru:

<img src="/uploads/snowplowanalytics/original/1X/8c1fd5e703d2d93143459416852c6f3ef853c838.png" width="690" height="281">


Thanks,
Dan

#2

Hi @dweitzenfeld,

It sounds like there’s a confusion between boolean and string values.

Could you, please, share the JSON schema you used to validate the data as well as the JSONPaths file you used to load it into Redshift table?

Regards,
Ihor


#3

Json schema:
https://s3.amazonaws.com/q-snowplow-iglu/schemas/com.managedbyq/oa_globals/jsonschema/1-0-0
Jsonpath:

  "jsonpaths": [
    "$.schema.vendor",
    "$.schema.name",
    "$.schema.format",
    "$.schema.version",
    "$.hierarchy.rootId",
    "$.hierarchy.rootTstamp",
    "$.hierarchy.refRoot",
    "$.hierarchy.refTree",
    "$.hierarchy.refParent",
    "$.data.version_content",
    "$.data.version_app",
    "$.data.location_authorized",
    "$.data.location_auth_status",
    "$.data.location_device_enabled",
    "$.data.location_status",
    "$.data.wifi_enabled",
    "$.data.push_enabled",
    "$.data.extended_data"
  ]
}```

#4

@dweitzenfeld,

The order of the JSONPaths array elements must match the order of the columns in the target table.

If we take the column list of your table and the JSONPaths elements and try to line them up, here’s what we get.

Table Column List       | JSONPath elements order
---------------------------------------------------------
version_app             | $.data.version_content
version_content	        | $.data.version_app
extended_data	        | $.data.location_authorized
location_auth_status	| $.data.location_auth_status
location_authorized     | $.data.location_device_enabled
location_device_enabled	| $.data.location_status
location_status	        | $.data.wifi_enabled
push_enabled	        | $.data.push_enabled
wifi_enabled	        | $.data.extended_data

From the “table” above you can see that extended_data is lined up with the location_authorized which is a boolean value. The same goes to location_status which is lined up with the boolean wifi_enabled element. Thus the boolean values of wifi_enabled (true) and location_authorized (false) were converted to “string representation” on data load to Redshift table as those values were loaded into location_status and extended_data respectively (according to the mapping which JSONPaths represent).

You mentioned that the SQL code to create the table was generated by Schema Guru. What about JSONPaths file? Was it generated too or it’s an older version of manually created file?

Again, the two files (the SQL code to create the table and the elements in JSONPaths file) should match.

–Ihor


#5

Thanks Ihor - I had no idea that the two had to line up, and I had created the JSONPaths manually. I will fix that JSONPaths and I’m sure that will solve the problem.


#6

Hi guys,

I am having issue to importing fields with 2 words into redshift. As on the official document https://github.com/snowplow/snowplow/wiki/4-Loading-shredded-types?_sp=73bf6e88-c2fc-41c9-899f-e096d11a5f10.1463691292613. Using your example, will map version_app to versionApp on the JSONPath. I did it that way, then all the fields which having _ in Redshift (defined as camel case in JSONPath) are empty.

However I see your post of mapping version_app to $.data.version_content. May I know which should be the correct way? Are you eventually got your data imported?

Thanks

Sam


#7

Hi @mythsam,

The key name in JSONPath should match the name in JSON data file. JSONPath is like XPath for JSON (rather than XML document). You can validate your JSON with JSONPath expression using this tool, for example.

On the other side of the mapping, the order of the keys in JSONPath should follow the corresponding order of the fields in the Redshift table.

In the Amazon Redshift COPY syntax, a JSONPath expression specifies the explicit path to a single name element in a JSON hierarchical data structure.

Each JSONPath expression in the jsonpaths array corresponds to one column in the Amazon Redshift target table. The order of the jsonpaths array elements must match the order of the columns in the target table or the column list, if a column list is used.

http://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-copy-from-json.html

One could depict this relationship in the following manner.

JSON data <--[match name]--  JSONPath  --[match order]--> Redshift field

Also remember that JSONPath file (as well as the corresponding Redshift table definition) could be generated automatically with our Schema Guru tool.

Regards,
Ihor