New version of JSON schema for custom context causing errors with Storage Loader

ryanrozich · December 1, 2016, 4:57am

Today I made some updates to our json schemas and versioned them. Since I only added fields that were optional (these new fields accept null as value as well) I figured that I could just increment the minor version of the schema rather than the major version. I did this because I figured since these new fields are optional (nullable) this shouldn’t require a brand new redshift table

I created the new versions of the json schema files, ran the igluctl command to generate DDL files and jsonschema, ran the ddl files in redshift and updated the new schema and jsonpaths files to our s3 repository. Tonight we were re-processing some events and got the error pasted below

Im assuming that this has to do with the updates I made today to our jsonpaths files in our repo. When I looked at the json paths files that were created I noticed that the files were only versioned at the major version level. i.e the file names that were output by igluctl are named <schema>_1.json rather than <schema>_1-0-1.json so when I uploaded these to my repository, I overwrote the old version of the jsonpaths.

My questions are

(1) Does this sound like the right diagnosis for why Im seeing these errors below? The error doesn’t tell me which jsonpaths it is failing on but the timing of these errors with my version update today makes me think this has to be the case

(2) Was I right to update the minor version of the schema rather than create a new major version? I’d like to be able to add fields to my contexts that are nullable without having to manage a bunch of different versions of tables in redshift.

(3) If both answers above are yes, how can I generate jsonpaths files that will allow custom contexts from the previous minor version schema to be loaded along with contexts with the more recent versions?

COMMIT;: ERROR: Number of jsonpaths and the number of columns should match. JSONPath size: 12, Number of columns in table or column list: 19
  Detail:
  -----------------------------------------------
  error:  Number of jsonpaths and the number of columns should match. JSONPath size: 12, Number of columns in table or column list: 19
  code:      8001
  context:
  query:     85016
  location:  s3_utility.cpp:670
  process:   padbmaster [pid=32510]
  -----------------------------------------------
uri:classloader:/storage-loader/lib/snowplow-storage-loader/redshift_loader.rb:89:in `load_events_and_shredded_types'
uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
uri:classloader:/storage-loader/bin/snowplow-storage-loader:54:in `block in (root)'
uri:classloader:/storage-loader/bin/snowplow-storage-loader:51:in `<main>'
org/jruby/RubyKernel.java:973:in `load'
uri:classloader:/META-INF/main.rb:1:in `<main>'
org/jruby/RubyKernel.java:955:in `require'
uri:classloader:/META-INF/main.rb:1:in `(root)'
uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in `<main>'

mike · December 1, 2016, 5:30am

Hi @ryanrozich

If you’re just adding fields to a schema then incrementing the patch (addition) number is the correct thing to do, so it sounds like that’s on track.

The error you’re seeing come up typically occurs when the JSON path file doesn’t reflect the state of the Redshift table. To map property names in JSON to column names in Redshift the JSONPath should have one entry per column but it sounds like your JSON path file only has 12 columns rather than the 19 expected. For the change you are making (addition) you’d expect the same _1.json path with 19 entries to overwrite the existing file.

It sounds like you’ve quite possible done this so it may be worth double checking the JSON path file in S3 to make sure it reflects the structure of the table you are loading into.

ryanrozich · December 1, 2016, 9:00pm

Thanks @mike you were right the jsonpaths number did not match the number of columns

For some reason the redshift loader wants the files to have an underscore in the filename rather than a dash. Our schema names use hyphens to separate the words and I guess when we first set up our pipeline someone figured out that these needed to be converted to underscores to get the loader to work. So when I uploaded the new jasonpaths files it did not overwrite the existing ones, and the loader was still loolking for the old file names and throwing this error because those didn’t match the number of columns in the table.

When I replaced hyphens with underscores in my new jsonpaths file names that seems to have fixed the problem and the loader seems to be chugging along. Going forward I can make file renaming part of our schema build process, but is this expected behavior that when the schema names are hyphen separated the jsonpaths file names need to be manually converted to use underscores?

mike · December 1, 2016, 9:46pm

That’s correct, SchemaVer uses hyphens as a separator and the JSON path file uses an underscore before the major version number. From the sounds of it are you creating JSON path files and uploading them manually?

If that’s the case I’d highly recommend using the igluctl tool which automates a large part of this process - from generating the Redshift DDL and JSON paths from your schema to uploading these to S3 (and Snowplow Mini). This will take care of any manual processes that you may be doing at the moment and is relatively simple to incorporate into a build process if required.

ryanrozich · December 1, 2016, 11:02pm

Thanks @mike . I am using the igluctl tool to generate the jsonpaths file. Thats whats outputting the hyphen-separated files, I then need to manually change the files output by iglu to use underscores

mike · December 1, 2016, 11:09pm

That’s odd, the JSON path file should have an underscore. Could you possibly share the commands you are running and the filenames/results you are seeing?

ryanrozich · December 2, 2016, 2:48am

sure @mike here is a listing of my json schema

Ryans-MacBook-Pro:snowplow-schemas rrozich$ ls schemas/com.onespot/**/*
schemas/com.onespot/experiment/jsonschema:
1-0-0

schemas/com.onespot/onsite-link-click/jsonschema:
1-0-0

schemas/com.onespot/onsite-recommendation-unit-response/jsonschema:
1-0-0

schemas/com.onespot/onsite-recommendations/jsonschema:
1-0-0

schemas/com.onespot/onsite-unit/jsonschema:
1-0-0	1-0-1

schemas/com.onespot/pageview-context/jsonschema:
1-0-0	1-1-0	2-0-0	2-0-1

schemas/com.onespot/user-id/jsonschema:
1-0-0

Notice that we use hyphens to separate words in the schema names

Here is the command to run igluctl for generating DDLs and jsonpaths

Ryans-MacBook-Pro:snowplow-schemas rrozich$ ~/igluctl static generate --with-json-paths ./schemas/
File [/Users/rrozich/bidder/snowplow-schemas/./sql/com.onespot/pageview-context_2.sql] was not modified
File [/Users/rrozich/bidder/snowplow-schemas/./sql/com.onespot/experiment_1.sql] was not modified
File [/Users/rrozich/bidder/snowplow-schemas/./sql/com.onespot/onsite-recommendations_1.sql] was not modified
File [/Users/rrozich/bidder/snowplow-schemas/./sql/com.onespot/onsite-recommendation-unit-response_1.sql] was not modified
File [/Users/rrozich/bidder/snowplow-schemas/./sql/com.onespot/onsite-unit_1.sql] was not modified
File [/Users/rrozich/bidder/snowplow-schemas/./sql/com.onespot/pageview-context_1.sql] was not modified
File [/Users/rrozich/bidder/snowplow-schemas/./sql/com.onespot/user-id_1.sql] was not modified
File [/Users/rrozich/bidder/snowplow-schemas/./sql/com.onespot/onsite-link-click_1.sql] was not modified
File [/Users/rrozich/bidder/snowplow-schemas/./jsonpaths/com.onespot/pageview-context_2.json] was written successfully!
File [/Users/rrozich/bidder/snowplow-schemas/./jsonpaths/com.onespot/experiment_1.json] was written successfully!
File [/Users/rrozich/bidder/snowplow-schemas/./jsonpaths/com.onespot/onsite-recommendations_1.json] was written successfully!
File [/Users/rrozich/bidder/snowplow-schemas/./jsonpaths/com.onespot/onsite-recommendation-unit-response_1.json] was written successfully!
File [/Users/rrozich/bidder/snowplow-schemas/./jsonpaths/com.onespot/onsite-unit_1.json] was written successfully!
File [/Users/rrozich/bidder/snowplow-schemas/./jsonpaths/com.onespot/pageview-context_1.json] was written successfully!
File [/Users/rrozich/bidder/snowplow-schemas/./jsonpaths/com.onespot/user-id_1.json] was written successfully!
File [/Users/rrozich/bidder/snowplow-schemas/./jsonpaths/com.onespot/onsite-link-click_1.json] was written successfully!
File [/Users/rrozich/bidder/snowplow-schemas/./sql/com.onespot/onsite-unit/1-0-0/1-0-1.sql] was not modified
File [/Users/rrozich/bidder/snowplow-schemas/./sql/com.onespot/pageview-context/2-0-0/2-0-1.sql] was not modified

Here are the resulting file names

Ryans-MacBook-Pro:snowplow-schemas rrozich$ ls jsonpaths/**/*
jsonpaths/com.onespot/experiment_1.json
jsonpaths/com.onespot/onsite-link-click_1.json
jsonpaths/com.onespot/onsite-recommendation-unit-response_1.json
jsonpaths/com.onespot/onsite-recommendations_1.json
jsonpaths/com.onespot/onsite-unit_1.json
jsonpaths/com.onespot/pageview-context_1.json
jsonpaths/com.onespot/pageview-context_2.json
jsonpaths/com.onespot/user-id_1.json

Notice that the file names output by igluctl are hyphen-separated. While the igluctl tool uses an underscore to separate name from version number, I also had to convert the hyphens in the name of the schema to underscores in order to get the storage loader error to go away. So, its at this point that I need to run another script to convert the hyphens in the file names to underscores.

Once I ran this script to convert all hyphens to underscores

# convert all jsonpath file names to use underscores
for i in `ls ./jsonpaths/com.onespot/*-*`
do
NEW=`echo $i|tr '-' '_'`
mv $i $NEW

done

the file names look like this:

Ryans-MacBook-Pro:snowplow-schemas rrozich$ ls jsonpaths/**/*
jsonpaths/com.onespot/experiment_1.json
jsonpaths/com.onespot/onsite_link_click_1.json
jsonpaths/com.onespot/onsite_recommendation_unit_response_1.json
jsonpaths/com.onespot/onsite_recommendations_1.json
jsonpaths/com.onespot/onsite_unit_1.json
jsonpaths/com.onespot/pageview_context_1.json
jsonpaths/com.onespot/pageview_context_2.json
jsonpaths/com.onespot/user_id_1.json

and only then when I uploaded the resulting files to our s3 bucket for the storage loader to read, I was able to get the storage loader running again without the errors that I pasted in my original post.

mike · December 2, 2016, 3:04am

That’s definitely a peculiar one as I’ve typically only seen events defined using underscores rather than hyphen separators. I’m not 100% sure but I suspect the cause may be this line here which is converting . and - characters to _. When the job runs it’s looking for those JSON paths to have an underscore rather than the hyphen in your original schemas.

ryanrozich · December 2, 2016, 3:30am

yep that line looks like the culprit. I wasn’t aware that we werent supposed to use hyphens in names for contexts/events.

In any event, I can keep running my shell script to postprocess the names of the jsonpaths files after igluctl completes since it appears that fixes the problem for me.

But for the benefit of other users it might be a nice feature for igluctl either:

(a) throw an error as part of the lint process to flag hyphens as bad separators or
(b) name the jsonpaths files the way the storage loader expects to consume them

Appreciate the help @mike! I would still be stuck on this without your initial answer.

mike · December 2, 2016, 3:35am

No worries. I wasn’t really aware before this that you couldn’t have hyphen names in events either.

If you’d like it may be worth filing an issue on Github to update the documentation so others can avoid doing the same thing.

dashirov · December 20, 2016, 12:44am

The code handling redshift data sink is in my opinion incorrectly overloads sqlsafe name of the event to construct the target jsonpaths file name. After all igluctrl generated the file with hyphens in it and the destination table ddl has hyphens tr-ed into underscores. So every other part of the platform is functional and in total agreement of what should happen, but the redshift sink. Smells like a bug that can be fixed in a few lines of code.
From another post on the same subject it seems guru follows igluctl logic and also creates hyphenated jsonpaths files that redshift sink gets stumbled with.

Topic		Replies	Views
Why is the Redshift table definition for a schema not the latest version? Storage targets	14	2738	October 11, 2016
Jsonschema -> jsonpaths data not loading in redshift Troubleshooting	2	684	March 21, 2022
Schema mismatch while creating atomic tables in Redshift AWS batch pipeline (Legacy)	3	1266	July 23, 2019
Versioning custom jsonschema For engineers	3	553	January 4, 2022
Redshift automatic table migrations RFC RFCs	1	2104	December 29, 2018

New version of JSON schema for custom context causing errors with Storage Loader

Related Topics