RDB Loader is failing to find a JSONPath file

RDB Loader is raising the following error:

Data discovery error with following issues:
JSONPath file [com.snowplowanalytics.snowplow/parent_event_1.json] was not found

I cannot find such a JSONPath file in the GitHub repositories.

Config file:

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: <%= ENV['AWS_ACCESS_KEY'] %>
  secret_access_key: <%= ENV['AWS_SECRET_KEY'] %>
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: s3://company-snowplow-schema-dev/jsonpaths/ # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3n://company-shared-production/redshift/var/log/snowplow_tracker/logs/
      encrypted: false
      raw:
        in:
          - s3://company-snowplow-raw-production/
        processing: s3://company-shared-production/redshift/var/log/snowplow_tracker/raw/processing/
        archive: s3://company-shared-production/redshift/var/log/snowplow_tracker/raw/archive/
      enriched:
        good: s3://company-shared-production/redshift/var/log/snowplow_tracker/enriched/good/       # e.g. s3://my-out-bucket/enriched/good
        archive: s3://company-shared-production/redshift/var/log/snowplow_tracker/enriched/archive/    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
        bad: s3://company-shared-production/redshift/var/log/snowplow_tracker/enriched/bad/     # S3 Loader's output folder with enriched data. If present raw buckets will be discarded
        errors: s3://company-shared-production/redshift/var/log/snowplow_tracker/enriched/errors/
#        stream: s3://company-shared-production/var/log/snowplow_tracker/enriched/stream/
      shredded:
        good: s3://company-shared-production/redshift/var/log/snowplow_tracker/shredded/good/       # e.g. s3://my-out-bucket/shredded/good
        bad: s3://company-shared-production/redshift/var/log/snowplow_tracker/shredded/rdb_loader/        # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://company-shared-production/redshift/var/log/snowplow_tracker/shredded/errors/     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://company-shared-production/redshift/var/log/snowplow_tracker/shredded/archive/    # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
    consolidate_shredded_output: false # Whether to combine files when copying from hdfs to s3
  emr:
    ami_version: 5.9.0
    region: us-east-1        # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement:     # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-690b1543 # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: amplify-keypair
    security_configuration: # Specify your EMR security configuration if needed. Leave blank otherwise
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase: #"1.3.1" #"1.4.13"              # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: #"1.1"             # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow to Redshift ETL # Give your job a name
      master_instance_type: m1.medium
      core_instance_count: 12
      core_instance_type: r3.2xlarge
      core_instance_bid: #0.015 # In USD. Adjust bid, or leave blank for on-demand core instances
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m4.large
      task_instance_bid: #0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark-defaults:
        maximizeResourceAllocation: "true"
#    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: thrift # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.19.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: GZIP # Stream mode supports only GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.14.0        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags:
    app: Snowplow ETL # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
#  snowplow:
#    method: get
#    app_id: redshift_loader # e.g. snowplow
#    collector: # e.g. d3rkrsqld9gmqf.cloudfront.net
#    protocol: http
#    port: 80

resolver:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },
      {
        "name": "company Central",
        "priority": 5,
        "vendorPrefixes": [ "com.company" ],
        "connection": {
          "http": {
            "uri": "http://company-snowplow-schema-dev.s3.amazonaws.com"
          }
        }
      }
    ]
  }
}

Can anyone help with this issue, please?

Older versions of the pipeline require you to manually create and upload JSONPath files for the loader. You can do this by running igluctl static generate --with-json-paths on your schemas (docs), then uploading the generated JSONPath files to the bucket referenced by jsonpath_assets in your config (in your case that is the same S3 bucket that hosts your Iglu repo).
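
As a rough sketch (assuming igluctl is on your PATH and your schemas sit under a local schemas/ folder), the steps would look something like this, with the bucket path taken from the jsonpath_assets setting in your config:

igluctl static generate --with-json-paths schemas/
aws s3 cp jsonpaths/ s3://company-snowplow-schema-dev/jsonpaths/ --recursive

By default igluctl writes the JSONPath files to a local jsonpaths/ directory and the DDL to sql/, though the exact output layout can vary between igluctl versions.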

This will also produce a set of SQL files, which you must run against your database to create the corresponding tables.
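
For example (placeholder host, database and user names; the generated file name may differ slightly depending on your igluctl version), you could run the DDL against Redshift with psql:

psql -h your-cluster.redshift.amazonaws.com -p 5439 -U admin -d snowplow -f sql/com.snowplowanalytics.snowplow/parent_event_1.sql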

More recent versions of the pipeline do all of this automatically for you.

@Colm, thanks for your reply.

I tried to run igluctl on the parent_event schema I downloaded from the GitHub repository, but it fails with the following error:

% ./igluctl static generate --with-json-paths schemas/parent_event.json 

JSON schema in [schemas/parent_event.json] does not correspond to its metadata [iglu:com.snowplowanalytics.snowplow/parent_event/jsonschema/1-0-0]

Cannot read [schemas/parent_event.json]: no valid JSON Schemas

Ah, the schema needs to live in a directory structure that mirrors the self portion of the schema.

So vendor/name/format/version.
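
For example, something along these lines should put the downloaded schema into the expected layout before re-running igluctl (the file itself is named 1-0-0, with no .json extension, to match the version in the schema's self section):

mkdir -p schemas/com.snowplowanalytics.snowplow/parent_event/jsonschema
mv schemas/parent_event.json schemas/com.snowplowanalytics.snowplow/parent_event/jsonschema/1-0-0
./igluctl static generate --with-json-paths schemas/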