IgluError: JSON instance is not self-describing (schema property is absent)


#1

Hello All,

I am using the architecture below for loading events into the database.

JavaScript Tracker --> Scala Stream Collector --> Stream Enrich --> Kinesis S3 --> S3 --> EmrEtlRunner (shredding) --> PostgreSQL/Redshift

I am using EmrEtlRunner version "snowplow_emr_r88_angkor_wat".

Everything up to the EmrEtlRunner (shredding) step is configured and completes successfully.

I am stuck at the storage-loading step.
I am trying to load into PostgreSQL using the command below:

./snowplow-emr-etl-runner run --config snowplow/4-storage/config/emretlrunner.yml --resolver snowplow/4-storage/config/resolver.json --targets snowplow/4-storage/config/targets/ --skip analyze

I am getting the error below:

Error in [resolver.json] Not a self-describing JSON
Shutting down

My emretlrunner.yml file is below.

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: xxxxxxxxxxxxx
  secret_access_key: xxxxxxxxx
  #keypair: Snowplowkeypair
  #key-pair-file: /home/ubuntu/snowplow/4-storage/config/Snowplowkeypair.pem
  region: us-east-1
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://unilogregion1/logs
      raw:
        in:                  # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - s3://unilogregion1/      # e.g. s3://my-old-collector-bucket
        processing: s3://unilogregion1/raw/processing
        archive: s3://unilogregion1/raw/archive   # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://unilogregion1/enriched/good        # e.g. s3://my-out-bucket/enriched/good
        bad: s3://unilogregion1/enriched/bad       # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://unilogregion1/enriched/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://unilogregion1/enriched/archive    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://unilogregion1/shredded/good        # e.g. s3://my-out-bucket/shredded/good
        bad: s3://unilogregion1/shredded/bad        # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://unilogregion1/shredded/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://unilogregion1/shredded/archive     # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 5.5.0
    region: us-east-1       # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement: us-east-1a      # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id:  # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: Snowplowkeypair
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase:              # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: m2.4xlarge
      core_instance_count: 2
      core_instance_type: m2.4xlarge
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m2.4xlarge
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: thrift # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.9.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.12.0
    rdb_shredder: 0.12.0        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  #snowplow:
    #method: get
    #app_id: unilog # e.g. snowplow
    #collector: 172.31.38.39:8082 # e.g. d3rkrsqld9gmqf.cloudfront.net
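In case it is relevant: YAML forbids tab characters in indentation, and tabs can sneak in when configs are copied around. A small stdlib-only script to spot tab-indented lines (the sample string is purely illustrative, not my real config):

```python
def tab_indented_lines(text):
    # Return 1-based line numbers whose leading whitespace contains a tab,
    # which YAML does not allow for indentation.
    return [i + 1 for i, line in enumerate(text.splitlines())
            if "\t" in line[:len(line) - len(line.lstrip())]]

sample = "aws:\n  region: us-east-1\n\tbuckets:\n"
print(tab_indented_lines(sample))  # -> [3]
```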

My iglu_resolver.json file is below.

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
	"cacheSize": 500,
	"repositories": [
	  {
		"name": "Iglu Central",
		"priority": 0,
		"vendorPrefixes": [ "com.snowplowanalytics" ],
		"connection": {
		  "http": {
			"uri": "http://iglucentral.com"
		  }
		}
	  }
	]
  }
}

Please help me resolve this error.
Do I need to change any configuration?


#2

Hi @sandesh,

UPD: You’re using the EmrEtlRunner run subcommand, which appeared in R91, while you mentioned that you are on EmrEtlRunner from R88. You need to either drop run or (preferably) update to R92.

Old answer:

I think there’s some mistake in your target configurations. The message Error in [file] Not a self-describing JSON can only be printed when EmrEtlRunner loads targets. Can you confirm you didn’t accidentally put the same file into your snowplow/4-storage/config/targets/? Your resolver configuration looks correct to me.
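For reference, a "self-describing JSON" just means the document wraps its payload in top-level schema and data keys. A quick stdlib-only sketch (the sample documents are illustrative) of the shape check that trips this error, handy for auditing each file in your targets directory:

```python
import json

def is_self_describing(doc):
    # A self-describing JSON is an object with top-level "schema" and "data" keys.
    return isinstance(doc, dict) and "schema" in doc and "data" in doc

wrapped = json.loads(
    '{"schema": "iglu:com.snowplowanalytics.snowplow.storage/'
    'postgresql_config/jsonschema/1-0-0", "data": {"host": "localhost"}}'
)
bare = json.loads('{"host": "localhost", "port": 5432}')  # no wrapper

print(is_self_describing(wrapped))  # True
print(is_self_describing(bare))     # False
```

Any target file for which this prints False would produce the "Not a self-describing JSON" message.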


#3

Thanks for the quick reply @anton.

My primary question: if I update to R92, when I unzip the archive I find only snowplow-emr-etl-runner; how do I run the storage step with it?

In the targets section there are 4 JSON files:

  1. dynamodb.json
  2. elasticsearch.json
  3. postgres.json
  4. redshift.json

I have only updated the postgres.json file as below; the other JSON files I have left as they are.

{
	"schema": "iglu:com.snowplowanalytics.snowplow.storage/postgresql_config/jsonschema/1-0-0",
	"data": {
		"name": "PostgreSQL enriched events storage",
		"host": "localhost",
		"database": "snowplow",
		"port": 5432,
		"sslMode": "DISABLE",
		"username": "power_user",
		"password": "hadoop",
		"schema": "atomic",
		"purpose": "ENRICHED_EVENTS"
	}
}

Do I need to make any changes to the JSON files?


#4

Since R90, the storage-loading logic has moved into the EMR cluster, so users no longer need to run a separate application. StorageLoader is simply gone.

You need to delete all target configurations except the one for Postgres, because otherwise EmrEtlRunner will fail due to inaccessible storage targets.
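One possible way to do that pruning, sketched against a temporary directory rather than your real config (the file names mirror the ones listed earlier in this thread; substitute snowplow/4-storage/config/targets/ for the temp directory in practice):

```shell
# Sketch only: keep just the Postgres target, delete the rest.
TARGETS=$(mktemp -d)
touch "$TARGETS"/dynamodb.json "$TARGETS"/elasticsearch.json \
      "$TARGETS"/postgres.json "$TARGETS"/redshift.json

# Delete every JSON target except postgres.json
find "$TARGETS" -name '*.json' ! -name 'postgres.json' -delete
ls "$TARGETS"   # only postgres.json should remain
```

Moving the unused targets aside (instead of deleting them) works just as well if you may need them later.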


#5

Thanks @anton.
I tried R92 in the way you described.
Below is the command I used:

./snowplow-emr-etl-runner run --config snowplow/4-storage/config/emretlrunner.yml --resolver snowplow/4-storage/config/iglu_resolver.json --targets snowplow/4-storage/config/targets/ --skip analyze

Below are the error details:

D, [2017-10-04T12:15:32.590000 #11136] DEBUG -- : Initializing EMR jobflow
D, [2017-10-04T12:15:45.623000 #11136] DEBUG -- : EMR jobflow j-2A55H8CMZU9J2 started, waiting for jobflow to complete...
I, [2017-10-04T12:27:48.635000 #11136]  INFO -- : No RDB Loader logs
F, [2017-10-04T12:27:48.960000 #11136] FATAL -- :

Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-2A55H8CMZU9J2 failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow ETL: TERMINATING [STEP_FAILURE] ~ elapsed time n/a [2017-10-04 12:21:27 +0000 - ]
 - 1. Elasticity S3DistCp Step: Raw s3://unilogregion1/ -> Raw Staging S3: COMPLETED ~ 00:02:00 [2017-10-04 12:21:29 +0000 - 2017-10-04 12:23:29 +0000]
 - 2. Elasticity S3DistCp Step: Raw S3 -> Raw HDFS: COMPLETED ~ 00:01:50 [2017-10-04 12:23:31 +0000 - 2017-10-04 12:25:21 +0000]
 - 3. Elasticity Spark Step: Enrich Raw Events: COMPLETED ~ 00:01:02 [2017-10-04 12:25:23 +0000 - 2017-10-04 12:26:25 +0000]
 - 4. Elasticity S3DistCp Step: Enriched HDFS -> S3: FAILED ~ 00:00:06 [2017-10-04 12:26:27 +0000 - 2017-10-04 12:26:34 +0000]
 - 5. Elasticity S3DistCp Step: Shredded S3 -> Shredded Archive S3: CANCELLED ~ elapsed time n/a [ - ]
 - 6. Elasticity S3DistCp Step: Enriched S3 -> Enriched Archive S3: CANCELLED ~ elapsed time n/a [ - ]
 - 7. Elasticity Custom Jar Step: Load PostgreSQL enriched events storage Storage Target: CANCELLED ~ elapsed time n/a [ - ]
 - 8. Elasticity S3DistCp Step: Raw Staging S3 -> Raw Archive S3: CANCELLED ~ elapsed time n/a [ - ]
 - 9. Elasticity S3DistCp Step: Shredded HDFS _SUCCESS -> S3: CANCELLED ~ elapsed time n/a [ - ]
 - 10. Elasticity S3DistCp Step: Shredded HDFS -> S3: CANCELLED ~ elapsed time n/a [ - ]
 - 11. Elasticity Spark Step: Shred Enriched Events: CANCELLED ~ elapsed time n/a [ - ]
 - 12. Elasticity Custom Jar Step: Empty Raw HDFS: CANCELLED ~ elapsed time n/a [ - ]
 - 13. Elasticity S3DistCp Step: Enriched HDFS _SUCCESS -> S3: CANCELLED ~ elapsed time n/a [ - ]):
	uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:586:in `run'
	uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
	uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
	uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
	uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:103:in `run'
	uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
	uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
	uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
	uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:41:in `<main>'
	org/jruby/RubyKernel.java:979:in `load'
	uri:classloader:/META-INF/main.rb:1:in `<main>'
	org/jruby/RubyKernel.java:961:in `require'
	uri:classloader:/META-INF/main.rb:1:in `(root)'
	uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in `<main>'

Please help me resolve this.


#6

Please ask new questions in new threads @sandesh.


#7

I have asked this question in a new thread. Below is the link.