How to run Storage Loader with a PostgreSQL database

I am following “Step 4: Setting up alternative data stores » Setup PostgreSQL”.

  • On my localhost I have successfully installed PostgreSQL, created and configured the “atomic”.“events” table, and configured resolver.json.

  • I unzipped two files, i.e. snowplow-storage-loader and snowplow-emr-etl-runner, from snowplow_emr_r88_angkor_wat.zip and tried to run them in the \4-storage\storage-loader\deploy directory with config.yml configured, even though I do not have an Amazon AWS account.

When I tried to run the below command:

./snowplow-storage-loader --config config/config.yml --resolver config/resolver.json --targets config/targets/ --skip analyze

I am getting the following error:

Unexpected error: (<unknown>): expected <block end>, but found BlockMappingStart while parsing a block mapping at line 70 column 4
org/jruby/ext/psych/PsychParser.java:219:in `parse'
uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/psych.rb:376:in `parse_stream'
uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/psych.rb:324:in `parse'
uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/psych.rb:251:in `load'
/home/hadoop/snowplow/4-storage/storage-loader/lib/snowplow-storage-loader/config.rb:52:in `get_config'
storage-loader/bin/snowplow-storage-loader:31:in `<main>'
org/jruby/RubyKernel.java:977:in `load'
uri:classloader:/META-INF/main.rb:1:in `<main>'
org/jruby/RubyKernel.java:959:in `require'
uri:classloader:/META-INF/main.rb:1:in `(root)'
uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in `<main>'
  • How should I configure the config.yml file with PostgreSQL installed? Below are my config.yml details:

      aws:
        # Credentials can be hardcoded or set in environment variables
        access_key_id: xxxx #<%= ENV['AWS_SNOWPLOW_ACCESS_KEY'] %>
        secret_access_key: yyy #<%= ENV['AWS_SNOWPLOW_SECRET_KEY'] %>
        s3:
          region: us-east-1
          buckets:
            assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
            jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
            log: ABCXYZ.s3-website-ap-southeast-1.amazonaws.com/logs #s3://emretlrunner/log
            raw:
              in:                  # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
                - s3://kinesis-sink         # e.g. s3://my-old-collector-bucket
                - s3://elasticbeanstalk-us-east-1-XXX/resources/environments/logs/publish/e-XXX
                #- s3:s3-ap-southeast-1.amazonaws.com/elasticbeanstalk-ap-southeast-1-123/resources/environments/logs/publish/e-abc12345y6
                #- ADD HERE         # e.g. s3://my-new-collector-bucket
              processing: s3://my-snowplow-bucket/processing
              archive: s3://my-snowplow-bucket/archive   #"s3://ABCXYZ.s3-website-ap-southeast-1.amazonaws.com/archive" # e.g. s3://my-archive-bucket/raw
            enriched:
              good: s3://my-snowplow-bucket/out       # e.g. s3://my-out-bucket/enriched/good
              bad: s3://my-snowplow-bucket/bad        # e.g. s3://my-out-bucket/enriched/bad
              errors: s3://my-snowplow-bucket/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
              archive:  s3://my-snowplow-bucket/storage-archive    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
            shredded:
              good: s3://my-snowplow-bucket/out       # e.g. s3://my-out-bucket/shredded/good
              bad: s3://my-snowplow-bucket/bad        # e.g. s3://my-out-bucket/shredded/bad
              errors: s3://my-snowplow-bucket/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
              archive: ADD HERE    # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
        emr:
          ami_version: 5.5.0
          region: us-east-1        # Always set this
          jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
          service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
          placement: us-east-1b     # Set this if not running in VPC. Leave blank otherwise
          ec2_subnet_id: subnet-07924c62 # Set this if running in VPC. Leave blank otherwise
          ec2_key_name: clerp     #emr-keypair
          bootstrap: []           # Set this to specify custom boostrap actions. Leave empty otherwise
          software:   
            hbase:     0.92.0           # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
            lingual:  '1.1'           # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
          # Adjust your Hadoop cluster below
          jobflow:
            job_name: Snowplow ETL # Give your job a name
            master_instance_type: m1.medium
            core_instance_count: 2
            core_instance_type: m1.medium
            core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
              volume_size: 100    # Gigabytes
              volume_type: "gp2"
              volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
              ebs_optimized: false # Optional. Will default to true
            task_instance_count: 0 # Increase to use spot instances
            task_instance_type: m1.medium
            task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
          bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
          additional_info:        # Optional JSON string for selecting additional features
      collectors:
        format: clj-tomcat #cloudfront # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
      enrich:
        versions:
          spark_enrich: 1.9.0 # Version of the Spark Enrichment process
        continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
        output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
      storage:
        versions:
          rdb_shredder: 0.12.0        # Version of the Spark Shredding process
          hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
        download:
          folder: s3://ABCXYZ.s3-website-ap-southeast-1.amazonaws.com/output #/var/storageloader # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
         targets:
        - name: My PostgreSQL database
          type: postgres
          host: localhost
          database: snowplow
          port: 5432
          table: atomic.events
          username: power_user
          password: xxxx
          maxerror:
          comprows:
          ssl_mode: disable
          
      monitoring:
        tags: {} # Name-value pairs describing this job
        logging:
          level: DEBUG # You can optionally switch to INFO for production
        snowplow:
          method: get
          app_id: WebSite # e.g. snowplow
          collector: snowp-env.ap-southeast-1.elasticbeanstalk.com # e.g. d3rkrsqld9gmqf.cloudfront.net
    
  • Here is my resolver.json file:

{
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "description": "Snowplow PostgreSQL storage configuration",
    "self": {
        "vendor": "com.snowplowanalytics.snowplow.storage",
        "name": "postgresql_config",
        "format": "jsonschema",
        "version": "1-0-0"
    },
    "type": "object",
    "properties": {
        "name": {
            "type": "PostgreSQL enriched events storage"
        },
        "host": {
            "type": "localhost"
        },
        "database": {
            "type": "snowplow"
        },
        "port": {
            "type": "integer",
            "minimum": 1,
            "maximum": 65535
        },
        "sslMode": {
            "type": "DISABLE",
            "enum": ["DISABLE", "REQUIRE", "VERIFY_CA", "VERIFY_FULL"]
        },
        "schema": {
            "type": "atomic"
        },
        "username": {
            "type": "power_user"
        },
        "password": {
            "type": "hadoop"
        },
        "purpose": {
            "type": "ENRICHED_EVENTS",
            "enum": ["ENRICHED_EVENTS"]
        }
    },
    "additionalProperties": false,
    "required": ["name", "host", "database", "port", "sslMode", "schema", "username","password", "purpose"]
}

Kindly help me to resolve this issue.
Thank you,
Nishanth

@nishanth,

I can see a few issues with your setup. You are using an incompatible combination of job runners and configuration file. Additionally, your resolver file doesn’t look right.

Specifically,

  • You are using the EmrEtlRunner and StorageLoader from the Snowplow R88 release
  • Your configuration file, config.yml, is a mixture of various versions. It appears to be for the R89 release; however, it contains a targets section, which was removed in the R87 release.

To run Snowplow R88, please correct the configuration file as per the sample here: https://github.com/snowplow/snowplow/blob/r88-angkor-wat/3-enrich/emr-etl-runner/config/config.yml.sample
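
A minimal sketch of what the storage section of that file might look like for a Postgres target (key names and layout should be double-checked against the linked sample; /var/storageloader is simply the local path from the comment in your own file):

storage:
  download:
    folder: /var/storageloader  # Postgres-only: a local directory to download the enriched events into, not an s3:// URL
  # Note there is no targets: block here -- targets live in separate JSON files (see below)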

Your targets folder has to contain the relevant JSON file. See examples here: https://github.com/snowplow/snowplow/tree/master/4-storage/config/targets
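
For a PostgreSQL target, that file is a self-describing JSON wrapping the postgresql_config 1-0-0 schema you pasted above -- something along these lines (host, port, database, schema and username are copied from your config; the password is a placeholder):

{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/postgresql_config/jsonschema/1-0-0",
  "data": {
    "name": "My PostgreSQL database",
    "host": "localhost",
    "database": "snowplow",
    "port": 5432,
    "sslMode": "DISABLE",
    "username": "power_user",
    "password": "CHANGE-ME",
    "schema": "atomic",
    "purpose": "ENRICHED_EVENTS"
  }
}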

Your resolver only needs to contain the reference to Iglu Central:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      }
    ]
  }
}
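
With the configuration file corrected, the target JSON saved under config/targets/ and a resolver like the above, your original ./snowplow-storage-loader invocation should get past the YAML parse error.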

We host the storage schemas so you don’t have to.
