How to run Storage Loader in PostgreSQL database


#1

Step 4: setting up alternative data stores » Setup PostgreSQL.

  • In my localhost I had successfully installed PostgreSQL and successfully created & configured “atomic”.“events” table and configured resolver.json.

  • I tried to run in \4-storage\storage-loader\deploy directory with unzipping two files
    ie., snowplow-storage-loader & snowplow-emr-etl-runner from snowplow_emr_r88_angkor_wat.zip with config.yml configured?. Even though I am not having Amazon Awz Account.

When I tried to run the below command:

./snowplow-storage-loader --config config/config.yml --resolver config/resolver.json --targets config/targets/ --skip analyze

I’am getting the following error:

Unexpected error: (<unknown>): expected <block end>, but found BlockMappingStart while parsing a block mapping at line 70 column 4
org/jruby/ext/psych/PsychParser.java:219:in `parse'
uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/psych.rb:376:in `parse_stream'
uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/psych.rb:324:in `parse'
uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/psych.rb:251:in `load'
/home/hadoop/snowplow/4-storage/storage-loader/lib/snowplow-storage-loader/config.rb:52:in `get_config'
storage-loader/bin/snowplow-storage-loader:31:in `<main>'
org/jruby/RubyKernel.java:977:in `load'
uri:classloader:/META-INF/main.rb:1:in `<main>'
org/jruby/RubyKernel.java:959:in `require'
uri:classloader:/META-INF/main.rb:1:in `(root)'
uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in `<main>'
  • How should I need to configure config.yml file with postgreSQL installed. Below is my config.yml details

      aws:
       \# Credentials can be hardcoded or set in environment variables
        access_key_id: xxxx #<%= ENV['AWS_SNOWPLOW_ACCESS_KEY'] %>
        secret_access_key: yyy #<%= ENV['AWS_SNOWPLOW_SECRET_KEY'] %>
        s3:
          region: us-east-1
          buckets:
            assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
            jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
            log: ABCXYZ.s3-website-ap-southeast-1.amazonaws.com/logs #s3://emretlrunner/log
            raw:
              in:                  # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
                - s3://kinesis-sink         # e.g. s3://my-old-collector-bucket
                - s3://elasticbeanstalk-us-east-1-XXX/resources/environments/logs/publish/e-XXX
                #- s3:s3-ap-southeast-1.amazonaws.com/elasticbeanstalk-ap-southeast-1-123/resources/environments/logs/publish/e-abc12345y6
                #- ADD HERE         # e.g. s3://my-new-collector-bucket
              processing: s3://my-snowplow-bucket/processing
              archive: s3://my-snowplow-bucket/archive   #"s3://ABCXYZ.s3-website-ap-southeast-1.amazonaws.com/archive" # e.g. s3://my-archive-bucket/raw
            enriched:
              good: s3://my-snowplow-bucket/out       # e.g. s3://my-out-bucket/enriched/good
              bad: s3://my-snowplow-bucket/bad        # e.g. s3://my-out-bucket/enriched/bad
              errors: s3://my-snowplow-bucket/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
              archive:  s3://my-snowplow-bucket/storage-archive    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
            shredded:
              good: s3://my-snowplow-bucket/out       # e.g. s3://my-out-bucket/shredded/good
              bad: s3://my-snowplow-bucket/bad        # e.g. s3://my-out-bucket/shredded/bad
              errors: s3://my-snowplow-bucket/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
              archive: ADD HERE    # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
        emr:
          ami_version: 5.5.0
          region: us-east-1        # Always set this
          jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
          service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
          placement: us-east-1b     # Set this if not running in VPC. Leave blank otherwise
          ec2_subnet_id: subnet-07924c62 # Set this if running in VPC. Leave blank otherwise
          ec2_key_name: clerp     #emr-keypair
          bootstrap: []           # Set this to specify custom boostrap actions. Leave empty otherwise
          software:   
            hbase:     0.92.0           # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
            lingual:  '1.1'           # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
          # Adjust your Hadoop cluster below
          jobflow:
            job_name: Snowplow ETL # Give your job a name
            master_instance_type: m1.medium
            core_instance_count: 2
            core_instance_type: m1.medium
            core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
              volume_size: 100    # Gigabytes
              volume_type: "gp2"
              volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
              ebs_optimized: false # Optional. Will default to true
            task_instance_count: 0 # Increase to use spot instances
            task_instance_type: m1.medium
            task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
          bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
          additional_info:        # Optional JSON string for selecting additional features
      collectors:
        format: clj-tomcat #cloudfront # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
      enrich:
        versions:
          spark_enrich: 1.9.0 # Version of the Spark Enrichment process
        continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
        output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
      storage:
        versions:
          rdb_shredder: 0.12.0        # Version of the Spark Shredding process
          hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
        download:
          folder: s3://ABCXYZ.s3-website-ap-southeast-1.amazonaws.com/output #/var/storageloader # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
         targets:
        - name: My PostgreSQL database
          type: postgres
          host: localhost
          database: snowplow
          port: 5432
          table: atomic.events
          username: power_user
          password: xxxx
          maxerror:
          comprows:
          ssl_mode: disable
          
      monitoring:
        tags: {} # Name-value pairs describing this job
        logging:
          level: DEBUG # You can optionally switch to INFO for production
        snowplow:
          method: get
          app_id: WebSite # e.g. snowplow
          collector: snowp-env.ap-southeast-1.elasticbeanstalk.com # e.g. d3rkrsqld9gmqf.cloudfront.net
    
  • Here is my resolver.json file:

{
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "description": "Snowplow PostgreSQL storage configuration",
    "self": {
        "vendor": "com.snowplowanalytics.snowplow.storage",
        "name": "postgresql_config",
        "format": "jsonschema",
        "version": "1-0-0"
    },
    "type": "object",
    "properties": {
        "name": {
            "type": "PostgreSQL enriched events storage"
        },
        "host": {
            "type": "localhost"
        },
        "database": {
            "type": "snowplow"
        },
        "port": {
            "type": "integer",
            "minimum": 1,
            "maximum": 65535
        },
        "sslMode": {
            "type": "DISABLE",
            "enum": ["DISABLE", "REQUIRE", "VERIFY_CA", "VERIFY_FULL"]
        },
        "schema": {
            "type": "atomic"
        },
        "username": {
            "type": "power_user"
        },
        "password": {
            "type": "hadoop"
        },
        "purpose": {
            "type": "ENRICHED_EVENTS",
            "enum": ["ENRICHED_EVENTS"]
        }
    },
    "additionalProperties": false,
    "required": ["name", "host", "database", "port", "sslMode", "schema", "username","password", "purpose"]
}

Kindly help me to resolve this issue.
Thanking You,
Nishanth


Atomic.events to postgres database
Atomic.events to postgres database
Unexpected error while running the Storage Loader
#2

@nishanth,

I can see a few issues with your set up. You are using incompatible combination of job runners and configuration file. Additionally, your resolver file doesn’t look right.

Specifically,

  • You are using EmrEtlRunner and StorageLoader for Snowplow R88 release
  • Your configuration file, config.yml, is a mixture of various versions. It appears to be for R89 release. However, it contains targets section which was removed in R87 release.

To run Snowplow R88, please, correct the configuration file as per sample here https://github.com/snowplow/snowplow/blob/r88-angkor-wat/3-enrich/emr-etl-runner/config/config.yml.sample

Your targets folder has to contain the relevant JSON file. See examples here: https://github.com/snowplow/snowplow/tree/master/4-storage/config/targets

Your resolver is sufficient to contain the reference to Iglu Central

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      }
    ]
  }
}

We host the storage schemas so you don’t have to.