Trying to use StorageLoader with Stream Enrich without AWS


#1

Hi all,

I have configured JavaScript Tracker --> Scala Stream Collector --> Stream Enrich --> PostgreSQL.
But when I run StorageLoader with the command below:

./snowplow-storage-loader --config config/config.yml --resolver config/resolver.json --targets config/targets/ --skip analyze

I get the following error:

> Unexpected error: undefined method `[]=' for nil:NilClass
> /home/hadoop/snowplow/4-storage/storage-loader/lib/snowplow-storage-loader/config.rb:56:in `get_config'
> storage-loader/bin/snowplow-storage-loader:31:in `<main>'
> org/jruby/RubyKernel.java:977:in `load'
> uri:classloader:/META-INF/main.rb:1:in `<main>'
> org/jruby/RubyKernel.java:959:in `require'
> uri:classloader:/META-INF/main.rb:1:in `(root)'
> uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in `<main>'

Below is my configuration file (config.yml) for StorageLoader (PostgreSQL):

s3:
  region: eu-west-1 # S3 bucket region
  buckets:
    in: ADD HERE
    archive: ADD HERE
download:
  folder: /home/hadoop/snowplow/4-storage/ # Postgres-only config option. Where to store the downloaded files
targets:
   - :name: "PostgreSQL enriched events storage"
     :type: postgres
     :host: localhost # Hostname of database server
     :database: snowplow # Name of database
     :port: 5432 # Default Postgres port
     :table: atomic.events
     :username: power_user
     :password: hadoop
     :maxerror: # Not required for Postgres

Below is my resolver.json file:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      }
    ]
  }
}

Please help me put together a config.yml that lets me run StorageLoader.
If anyone has already done this, please share it. I have been struggling with this for the past 3 days.

Thanks and Regards
Sandesh P


#2

@sandesh,

What version of the StorageLoader are you running?

Also, how did you implement the Stream Enrich --> PostgreSQL link? Are you using Kinesis S3 to sink the events to S3? Your configuration doesn’t show where the enriched files will be taken from to be loaded into Postgres.
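
One quick way to see which settings are still unfilled: the posted config.yml has ADD HERE for both S3 buckets. Below is a minimal Ruby sketch that walks a parsed YAML document and lists every key path still set to that placeholder. The helper name placeholder_paths is hypothetical, and the nesting of the sample is an assumption, since the indentation of the posted file did not survive.

```ruby
require 'yaml'

# Placeholder text used in the posted config.yml.
PLACEHOLDER = 'ADD HERE'

# Hypothetical helper: recursively collect the dotted key paths whose
# value is still the placeholder string.
def placeholder_paths(node, path = [])
  case node
  when Hash
    node.flat_map { |key, value| placeholder_paths(value, path + [key.to_s]) }
  when Array
    node.each_with_index.flat_map { |value, i| placeholder_paths(value, path + [i.to_s]) }
  else
    node == PLACEHOLDER ? [path.join('.')] : []
  end
end

# Assumed nesting of the s3 section from the first post.
sample = YAML.load(<<~YAML)
  s3:
    region: eu-west-1
    buckets:
      in: ADD HERE
      archive: ADD HERE
YAML

puts placeholder_paths(sample)
```

Any path it prints is a setting the loader has no real value for, which is why it cannot know where the enriched files live.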


#3

Hey @ihor thanks for the response…

We are using version R88 of StorageLoader.
For Stream Enrich we are using the following configuration:
source = "stdin"
sink = "stdouterr"
We are investigating how to connect Stream Enrich --> PostgreSQL so that the events are added to the PostgreSQL database.
We could not find any example of a PostgreSQL configuration (config.yml), so please help us load the Stream Enrich data into PostgreSQL.
Please suggest a config.yml that we can use to run StorageLoader.


#4

Hi @sandesh - this isn’t a supported topology for Snowplow currently.

There is currently no way of wiring Stream Enrich up to Postgres on-premise without using AWS. You are missing a whole component in the middle, Spark Shred, which at present only runs on EMR.

This may change in the future (particularly in Snowplow Mini), but it’s not something we can help with at this time.


#6

@sandesh,

Here’s the “architecture” you should be using (either of the two should do):

  • Enrichment done in Stream Enrich:
... -> Stream Enrich -> Kinesis S3 -> S3 -> EmrEtlRunner (shredding) -> PostgreSQL
  • Enrichment done in EMR:
... -> Stream raw -> Kinesis S3 -> S3 -> EmrEtlRunner (enrich + shredding) -> PostgreSQL

A sample of the config.yml for R88 is here. The database target JSON configuration file is here.

Note that the “targets” section was removed from the YAML configuration in R88 and replaced with a JSON configuration file.
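
For what it's worth, the "undefined method []= for nil:NilClass" message in the first post is what Ruby raises when code assigns into a nested section that is absent from the parsed configuration. That would be consistent with a config.yml whose layout doesn't match what the loader expects, though I haven't confirmed that is the cause here. A minimal sketch of that failure mode follows. The helpers set_nested and set_nested_checked and the key names are hypothetical, and this is not the StorageLoader source.

```ruby
require 'yaml'

# Cut-down config; 'some_section' is deliberately absent.
config = YAML.load(<<~YAML)
  s3:
    region: eu-west-1
YAML

# Hypothetical helper: assigns into a nested section without checking
# that the section exists. config[section] is nil when it is missing,
# so the assignment calls []= on nil.
def set_nested(config, section, key, value)
  config[section][key] = value
end

# Defensive variant: names the missing section instead of failing on nil.
def set_nested_checked(config, section, key, value)
  target = config[section]
  raise ArgumentError, "config is missing the '#{section}' section" if target.nil?
  target[key] = value
end

begin
  set_nested(config, 'some_section', 'skip', ['analyze'])
rescue NoMethodError => e
  puts "NoMethodError: #{e.message}"
end

begin
  set_nested_checked(config, 'some_section', 'skip', ['analyze'])
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end
```

The second variant is only there to show how much easier the problem is to diagnose when the missing section is reported by name.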

You can refer to Lambda architecture to clarify this setup.