[ERROR] Updating Snowplow Enricher - ResolutionError

Hello Everybody,

How are you? Hope you are all OK.
I need your help!!!

Let me explain the context of the problem.
We use the AWS stack with Snowplow. We have 4 Kinesis data streams:
Used by the collector:
raw_data
bad_raw_data

Used by the enricher:
enriched_data
bad_enriched_data

These are our Java, collector and enricher versions:

OpenJDK 64-Bit Server VM (build 17-ea+11-Ubuntu-114.042, mixed mode, sharing)
snowplow-stream-enrich-kinesis-3.2.2.jar
snowplow-stream-collector-kinesis-2.7.0.jar

The collector is working perfectly; it's on version snowplow-stream-collector-kinesis-2.7.0.jar.
The execution line for our current Stream Enrich is:

/usr/bin/java -Dorg.slf4j.simpleLogger.defaultLogLevel=debug -Xms512m -Xmx1024m -jar /srv/snowplow/bin/snowplow-stream-enrich-kinesis-3.2.2.jar --config /srv/snowplow/conf/enrich_new.conf --resolver file:/srv/snowplow/conf/iglu_test_rcc.json --enrichments file:/srv/snowplow/data/enrichments

Regarding our enricher (enrich-kinesis), this is the version: snowplow-enrich-kinesis-3.2.2.jar

This is the execution line:

/usr/bin/java -Dorg.slf4j.simpleLogger.defaultLogLevel=debug -Xms512m -Xmx1024m -jar /srv/snowplow/bin/snowplow-enrich-kinesis-3.2.2.jar --config /srv/snowplow/conf/enrich_new.conf --iglu-config /srv/snowplow/conf/iglu_test_rcc.json --enrichments /srv/snowplow/data/enrichments

So, these are our config files:

  • Enricher conf:
enrich {

  streams {

    in {
      raw = "rawdata-integration"
    }

    out {
      enriched = "enricheddata-integration"
      bad = "rawdatabad-integration"
      pii = ""
      partitionKey = "event_id"
    }

    sourceSink {
      enabled =  "kinesis"

      region = "eu-west-1"

      aws {
        accessKey = "iam"
        secretKey = "iam"
      }

      maxRecords = 10000

      initialPosition = "TRIM_HORIZON"

      backoffPolicy {
        minBackoff = 1000
        maxBackoff = 60000
      }
    }

    buffer {
      byteLimit = 4500000
      recordLimit = 500
      timeLimit = 60000
    }

    appName = "snowplow_enrich_progress-integration"
  }

}
  • The Iglu resolver JSON:
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 1,
    "repositories": [
      {
        "name": "busuu Iglu Repo",
        "priority": 5,
        "vendorPrefixes": [ "com.customstuff" ],
        "connection": {
          "http": {
            "uri": "file:///srv/snowplow/data"
          }
        }
      },
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "file:///srv/snowplow/data"
          }
        }
      }
    ]
  }
}

As you can see, our schemas live on the local file system: /srv/snowplow/data contains a schemas/ folder with everything in it.
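The layout under that folder follows the standard Iglu repo structure (<root>/schemas/<vendor>/<name>/jsonschema/<model>-<revision>-<addition>), for example:

/srv/snowplow/data/schemas/com.snowplowanalytics.snowplow/enrichments/jsonschema/1-0-0
/srv/snowplow/data/schemas/com.customstuff/<schema_name>/jsonschema/1-0-0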

Regarding enrichments, we only have ip-lookup.json:

{
    "schema": "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/2-0-0",
    "data": {
        "name": "ip_lookups",
        "vendor": "com.snowplowanalytics.snowplow",
        "enabled": true,
        "parameters": {
            "geo": {
                "database": "GeoIP2-City.mmdb",
                "uri": "s3://mybucket/integration"
            }
        }
    }
}

The error is this one:

/usr/bin/java -Dorg.slf4j.simpleLogger.defaultLogLevel=debug -Xms512m -Xmx1024m -jar /srv/snowplow/bin/snowplow-stream-enrich-kinesis-3.2.2.jar --config /srv/snowplow/conf/enrich_new.hocon --resolver file:/srv/snowplow/conf/iglu_test_rcc.json --enrichments file:/srv/snowplow/data/iglu-central-master/
[main] DEBUG scalacache.guava.GuavaCache - Cache miss for key SchemaKey(com.snowplowanalytics.snowplow,enrichments,jsonschema,Full(1,0,0))
[main] DEBUG scalacache.guava.GuavaCache - Cache miss for key SchemaKey(com.snowplowanalytics.snowplow,enrichments,jsonschema,Full(1,0,0))
[main] DEBUG scalacache.guava.GuavaCache - Inserted value into cache with key SchemaKey(com.snowplowanalytics.snowplow,enrichments,jsonschema,Full(1,0,0))
{"error":"ResolutionError","lookupHistory":[{"repository":"Iglu Central","errors":[{"error":"RepoFailure","message":"sun.net.www.protocol.file.FileURLConnection:file:/srv/snowplow/data/schemas/com.snowplowanalytics.snowplow/enrichments/jsonschema/1-0-0 (of class sun.net.www.protocol.file.FileURLConnection)"}],"attempts":1,"lastAttempt":"2022-07-28T11:55:14.292Z"},{"repository":"Iglu Client Embedded","errors":[{"error":"NotFound"}],"attempts":1,"lastAttempt":"2022-07-28T11:55:14.307Z"},{"repository":"busuu Iglu Repo","errors":[{"error":"RepoFailure","message":"sun.net.www.protocol.file.FileURLConnection:file:/srv/snowplow/data/schemas/com.snowplowanalytics.snowplow/enrichments/jsonschema/1-0-0 (of class sun.net.www.protocol.file.FileURLConnection)"}],"attempts":1,"lastAttempt":"2022-07-28T11:55:14.315Z"}]}

I don't know what the next step should be. We want to keep all our schemas locally on the machine; by the way, it seems that it doesn't like the 1-0-0 resolver.

Thank you all a lot!!!
Best,
Raúl

Hi @RaulCC ,

Welcome to the Snowplow community!

Thank you for providing all the details.

The error comes from this line:

            "uri": "file:///srv/snowplow/data"

Schemas need to be exposed via HTTP. What you could do is run a simple HTTP server in /srv/snowplow/data (e.g. with python3 -m http.server) and then point to it in your resolvers:

            "uri": "http://localhost:8000"

(not tested)
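For example (again, not tested), from the schema root:

cd /srv/snowplow/data
python3 -m http.server 8000
# in another shell, check that a schema is actually served:
curl http://localhost:8000/schemas/com.snowplowanalytics.snowplow/enrichments/jsonschema/1-0-0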

Also, please note that stream-enrich-kinesis is to be deprecated in favor of enrich-kinesis, which comes with more features (e.g. metrics). Instructions to set it up can be found here.

Hello again BenB,

Let me share with you the progress we have made with Snowplow.
As you suggested, after exposing the schemas with

python3 -m http.server

it seemed to work; at least for testing it looks suitable. Is it mandatory to have the schemas exposed somehow? Earlier versions of the enricher allowed pointing to the schemas locally, which seems more logical to me; that feature was useful.
Regarding using only enrich-kinesis, I have made the changes and executed it with the above Python server up.
Let me share with you my configuration:

{
  "input": {
    "type": "Kinesis"
    "appName": "snowplow_enrich_progress-integration"
    "streamName": "rawdata-integration"
    "region": "eu-west-1"
    "initialPosition": {
      "type": "TRIM_HORIZON"
    }

    # Optional, set the mode for retrieving records.
    "retrievalMode": {
      "type": "Polling"
      "maxRecords": 10000
    }
    "bufferSize": 3
    "checkpointBackoff": {
      "minBackoff": 100 milliseconds
      "maxBackoff": 100 seconds
      "maxRetries": 10
    }
  }

  "output": {
    "good": {
      "type": "Kinesis"
      "streamName": "enricheddata-integration"
      "region": "eu-west-1"
      "backoffPolicy": {
        "minBackoff": 100 milliseconds
        "maxBackoff": 100 seconds
        "maxRetries": 10
      }
    }

    # Bad rows output
    "bad": {
      "type": "Kinesis"
      "streamName": "rawdatabad-integration"
      "region": "eu-west-1"
      "backoffPolicy": {
        "minBackoff": 100 milliseconds
        "maxBackoff": 100 seconds
        "maxRetries": 10
      }
      "recordLimit": 500
    }
  }

  "concurrency" : {
    "enrich": 256
    "sink": 1
  }

  "monitoring": {
    # Optional, configure how metrics are reported
    "metrics": {
      "cloudwatch": false
    }
  }

  "telemetry": {
    # Set to true to disable telemetry
    "disable": false
  }

  "featureFlags" : {
    "acceptInvalid": false
    "legacyEnrichmentOrder": false
  }
}

Regarding the iglu.json resolver:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 1,
    "repositories": [
     
      {
        "name": "busuu Iglu Repo",
        "priority": 5,
        "vendorPrefixes": [ "com.customstuff" ],
        "connection": {
          "http": {
            "uri": "http://localhost:8000"
          }
        }
      },
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://localhost:8000"
          }
        }
      }
    ]
  }
}

Finally, the IP lookup:

{
    "schema": "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/2-0-0",
    "data": {
        "name": "ip_lookups",
        "vendor": "com.snowplowanalytics.snowplow",
        "enabled": true,
        "parameters": {
            "geo": {
                "database": "GeoLite2-City.mmdb",
                "uri": "s3://custombucket/integration_maxmind_geo_ip"
            }
        }
    }
}

My execution line is this one:

sudo /usr/bin/java -Dorg.slf4j.simpleLogger.defaultLogLevel=debug -Xms512m -Xmx1024m -jar /srv/snowplow/bin/snowplow-enrich-kinesis-3.2.2.jar --enrichments /srv/snowplow/data/enrichments --iglu-config /srv/snowplow/conf/iglu.json --config /srv/snowplow/conf/enrich_config.hocon

But now I'm getting these errors:

  • The first one is that I'm missing some native libraries:
java.lang.IllegalArgumentException: Failed to load any of the given libraries: [netty_tcnative_linux_x86_64, netty_tcnative_linux_x86_64_fedora, netty_tcnative_x86_64, netty_tcnative]
  • The second one is about AWS credentials, while downloading a file from the bucket:
etInterceptor@46994ac1, software.amazon.awssdk.services.s3.internal.handlers.CopySourceInterceptor@7561cd23]
[pool-1-thread-1] DEBUG software.amazon.awssdk.auth.credentials.AwsCredentialsProviderChain - Unable to load credentials from SystemPropertyCredentialsProvider(): Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId).
software.amazon.awssdk.core.exception.SdkClientException: Unable to load credentials from system settings. Access key must be specified either via environment variable (AWS_ACCESS_KEY_ID) or system property (aws.accessKeyId).
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)
	at software.amazon.awssdk.auth.credentials.internal.SystemSettingsCredentialsProvider.resolveCredentials(SystemSettingsCredentialsProvider.java:58)

Even though I have exported my access key, it doesn't seem to be working.
It seems there are other ways, like:
AWS_WEB_IDENTITY_TOKEN_FILE

[pool-1-thread-1] DEBUG software.amazon.awssdk.auth.credentials.AwsCredentialsProviderChain - Unable to load credentials from ProfileCredentialsProvider(): Profile file contained no credentials for profile 'default': ProfileFile(profiles=[])

AWS_CONTAINER_CREDENTIALS_FULL_URI or AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variables are set.

But AWS_ACCESS_KEY_ID didn't work for me.

So, to sum up: right now the collector is working without adding any kind of AWS information, except for this bit of config:

      aws {
        accessKey = "iam"
        secretKey = "iam"
      }

If I change the IP lookup file to:

		"parameters": {
			"geo": {
				"database": "GeoLite2-City.mmdb",
				"uri": "http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind"
			}
		}

Then it does download the file, since I guess that bucket is somehow public, but I cannot read from the shards of my data stream because I have not configured the credentials. How do I do that?

not authorized to perform: kinesis:ListShards on resource:because no identity-based policy allows the kinesis:ListShards action (Service: Kinesis, Status Code: 400

Thank you in advance!!!

Hi @RaulCC ,

Great!

It is the recommended way. If you’re worried that your schemas will be accessible by everyone, there are a couple of things that you could do:

  1. Configure your network so that only Snowplow applications can communicate with the HTTP server exposing the schemas
  2. Set up an Iglu server with an API key
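For option 2, the repository entry in your resolver would look roughly like this (a sketch; the URI and API key are placeholders):

{
  "name": "Private Iglu Server",
  "priority": 0,
  "vendorPrefixes": [ "com.customstuff" ],
  "connection": {
    "http": {
      "uri": "http://<your-iglu-server>/api",
      "apikey": "<your-api-key>"
    }
  }
}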

Provided that your schemas are in /srv/snowplow/data/schemas and that /srv/snowplow/data is in your classpath, having

{
  "name": "local Iglu repository",
  "priority": 5,
  "vendorPrefixes": [ "com.customstuff" ],
  "connection": {
    "embedded": {
      "path": "/data"
    }
  }
}

should work (not tested).
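Whichever root ends up on the classpath, the embedded lookup follows the same schemas/<vendor>/<name>/jsonschema/<version> layout that appeared in your first error, so a quick sanity check on disk would be:

ls /srv/snowplow/data/schemas/com.snowplowanalytics.snowplow/enrichments/jsonschema/1-0-0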

It seems that the machine where you run enrich is missing some native libraries. You might want to check this page to install them. Otherwise you could run enrich via its Docker image instead of the java command (instructions here).
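For the Docker route, something along these lines should work (not tested; the image tag and the mount paths are assumptions based on your setup):

docker run --rm \
  -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
  -v /srv/snowplow/conf:/snowplow/conf \
  -v /srv/snowplow/data:/snowplow/data \
  snowplow/snowplow-enrich-kinesis:3.2.2 \
  --config /snowplow/conf/enrich_config.hocon \
  --iglu-config /snowplow/conf/iglu.json \
  --enrichments /snowplow/data/enrichments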

Have you set both AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID? I can confirm that this is working.
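For example (placeholder values), exported in the same shell that then launches enrich:

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>

One thing to watch out for: your execution line uses sudo, and sudo does not pass exported variables through to the child process by default (sudo -E preserves them, or set them for the root user).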

Please note that you’re expected to use your own versions of MaxMind databases.

Hi again @BenB !!!

Thanks a lot for your help.
I managed to add the credentials to the machine, finally.

Have you set both AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID? I can confirm that this is working.

I tried this before setting up the credentials in .aws, and the env variables did not work. Anyway, fixed!

Please note that you’re expected to use your own versions of MaxMind databases.

It was a test, since we did not have connectivity to our buckets. We have now purchased all the MaxMind enterprise databases.

We would like to run the enricher locally without using a server to expose the JSONs.
We have all the JSON schemas from Iglu Central, kept up to date, plus our own schemas in /srv/snowplow/data/schemas (we read this doc beforehand: https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/iglu/iglu-repositories/jvm-embedded-repo/)

About the paths, as you already know:

/srv/snowplow/ 
/srv/snowplow/conf --> Configuration files
/srv/snowplow/bin --> Binary files where the .jar files are
/srv/snowplow/data --> Two main folders here:
/srv/snowplow/data/schemas/ --> A copy of Iglu Central + our custom JSONs under the `com.customstuff` folder
/srv/snowplow/data/enrichments/ --> Ip-lookup.json for GeoIP2-City.mmdb

Regarding the initial solution of exposing the schemas, the Python 3 http.server works fine, but only if I launch it inside the folder /srv/snowplow/data.

As you stated

Provided that your schemas are in /srv/snowplow/data/schemas and that /srv/snowplow/data is in your classpath, having

We have added the path to .bashrc (and also to .profile) for both the current user and the root user, just in case, but the enricher keeps saying it cannot find the schemas folder.

The Iglu file:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 1,
    "repositories": [
      {
        "name": "Iglu Central local",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "embedded": {
            "path": "/data"
          }
        }
      },
      {
        "name": "Custom Iglu Repo",
        "priority": 5,
        "vendorPrefixes": [ "com.customstuff" ],
        "connection": {
          "embedded": {
            "path": "/data"
          }
        }
      }
    ]
  }
}

And the start_enricher.sh:

/usr/bin/java -Xms512m -Xmx1024m -jar /srv/snowplow/bin/snowplow-stream-enrich-kinesis-3.2.2.jar --config /srv/snowplow/conf/enrich.conf --resolver file:/srv/snowplow/conf/iglu.json --enrichments file:/srv/snowplow/data/enrichments > /dev/null 2>&1

The output error:

{
  "error": "ResolutionError",
  "lookupHistory": [
    {
      "repository":"Iglu Central local"
      "errors": [
        {
          "error": "NotFound"
        }
      ],
      "attempts": 1,
      "lastAttempt": "2022-08-01T12:25:54.865Z"
    },
    {
      "repository": "Iglu Client Embedded",
      "errors": [
        {
          "error": "NotFound"
        }
      ],
      "attempts": 1,
      "lastAttempt": "2022-08-01T12:25:54.875Z"
    },
    {
      "repository": "Custom Iglu Repo ",
      "errors": [
        {
          "error": "NotFound"
        }
      ],
      "attempts": 1,
      "lastAttempt": "2022-08-01T12:25:54.883Z"
    }
  ]
}

Does the enricher internally use the $PATH variable to discover local JSON schemas?
How do I force it to recognize the /data path, and thus the /schemas folder inside /srv/snowplow/data?
Running echo $PATH as both users shows that the /srv/snowplow/data path is in the variable!

Thanks in advance Ben!!!

Hi @RaulCC,

Even if you don’t expose your own schemas via HTTP, you could still use Iglu Central for the other ones, so that you don’t need to manually keep your copy in sync.

No, $PATH is used only to run commands (e.g. java, ls, cp …). Java uses something different called the CLASSPATH.

You might be able to override it by adding this option to your java command: -cp .:$CLASSPATH:/srv/snowplow/data, so the whole command would look like this:

/usr/bin/java -Xms512m -Xmx1024m -cp .:$CLASSPATH:/srv/snowplow/data -jar /srv/snowplow/bin/snowplow-stream-enrich-kinesis-3.2.2.jar --config /srv/snowplow/conf/enrich.conf --resolver file:/srv/snowplow/conf/iglu.json --enrichments file:/srv/snowplow/data/enrichments > /dev/null 2>&1

Then /schemas is automatically added by the Iglu client inside enrich.
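One caveat: when the JVM is started with -jar, it takes the classpath from the jar's manifest and ignores -cp, so if the command above doesn't pick the directory up, an alternative (a sketch, not tested) is to launch the main class explicitly:

# find the real entry point first:
unzip -p /srv/snowplow/bin/snowplow-stream-enrich-kinesis-3.2.2.jar META-INF/MANIFEST.MF | grep Main-Class
# then launch with an explicit classpath, replacing <Main-Class> with the value found above:
/usr/bin/java -Xms512m -Xmx1024m -cp /srv/snowplow/bin/snowplow-stream-enrich-kinesis-3.2.2.jar:/srv/snowplow/data \
  <Main-Class> --config /srv/snowplow/conf/enrich.conf --resolver file:/srv/snowplow/conf/iglu.json --enrichments file:/srv/snowplow/data/enrichments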

(I’m off next week)