RDB Loader can't connect to Iglu server or Iglu central?

Hi!

We’re trying to set up the RDB Loader following this guide, but we’re getting some strange error messages regarding the Iglu server. We set up the Iglu server following this guide and added our custom contexts as schemas. (Please also note that the link to the configuration guide in this section is broken.)

Data discovery error with following issues:
Cannot get schemas for iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-*-*  {"schemaCriterion":"iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-*-*","error":{"error":"ResolutionError","lookupHistory":[{"repository":"Custom Iglu Server","errors":[{"error":"ClientFailure","message":"Error connecting to http://iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com using address iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com:80 (unresolved: false)"}],"attempts":1,"lastAttempt":"2021-10-07T07:34:01.829Z"},{"repository":"Iglu Central","errors":[{"error":"NotFound"}],"attempts":1,"lastAttempt":"2021-10-07T07:33:51.651Z"},{"repository":"Iglu Client Embedded","errors":[{"error":"NotFound"}],"attempts":1,"lastAttempt":"2021-10-07T07:34:01.833Z"}]}}
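
A payload like the one above is easier to read once the lookupHistory is flattened to one line per repository and error. The following is a minimal Python sketch, not part of RDB Loader; the function name summarize_lookup_history is hypothetical, and the sample is trimmed from the error above (timestamps dropped, connection message shortened).

```python
import json
from typing import List, Tuple


def summarize_lookup_history(payload: str) -> List[Tuple[str, str, str]]:
    """Return (repository, error, message) for each error in a ResolutionError payload."""
    doc = json.loads(payload)
    rows = []
    for entry in doc["error"]["lookupHistory"]:
        for err in entry["errors"]:
            # "message" is only present for some error types (e.g. ClientFailure).
            rows.append((entry["repository"], err["error"], err.get("message", "")))
    return rows


# Trimmed copy of the payload from the error message above.
sample = """
{"schemaCriterion": "iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-*-*",
 "error": {"error": "ResolutionError", "lookupHistory": [
   {"repository": "Custom Iglu Server",
    "errors": [{"error": "ClientFailure",
                "message": "Error connecting to http://iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com"}],
    "attempts": 1},
   {"repository": "Iglu Central", "errors": [{"error": "NotFound"}], "attempts": 1},
   {"repository": "Iglu Client Embedded", "errors": [{"error": "NotFound"}], "attempts": 1}
 ]}}
"""

for repository, error, message in summarize_lookup_history(sample):
    print(repository, error, message)
```

Read this way, the custom server failed at the connection level, while Iglu Central and the embedded repository answered but did not have what was asked for.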

Our iglu-resolver looks like this:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },
      {
        "name": "Custom Iglu Server",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com/",
            "apikey": "<API-KEY>"
          }
        }
      }
    ]
  }
}

We’ve also tried suffixing the URI with /api, but it gives a similar error.

A health check returns OK: curl iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com/api/meta/health

There are similar errors from the shredder:

"error":"RepoFailure","message":"no protocol: iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com/api/schemas/com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1"}],"attempts":1,"lastAttempt":"2021-10-07T08:31:51.024Z"
{"repository":"Iglu Central","errors":[{"error":"NotFound"}],"attempts":1,"lastAttempt":"2021-10-07T08:31:50.980Z"}
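
The "no protocol" wording in the shredder error is the message Java's URL parser produces for a URL without a scheme, so one possibility is that the resolver passed to the shredder had a uri without http://. The sketch below checks a resolver config for that; it is an illustration only, the function name repos_missing_scheme and the host iglu.example.com are hypothetical, and it assumes the resolver layout shown in this thread.

```python
from typing import List
from urllib.parse import urlparse


def repos_missing_scheme(resolver: dict) -> List[str]:
    """Return the names of HTTP repositories whose uri lacks an http/https scheme."""
    missing = []
    for repo in resolver["data"]["repositories"]:
        http = repo.get("connection", {}).get("http")
        if http is None:
            continue  # not an HTTP repository (e.g. an embedded one)
        if urlparse(http["uri"]).scheme not in ("http", "https"):
            missing.append(repo["name"])
    return missing


# Hypothetical resolver fragment: the second uri has no scheme.
resolver = {
    "data": {
        "repositories": [
            {"name": "With scheme",
             "connection": {"http": {"uri": "http://iglu.example.com/api"}}},
            {"name": "Without scheme",
             "connection": {"http": {"uri": "iglu.example.com/api"}}},
        ]
    }
}

print(repos_missing_scheme(resolver))
```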

Are you updating the <ACCOUNT>.<REGION> placeholders to the correct ones for your Iglu Server?

Yes! I just censored them here.

What’s very strange is also that it complains about Iglu Central, which is hosted by you?

And we’re not supposed to store a local copy of the standard schemas (web_page etc) in our custom Iglu Server?

Another thing I’ve been wondering about is the extremely slow performance of the Iglu server.

% time curl iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com/api/schemas/se.kry -X GET -H "apikey: <KEY>"
0.01s user 0.01s system 0% cpu 1:15.61 total

1:15.6 is extremely slow, isn’t it? Could it be a timeout issue? But then again, it can’t reach the public Iglu Central either.
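
To test the timeout theory, the same request can be repeated with an explicit client-side timeout, so a slow server fails fast instead of hanging for over a minute. This is a minimal sketch using Python's standard library; timed_get and its opener parameter are hypothetical names, and the apikey header simply mirrors the curl command above. Note that urlopen's timeout applies to each blocking socket operation (connect, read), not to the request as a whole.

```python
import time
import urllib.request


def timed_get(url, api_key, timeout=10.0, opener=urllib.request.urlopen):
    """GET url with an apikey header; return (HTTP status, elapsed seconds).

    Raises an exception if the connection fails or a socket operation
    takes longer than `timeout` seconds.
    """
    request = urllib.request.Request(url, headers={"apikey": api_key})
    start = time.monotonic()
    with opener(request, timeout=timeout) as response:
        response.read()  # include the body transfer in the measurement
        status = response.status
    return status, time.monotonic() - start


# Example call (placeholders as in the curl command above):
# timed_get("http://iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com/api/schemas/se.kry", "<KEY>")
```

If the call raises after roughly the timeout, the time is going into connecting or waiting for the response rather than into transferring a large body.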

Currently, Iglu Central doesn’t support the schema list endpoints that an Iglu Server supports. It will start supporting them after 1st November 2021. Until then, you need to use your private Iglu server with RDB Loader. Can you confirm you copied the Iglu Central schemas to your private Iglu server? If you haven’t done that yet, you can follow this documentation. Let me know if this solves your problem.
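
To illustrate what a list request looks like, the sketch below turns a schema criterion such as the one in the loader error into the kind of URL that appears in the shredder error earlier in the thread (ending in .../jsonschema/1). The path shape is inferred from that error message only, not from the Iglu Server API reference, and list_endpoint and iglu.example.com are hypothetical names.

```python
def list_endpoint(base_uri: str, criterion: str) -> str:
    """Build the schema-list URL for a criterion like iglu:vendor/name/format/1-*-*.

    base_uri is the resolver uri including /api; only the MODEL part of the
    version (the first number) is kept.
    """
    prefix = "iglu:"
    if not criterion.startswith(prefix):
        raise ValueError(f"not an Iglu criterion: {criterion}")
    vendor, name, fmt, version = criterion[len(prefix):].split("/")
    model = version.split("-")[0]
    return f"{base_uri.rstrip('/')}/schemas/{vendor}/{name}/{fmt}/{model}"


print(list_endpoint(
    "http://iglu.example.com/api",
    "iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-*-*",
))
```

Requesting the resulting URL with curl against the private server is a quick way to confirm that the copied Iglu Central schemas are visible under the list endpoint.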

Ohh, we didn’t realise we had to do this.

Please note that this page links to itself and does not explain what MY-IGLU-URL or 00000000-0000-0000-0000-000000000000 is. After experimenting, we found that they are iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com (without /api) and the master API key (not the write key).

For further information on Iglu Central, consult the Iglu Central setup guide.

We’re now trying to run the shredder with this Iglu resolver:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Custom Iglu Server",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com/api",
            "apikey": "XXX"
          }
        }
      }
    ]
  }
}

The server now has the public schemas, which we can verify with curl. We’re getting “all steps completed successfully”, but only bad output. If we inspect the shredder bad rows in S3, they still seem to indicate a connection error:

            {
                "schemaKey": "iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0",
                "error": {
                    "error": "ResolutionError",
                    "lookupHistory": [
                        {
                            "repository": "Custom Iglu Server",
                            "errors": [
                                {
                                    "error": "RepoFailure",
                                    "message": "connect timed out"
                                }
                            ],
                            "attempts": 1,
                            "lastAttempt": "2021-10-08T09:32:42.131Z"
                        },
                        {
                            "repository": "Iglu Client Embedded",
                            "errors": [
                                {
                                    "error": "NotFound"
                                }
                            ],
                            "attempts": 1,
                            "lastAttempt": "2021-10-08T09:32:42.143Z"
                        }
                    ]
                }
            },

What could be wrong? Could it be a permissions issue in AWS, where we can curl the Iglu server ourselves but it can’t be reached from EMR?

If you’re sure the API key is correct, then yeah this smells to me like a permissions/network issue.

Likely, it’s the permissions associated with the EMR job itself, rather than the Iglu instance. Could be network ports or IAM roles.

These can be tricky to figure out and debug, so I think it might be worthwhile to spin up an EMR cluster and ssh in to manually check the connection, if you don’t find anything obvious from reviewing the permissions on the cluster.
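
When checking manually from the cluster, it helps to separate a raw TCP problem from an HTTP or API key problem. A "connect timed out" commonly means packets are being dropped somewhere (for example by a security group or network ACL), whereas a refused connection fails immediately. The following is a minimal sketch to run on an EMR node; check_tcp is a hypothetical helper, and port 80 is assumed from the http:// URI in the resolver.

```python
import socket


def check_tcp(host, port, timeout=5.0):
    """Try a plain TCP connection and classify the outcome.

    Returns "open", "timeout", or "error: <reason>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer within `timeout`: often a firewall silently dropping packets.
        return "timeout"
    except OSError as exc:
        # Includes DNS failures and refused connections.
        return f"error: {exc}"


# Example call (placeholder host as used elsewhere in this thread):
# print(check_tcp("iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com", 80))
```

If this reports "open" quickly but the HTTP request is still slow, the network path is probably fine and the delay is more likely on the server side.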

We’ve now tried that. You can curl the Iglu server/load balancer, but it takes 2 minutes to complete.

time curl iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com/api/schemas/com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0 -X GET -H "apikey: <READ KEY>"
{"$schema":"http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#","self":{"vendor":"com.snowplowanalytics.snowplow","name":"ua_parser_context","format":"jsonschema","version":"1-0-0"},"description":"Schema for useragent context generated by ua-parser enrichment","properties":{"useragentFamily":{"type":"string"},"useragentMajor":{"type":["string","null"]},"useragentMinor":{"type":["string","null"]},"useragentPatch":{"type":["string","null"]},"useragentVersion":{"type":"string"},"osFamily":{"type":"string"},"osMajor":{"type":["string","null"]},"osMinor":{"type":["string","null"]},"osPatch":{"type":["string","null"]},"osPatchMinor":{"type":["string","null"]},"osVersion":{"type":"string"},"deviceFamily":{"type":"string"}},"additionalProperties":false,"type":"object","required":["useragentFamily","useragentMajor","useragentMinor","osFamily","deviceFamily"]}
real	2m10.360s
user	0m0.007s
sys	0m0.007s

Should it really be taking so long? Maybe I should raise this as another thread. Update: Did that.