Redshift auto-migrations don't work

Hi there,

First of all, I want to thank the developers for introducing auto-migrations for Redshift. This is a really useful feature, but I couldn’t get it to work: RDB Loader does not automatically create tables for new events. Below I describe the steps I took to upgrade.

  1. First I upgraded to R119 (Upgrade Guide · snowplow/snowplow Wiki · GitHub);
  2. Updated the schema version and added blacklistTabular in the Redshift target config (a note on its format follows this list). This is how it looks now:
{
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/4-0-0",
    "data": {
        "id": "5fab7395-35cb-413a-95cd-ebf7zoL9bf22",
        "name": "AWS Redshift enriched events storage",
        "host": "db-host",
        "database": "db-base",
        "port": 5439,
        "jdbc": {
            "ssl": false
        },
        "username": "db-user",
        "password": "db-password",
        "roleArn": "role",
        "schema": "atomic",
        "maxError": 1,
        "compRows": 20000,
        "sshTunnel": null,
        "purpose": "ENRICHED_EVENTS",
        "processingManifest": null,
        "blacklistTabular": null
    }
}
  3. Updated ami_version → 5.19.0, rdb_loader → 0.17.0, rdb_shredder → 0.16.0;
  4. Installed a personal Iglu Server, checked the existing schemas using the rdbms table-check utility, and imported them to the Iglu Server;
  5. Added the server URL to iglu_resolver.json, at the same time updating the resolver’s schema version to 1-0-2:
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-2",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 1,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },
      {
        "name": "Personal Iglu Repository",
        "priority": 0,
        "vendorPrefixes": [ "com.mydomain" ],
        "connection": {
          "http": {
            "uri": "https://iglu.mydomain.com",
            "apikey": "myApiKey"
          }
        }
      }
    ]
  }
}
  6. Started EMR as follows:

./snowplow-emr-etl-runner run -c ./config.yml -r ./iglu_resolver.json -t ./targets
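
A note on step 2: as far as I understand the 4-0-0 storage target schema, blacklistTabular takes a list of Iglu schema criterions naming the types that should keep the legacy JSON output rather than TSV. The fragment below is only a sketch from my reading of the R119 notes (the criterion itself is made up), not verbatim documentation:

"blacklistTabular": [
    "iglu:com.mydomain/LegacyEvent/jsonschema/1-*-*"
]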

As a result, everything ran without errors; however, the table for the new event (whose schema had previously been added to the Iglu Server) was not created automatically. What have I done wrong? Maybe I missed some flags or configs?

Thanks in advance for any help.

Hi @megasend,

First of all, I want to thank the developers for introducing auto-migrations for Redshift.

Thanks! It was indeed a long-awaited feature and a big effort to implement.

I think what happened in your case is that all those events went to the bad bucket. In our experience, the most common reason is that RDB Shredder couldn’t fetch the “orderings endpoint” for a particular schema.

Also, was it your own schema or one from Iglu Central? Because if it was from Iglu Central and you don’t have it on your Server, it won’t be able to shred into TSV either.

Anyway, it’s worth checking the bad bucket as a first step.

Hi @anton,
Thanks for your reply!
This is my own schema, not one from Iglu Central:

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for Test Event",
  "self": {
    "vendor": "com.mydomain",
    "name": "TestEvent",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "productId": {
        "description": "Id of the product",
        "type": ["integer"],
        "minimum": 0,
        "maximum": 9223372036854775807
    },
    "productName": {
        "description": "Name of the product",
        "type": ["string"],
        "maxLength": 150
    }
  },
  "required": ["productId"],
  "additionalProperties": false
}
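
For completeness, the schema was linted and uploaded roughly like this (the local path, host, and API key are placeholders, and the exact igluctl arguments may differ between versions):

# hypothetical layout: the schema sits at schemas/com.mydomain/TestEvent/jsonschema/1-0-0
igluctl lint schemas/
igluctl static push schemas/ https://iglu.mydomain.com myApiKey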

So far, no events consistent with this schema have been received. I expected that the table for the new schema would be created even if there are no corresponding events. Does it work like that? Or is the presence of events required? (I mean, it is unlikely there will be anything in the bad bucket if events have not yet arrived from external systems.)

Yes, the presence of events is required. RDB Loader looks at the shredded/good S3 path and finds out from there what event types / contexts have landed. If you’re sending this TestEvent and it hasn’t hit the shredded/good bucket (because it ended up in enriched/bad or shredded/bad), RDB Loader has no means of knowing about its existence.

You might also want to paste EmrEtlRunner’s output, because it tells us what types it discovered.
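
For instance, something along these lines would show which types landed for a run (the bucket and run ID are placeholders; adjust them to the layout in your config.yml):

# lists the shredded types RDB Loader would discover for the given run
aws s3 ls --recursive s3://my-shredded-bucket/good/run=2020-06-15-10-30-00/ | head -20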

Hi @anton,
Thanks for the tip. I sent an event manually, and it did indeed end up in the bad bucket with the following error:

Could not find schema with key iglu:com.mydomain/TestEvent/jsonschema/1-0-0 in any repository, tried: level: "error" repositories: ["Iglu Central [HTTP]","Iglu Client Embedded [embedded]","Personal Iglu Repository [HTTP]"]

After checking the server access logs, it turned out that the requests are made as if against a static Iglu repository:
[GET] /schemas/com.mydomain/TestEvent/jsonschema/1-0-0

But the correct request (since the Iglu Server exposes a REST API) is as follows:
[GET] /api/schemas/com.mydomain/TestEvent/jsonschema/1-0-0

How can I indicate in the iglu_resolver config that my server is not a static repository but a full-fledged Iglu Server?
Thanks.

Hi @megasend,

You need to append /api to the end of your URI. So, for example, https://registry.com would become https://registry.com/api.

The Resolver expects an endpoint that returns a schema when /schemas/vendor/name/jsonschema/1-0-0 is requested.
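
With the config you posted above, only the connection URI needs to change (host and key kept as your placeholders):

{
  "name": "Personal Iglu Repository",
  "priority": 0,
  "vendorPrefixes": [ "com.mydomain" ],
  "connection": {
    "http": {
      "uri": "https://iglu.mydomain.com/api",
      "apikey": "myApiKey"
    }
  }
}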

Hi @anton,
It helped: the requests for the schemas now go to the correct URLs. However, the problem remains:

{"level":"error","message":"error: Could not find schema with key iglu:com.mydomain/TestEvent/jsonschema/1-0-0 in any repository, tried:\n level: \"error\"\n repositories: [\"Iglu Central [HTTP]\",\"Iglu Client Embedded [embedded]\",\"Personal Iglu Repository [HTTP]\"]\n"}],"failure_tstamp":"2020-06-15T10:36:02.541Z"}

However, now I do not see any requests for this schema in the access logs (it seems the Loader is not even trying to fetch the specified schema). The test event is added to the Collector via its GET endpoint, where ue_pr is the encoded JSON below:

{
  "schema": "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0",
  "data": {
    "schema": "iglu:com.mydomain/TestEvent/jsonschema/1-0-0",
    "data": {
      "productId": 45775,
      "productName": "prname"
    }
  }
}
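
For reproducibility, the event is sent roughly like this (the collector host is a placeholder; ue_pr carries the URL-encoded JSON, while ue_px would carry it base64-encoded instead):

# hypothetical collector host; e=ue marks a self-describing (unstructured) event
curl -G "https://collect.mydomain.com/i" \
  --data-urlencode "e=ue" \
  --data-urlencode "p=srv" \
  --data-urlencode "tv=curl-0.1" \
  --data-urlencode "ue_pr=$(cat test_event.json)"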

Why are there no requests for the test schema in the access logs, yet there is still an error saying it cannot be found?
Thanks!

It’s likely that this has to do with caching. Try restarting Enrich, and let us know whether or not that resolves it. :slight_smile:

Unfortunately, this didn’t help. Moreover, I updated Enrich and the Collector to the latest versions and removed the Iglu Central repository from the iglu_resolver config, since I had imported all public schemas to my own Iglu Server (its repository priority is set to 0). Now I see additional requests for the imported event schemas, but still no requests for my event. Nevertheless, error messages keep appearing in the bad bucket stating that the specified schema could not be found (the error log is shown in the previous post).

Hi @megasend,

Apologies for the delay in responding.

This error can only happen in one scenario: the Enrich job is not receiving the schema when it tries to grab it.

I would troubleshoot by a) ensuring that the schemas have been uploaded to the correct path, and b) ensuring that the Iglu client can actually access them (I notice all repositories are listed as HTTP; perhaps there is some kind of network rule in place forcing HTTPS only?).
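
For (b), a quick check from the environment where Enrich runs would settle it (host and key are the placeholders from your config):

# a TLS error or a non-200 response here would explain the resolution failures
curl -v -H "apikey: myApiKey" \
  "https://iglu.mydomain.com/api/schemas/com.mydomain/TestEvent/jsonschema/1-0-0"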

Best,