RDB Loader on AWS Elastic Beanstalk

Hi everyone, I’m quite new to Snowplow. I managed to get the enriched/good events sinking to S3, and now I’m working on the next step: loading that data into Redshift.

I’ve got some questions:

  1. How do I schedule the EMR jobs with S3DistCp and the Shredder? Is it time-based, i.e. run as cron jobs?
  2. I’m getting an error when deploying the Docker image snowplow/snowplow-rdb-loader:1.1.0 on Elastic Beanstalk.

Here is my config.hocon (should that id be created by my Iglu Server, or can it be any UUID?):

{
  # Human-readable identifier, can be random
  "name": "{{ custom name}}",
  # Machine-readable unique identifier, must be a UUID
  "id": "{{id created on iglu server with create permission}}",

  # Data Lake (S3) region
  "region": "us-west-2",
  # SQS topic name used by Shredder and Loader to communicate
  "messageQueue": "{{name_of_sqs}}.fifo",

  # Shredder-specific configs
  "shredder": {
    # "batch" for Spark job and "stream" for fs2 streaming app
    "type" : "batch",
    # For batch: path to enriched archive (must be populated separately with run=YYYY-MM-DD-hh-mm-ss directories) for S3 input
    "input": "s3://{{path_to_enriched}}/archive/",
    # For stream: appName, streamName, region triple for kinesis
    #"input": {
    #  # kinesis and file are the only options for stream shredder
    #  "type": "kinesis",
    #  # KCL app name - a DynamoDB table will be created with the same name
    #  "appName": "acme-rdb-shredder",
    #  # Kinesis Stream name
    #  "streamName": "enriched-events",
    #  # Kinesis region
    #  "region": "us-east-1",
    #  # Kinesis position: LATEST or TRIM_HORIZON
    #  "position": "LATEST"
    #},
    # For stream shredder : frequency to emit loading finished message - 5,10,15,20,30,60 etc minutes
    # "windowing": "10 minutes",
    # Path to shredded archive
    "output": {
      # Path to shredded output
      "path": "s3://{{path_to_enriched}}/shredded/",
      # Shredder output compression, GZIP or NONE
      "compression": "GZIP"
    }
  },

  # Schema-specific format settings (recommended to leave all three groups empty and use TSV as default)
  "formats": {
    # Format used by default (TSV or JSON)
    "default": "TSV",
    # Schemas to be shredded as JSONs, corresponding JSONPath files must be present. Automigrations will be disabled
    "json": [ ],
    # Schemas to be shredded as TSVs, presence of the schema on Iglu Server is necessary. Automigrations enabled
    "tsv": [ ],
    # Schemas that won't be loaded
    "skip": [ ]
  },

  # Optional. S3 path that holds JSONPaths
  #"jsonpaths": "s3://bucket/jsonpaths/",

  # Warehouse connection details
  "storage" : {
    # Database, redshift is the only acceptable option
    "type": "redshift",
    # Redshift hostname
    "host": "{redshifhostname}.redshift.amazonaws.com",
    # Database name
    "database": "dev",
    # Database port
    "port": 5439,
    # AWS Role ARN allowing Redshift to load data from S3
    "roleArn": "{rolearn}/aws-service-role/redshift.amazonaws.com/AWSServiceRoleForRedshift",
    # DB schema name
    "schema": "atomic",
    # DB user with permissions to load data
    "username": "{{username}}",
    # DB password
    "password": "{{pass}}",
    # Custom JDBC configuration
    "jdbc": {"ssl": true},
    # MAXERROR, amount of acceptable loading errors
    "maxError": 10
  },

  # Additional steps. analyze, vacuum and transit load are valid values
  "steps": ["vacuum"],

  # Observability and reporting options
  "monitoring": {
    # Snowplow tracking (optional)
    "snowplow": {
      "appId": "snowplow",
      "collector": "{{collector}}.elasticbeanstalk.com"
    },

    # Optional, for tracking runtime exceptions
    "sentry": {
      "dsn": ""
    },

    # Optional, configure how metrics are reported
    # "metrics": {
    #   # Optional, send metrics to StatsD server
    #   "statsd": {
    #     "hostname": "localhost",
    #     "port": 8125,
    #     # Any key-value pairs to be tagged on every StatsD metric
    #     "tags": {
    #       "app": "rdb-loader"
    #     }
    #     # Optional, override the default metric prefix
    #     # "prefix": "snowplow.rdbloader."
    #   },

    #   # Optional, print metrics on stdout (with slf4j)
    #   "stdout": {
    #     # Optional, override the default metric prefix
    #     # "prefix": "snowplow.rdbloader."
    #   }
    # }
  }
}

My resolver.json (is it necessary?):

{
    "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
    "data": {
      "cacheSize": 500,
      "repositories": [
        {
          "name": "Iglu Central",
          "priority": 0,
          "vendorPrefixes": [ "com.snowplowanalytics" ],
          "connection": {
            "http": {
              "uri": "{{uri_to_my_iglu_server}}.elasticbeanstalk.com:8080/api/",
              "apikey": "{{my api key}}"
            }
          }
        }
      ]
    }
  }

My docker-compose.yaml to run it on Elastic Beanstalk:

services:
    rdb-loader:
      container_name: rdb-loader
      image: snowplow/snowplow-rdb-loader:1.1.0
      command: [
                    "--config", "{{config.hocon base64 encoded}}",
                    "--iglu-config", "{{resolver base64 encoded}}"
                ]
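For completeness, this is how one can produce those base64 strings (a minimal sketch; the file content below is just a placeholder, point it at your real config.hocon, and do the same with resolver.json for --iglu-config):

```shell
# Sketch: base64-encode the HOCON config for the loader's --config flag.
# The sample content is a placeholder; use your real config.hocon instead.
printf '{"name":"test"}' > config.hocon
CONFIG_B64=$(base64 -w0 config.hocon)   # -w0 disables line wrapping (GNU coreutils)
echo "$CONFIG_B64"
```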

And here is the error I got while deploying:

Attaching to rdb-loader
rdb-loader    | [ioapp-compute-0] WARN io.sentry.dsn.Dsn - *** Couldn't find a suitable DSN, Sentry operations will do nothing! See documentation: https://docs.sentry.io/clients/java/ ***
rdb-loader    | [ioapp-compute-0] WARN io.sentry.DefaultSentryClientFactory - No 'stacktrace.app.packages' was configured, this option is highly recommended as it affects stacktrace grouping and display on Sentry. See documentation: https://docs.sentry.io/clients/java/config/#in-application-stack-frames
rdb-loader    | [ioapp-compute-0] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - Sentry has been initialised at
rdb-loader    | [ioapp-compute-0] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - RDB Loader 1.1.0 [Torre Redshift Loader] has started. Listening snowplow-rdb-loader-queue.fifo
rdb-loader    | [ioapp-compute-0] ERROR com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - Loader shutting down
rdb-loader    | java.lang.NullPointerException
rdb-loader    | 	at com.amazon.redshift.core.jdbc42.S42NotifiedConnection.setAutoCommit(Unknown Source)
rdb-loader    | 	at doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$setAutoCommit$1(kleisliinterpreter.scala:800)
rdb-loader    | 	at doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$setAutoCommit$1$adapted(kleisliinterpreter.scala:800)
rdb-loader    | 	at doobie.free.KleisliInterpreter.$anonfun$primitive$2(kleisliinterpreter.scala:109)
rdb-loader    | 	at blockOn$extension @ doobie.free.KleisliInterpreter.$anonfun$primitive$1(kleisliinterpreter.scala:112)
rdb-loader    | 	at $anonfun$tailRecM$1 @ doobie.util.transactor$Transactor$$anon$4.$anonfun$apply$4(transactor.scala:167)
rdb-loader    | 	at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
rdb-loader    | 	at $anonfun$tailRecM$1 @ doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$bracketCase$28(kleisliinterpreter.scala:750)
rdb-loader    | 	at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
rdb-loader    | 	at bracketCase @ doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$bracketCase$28(kleisliinterpreter.scala:750)
rdb-loader    | 	at $anonfun$tailRecM$1 @ doobie.util.transactor$Transactor$$anon$4.$anonfun$apply$4(transactor.scala:167)
rdb-loader    | 	at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
rdb-loader    | 	at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
rdb-loader    | 	at bracketCase @ doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$bracketCase$28(kleisliinterpreter.scala:750)
rdb-loader    | 	at $anonfun$tailRecM$1 @ doobie.util.transactor$Transactor$$anon$4.$anonfun$apply$4(transactor.scala:167)
rdb-loader    | 	at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
rdb-loader    | 	at use @ com.snowplowanalytics.snowplow.rdbloader.Main$.run(Main.scala:36)
rdb-loader    | [cats-effect-blocker-0] INFO org.http4s.client.PoolManager - Shutting down connection pool: curAllocated=1 idleQueues.size=1 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
rdb-loader exited with code 1

Any help would be truly appreciated! Thanks!

Hi @rmichaelvp,

Sorry, I can’t help you with Elastic Beanstalk, as we run RDB Loader mostly on AWS Fargate, but your error indicates that you’re very close to loading the data.

Despite the slightly cryptic error message (we’re working on it), it seems the problem is simply that the loader cannot acquire a connection (setAutoCommit is the first thing it does). I’d advise you to:

  1. Check your host - it should not contain anything like jdbc: or :5439, just the plain hostname.
  2. Check your permissions - perhaps your EB node (or whatever it’s called) cannot connect to Redshift.
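A quick way to check both points from the EB instance itself (a sketch, assuming psql is installed; Redshift speaks the PostgreSQL wire protocol, and the host/user values below are placeholders to replace with your own):

```shell
# Sketch: confirm the Redshift endpoint is reachable and the credentials work.
# Placeholders: replace host, dbname and user with your own values.
psql "host={{redshift_host}}.redshift.amazonaws.com port=5439 dbname=dev user={{username}} sslmode=require" \
  -c 'SELECT 1;'
```

If this fails to connect, the problem is networking or security groups rather than the loader itself.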

Thanks @anton! I’ll check it out and let you know.

Update here:

  1. I checked the host - it’s the same one I use in my database manager.
  2. I know it’s bad practice, but even when I allow all connections it throws the same error; I even tried with jdbc.ssl = false.

Do you think it might be the roleArn?

Hi @anton, it looks like it was an IAM role permission issue, and now it’s connecting successfully. But now I’ve hit another issue that I don’t really understand.

Could you please take a look?

Attaching to snowplow-rdb-loader
snowplow-rdb-loader | [ioapp-compute-0] WARN io.sentry.DefaultSentryClientFactory - No 'stacktrace.app.packages' was configured, this option is highly recommended as it affects stacktrace grouping and display on Sentry. See documentation: https://docs.sentry.io/clients/java/config/#in-application-stack-frames
snowplow-rdb-loader | [ioapp-compute-0] WARN io.sentry.DefaultSentryClientFactory - No 'stacktrace.app.packages' was configured, this option is highly recommended as it affects stacktrace grouping and display on Sentry. See documentation: https://docs.sentry.io/clients/java/config/#in-application-stack-frames
snowplow-rdb-loader | [ioapp-compute-0] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - Sentry has been initialised at https://db2f4fc08db84248a30fd54f91f2f128@o188050.ingest.sentry.io/1285240
snowplow-rdb-loader | [ioapp-compute-0] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - RDB Loader 1.1.0 [Torre Redshift Loader] has started. Listening rdb-loader.fifo
snowplow-rdb-loader | [ioapp-compute-0] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - Received new message. Total 1 messages received, 0 loaded, 0 attempts has been made to load current folder
snowplow-rdb-loader | [ioapp-compute-0] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - New data discovery at run=2021-07-03-05-19-33 with following shredded types:
snowplow-rdb-loader |   * iglu:com.snowplowanalytics.snowplow/atomic/jsonschema/1-*-* TSV
snowplow-rdb-loader |   * iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-*-* TSV
snowplow-rdb-loader |   * iglu:org.w3/PerformanceTiming/jsonschema/1-*-* TSV
snowplow-rdb-loader |   * iglu:com.snowplowanalytics.monitoring.batch/load_failed/jsonschema/1-*-* TSV
snowplow-rdb-loader |   * iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-*-* TSV
snowplow-rdb-loader |   * iglu:com.google.analytics/cookies/jsonschema/1-*-* TSV
snowplow-rdb-loader | [ioapp-compute-0] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - Creating atomic.com_snowplowanalytics_snowplow_web_page_1 table for iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-0-0
snowplow-rdb-loader | [ioapp-compute-0] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - Table created
snowplow-rdb-loader | [ioapp-compute-0] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - Could not load a folder (base s3://torre-snowplow-enriched/shredded/run=2021-07-03-05-19-33/), trying to ack the SQS command
snowplow-rdb-loader | [ioapp-compute-1] ERROR com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - Loader shutting down
snowplow-rdb-loader | doobie.util.invariant$NonNullableColumnRead: SQL `NULL` read at column 1 (JDBC type Char) but mapping is to a non-Option type; use Option here. Note that JDBC column indexing is 1-based.
snowplow-rdb-loader | 	at doobie.util.Get.unsafeGetNonNullable(get.scala:29)
snowplow-rdb-loader | 	at doobie.util.Read$.$anonfun$fromGet$1(read.scala:74)
snowplow-rdb-loader | 	at doobie.util.Read$.$anonfun$fromGet$1$adapted(read.scala:74)
snowplow-rdb-loader | 	at doobie.util.Read.$anonfun$get$1(read.scala:55)
snowplow-rdb-loader | 	at doobie.free.KleisliInterpreter.$anonfun$primitive$2(kleisliinterpreter.scala:109)
snowplow-rdb-loader | 	at blockOn$extension @ doobie.free.KleisliInterpreter.$anonfun$primitive$1(kleisliinterpreter.scala:112)
snowplow-rdb-loader | 	at $anonfun$tailRecM$1 @ doobie.util.transactor$Transactor$$anon$4.$anonfun$apply$4(transactor.scala:167)
snowplow-rdb-loader | 	at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
snowplow-rdb-loader | 	at $anonfun$tailRecM$1 @ doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$bracketCase$28(kleisliinterpreter.scala:750)
snowplow-rdb-loader | 	at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
snowplow-rdb-loader | 	at blockOn$extension @ doobie.free.KleisliInterpreter.$anonfun$primitive$1(kleisliinterpreter.scala:112)
snowplow-rdb-loader | 	at $anonfun$tailRecM$1 @ doobie.util.transactor$Transactor$$anon$4.$anonfun$apply$4(transactor.scala:167)
snowplow-rdb-loader | 	at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
snowplow-rdb-loader | 	at $anonfun$tailRecM$1 @ doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$bracketCase$28(kleisliinterpreter.scala:750)
snowplow-rdb-loader | 	at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
snowplow-rdb-loader | 	at $anonfun$tailRecM$1 @ doobie.util.transactor$Transactor$$anon$4.$anonfun$apply$4(transactor.scala:167)
snowplow-rdb-loader | 	at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
snowplow-rdb-loader | 	at $anonfun$tailRecM$1 @ doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$bracketCase$28(kleisliinterpreter.scala:750)
snowplow-rdb-loader | [cats-effect-blocker-4] INFO org.http4s.client.PoolManager - Shutting down connection pool: curAllocated=1 idleQueues.size=1 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
snowplow-rdb-loader exited with code 1

Hi @rmichaelvp ,

Have you created the events table? Instructions can be found here.
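One way to double-check (a sketch; assumes psql and the same placeholder connection values as above) is to list what actually exists in the atomic schema:

```shell
# Sketch: list the tables in the atomic schema to verify events/manifest exist.
# Placeholders: replace host, dbname and user with your own values.
psql "host={{redshift_host}}.redshift.amazonaws.com port=5439 dbname=dev user={{username}} sslmode=require" \
  -c "SELECT table_name FROM information_schema.tables WHERE table_schema = 'atomic';"
```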

Hi @BenB,
Yes, I created the events and manifest tables, and even the tables that appear in the log, but I’m still getting the same error:

  1. New data discovery with the following shredded types: …
  2. “Table created” no longer appears; instead it goes straight to “Could not load a folder (base s3: …), trying to ack the SQS command”.
  3. The loader shuts down.
  4. SQL `NULL` read error.

Attaching to snowplow-rdb-loader
snowplow-rdb-loader | [ioapp-compute-0] WARN io.sentry.DefaultSentryClientFactory - No 'stacktrace.app.packages' was configured, this option is highly recommended as it affects stacktrace grouping and display on Sentry. See documentation: https://docs.sentry.io/clients/java/config/#in-application-stack-frames
snowplow-rdb-loader | [ioapp-compute-0] WARN io.sentry.DefaultSentryClientFactory - No 'stacktrace.app.packages' was configured, this option is highly recommended as it affects stacktrace grouping and display on Sentry. See documentation: https://docs.sentry.io/clients/java/config/#in-application-stack-frames
snowplow-rdb-loader | [ioapp-compute-0] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - Sentry has been initialised at https://db2f4fc08db84248a30fd54f91f2f128@o188050.ingest.sentry.io/1285240
snowplow-rdb-loader | [ioapp-compute-0] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - RDB Loader 1.1.0 [ Redshift Loader] has started. Listening rdb-loader.fifo
snowplow-rdb-loader | [ioapp-compute-0] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - Received new message. Total 1 messages received, 0 loaded, 0 attempts has been made to load current folder
snowplow-rdb-loader | [ioapp-compute-0] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - New data discovery at run=2021-07-06-09-50-51 with following shredded types:
snowplow-rdb-loader |   * iglu:com.snowplowanalytics.snowplow/atomic/jsonschema/1-*-* TSV
snowplow-rdb-loader |   * iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-*-* TSV
snowplow-rdb-loader |   * iglu:org.w3/PerformanceTiming/jsonschema/1-*-* TSV
snowplow-rdb-loader |   * iglu:com.snowplowanalytics.monitoring.batch/load_failed/jsonschema/1-*-* TSV
snowplow-rdb-loader |   * iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-*-* TSV
snowplow-rdb-loader |   * iglu:com.google.analytics/cookies/jsonschema/1-*-* TSV
snowplow-rdb-loader | [ioapp-compute-1] INFO com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - Could not load a folder (base s3://snowplow-enriched/shredded/run=2021-07-06-09-50-51/), trying to ack the SQS command
snowplow-rdb-loader | [ioapp-compute-1] ERROR com.snowplowanalytics.snowplow.rdbloader.dsl.Logging.$anon - Loader shutting down
snowplow-rdb-loader | doobie.util.invariant$NonNullableColumnRead: SQL `NULL` read at column 1 (JDBC type Char) but mapping is to a non-Option type; use Option here. Note that JDBC column indexing is 1-based.
snowplow-rdb-loader |    at doobie.util.Get.unsafeGetNonNullable(get.scala:29)
snowplow-rdb-loader |    at doobie.util.Read$.$anonfun$fromGet$1(read.scala:74)
snowplow-rdb-loader |    at doobie.util.Read$.$anonfun$fromGet$1$adapted(read.scala:74)
snowplow-rdb-loader |    at doobie.util.Read.$anonfun$get$1(read.scala:55)
snowplow-rdb-loader |    at doobie.free.KleisliInterpreter.$anonfun$primitive$2(kleisliinterpreter.scala:109)
snowplow-rdb-loader |    at blockOn$extension @ doobie.free.KleisliInterpreter.$anonfun$primitive$1(kleisliinterpreter.scala:112)
snowplow-rdb-loader |    at $anonfun$tailRecM$1 @ doobie.util.transactor$Transactor$$anon$4.$anonfun$apply$4(transactor.scala:167)
snowplow-rdb-loader |    at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
snowplow-rdb-loader |    at $anonfun$tailRecM$1 @ doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$bracketCase$28(kleisliinterpreter.scala:750)
snowplow-rdb-loader |    at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
snowplow-rdb-loader |    at blockOn$extension @ doobie.free.KleisliInterpreter.$anonfun$primitive$1(kleisliinterpreter.scala:112)
snowplow-rdb-loader |    at $anonfun$tailRecM$1 @ doobie.util.transactor$Transactor$$anon$4.$anonfun$apply$4(transactor.scala:167)
snowplow-rdb-loader |    at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
snowplow-rdb-loader |    at $anonfun$tailRecM$1 @ doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$bracketCase$28(kleisliinterpreter.scala:750)
snowplow-rdb-loader |    at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
snowplow-rdb-loader |    at $anonfun$tailRecM$1 @ doobie.util.transactor$Transactor$$anon$4.$anonfun$apply$4(transactor.scala:167)
snowplow-rdb-loader |    at tailRecM @ retry.package$RetryingOnSomeErrorsPartiallyApplied.apply(package.scala:96)
snowplow-rdb-loader |    at $anonfun$tailRecM$1 @ doobie.free.KleisliInterpreter$ConnectionInterpreter.$anonfun$bracketCase$28(kleisliinterpreter.scala:750)
snowplow-rdb-loader | [ioapp-compute-1] WARN io.sentry.connection.AsyncConnection - Graceful shutdown took too much time, forcing the shutdown.
snowplow-rdb-loader | [ioapp-compute-1] WARN io.sentry.connection.AsyncConnection - 0 tasks failed to execute before shutdown.
snowplow-rdb-loader | [cats-effect-blocker-3] INFO org.http4s.client.PoolManager - Shutting down connection pool: curAllocated=1 idleQueues.size=1 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
snowplow-rdb-loader exited with code 1

Just in case: the role for the Redshift DB has SQS, S3, and Redshift permissions.

Hi @rmichaelvp ,

Could you share the content of run=2021-07-06-09-50-51/shredding_complete.json (in the shredded output) and check that each subfolder in output=good/ contains data, please?
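For example, something along these lines (a sketch using the bucket and run folder from your logs above; requires the AWS CLI with read access to that bucket):

```shell
# Sketch: print the shredding_complete message ("-" streams the object to stdout)
# and list the contents of the shredded good output for that run.
aws s3 cp "s3://snowplow-enriched/shredded/run=2021-07-06-09-50-51/shredding_complete.json" -
aws s3 ls --recursive "s3://snowplow-enriched/shredded/run=2021-07-06-09-50-51/output=good/"
```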