EmrEtlRunner not loading data into RedShift

Hi. I’m having issues getting data loaded into RedShift.

I’m using the Scala Collector, Scala Stream Enricher, S3 Loader and EmrEtlRunner (version 0.34.2). The hits are appearing in the appropriate S3 bucket for EmrEtlRunner to process. However, after the EmrEtlRunner process runs, no data is loaded into RedShift and no errors are logged or displayed.

The command I’m using to start the EmrEtlRunner process:

$ ./snowplow-emr-etl-runner run -c /home/ubuntu/configs/config_emr_etl_runner.yml -r resolver.js -t /home/ubuntu/targets

Output from the EmrEtlRunner command:

uri:classloader:/gems/avro-1.8.1/lib/avro/schema.rb:350: warning: constant ::Fixnum is deprecated
uri:classloader:/gems/json-schema-2.7.0/lib/json-schema/util/array_set.rb:18: warning: constant ::Fixnum is deprecated
D, [2019-06-20T15:20:17.002891 #10172] DEBUG -- : Initializing EMR jobflow
D, [2019-06-20T15:20:19.525834 #10172] DEBUG -- : EMR jobflow j-16QCOF4O410G0 started, waiting for jobflow to complete...
I, [2019-06-20T16:00:28.183705 #10172] INFO -- : RDB Loader logs
D, [2019-06-20T16:00:28.195534 #10172] DEBUG -- : Downloading s3://snowplow-emr-log/rdb-loader/2019-06-20-15-20-17/fcb2400a-2fc6-40dd-9254-e28f7a6e8275 to /tmp/rdbloader20190620-10172-jzwhho
I, [2019-06-20T16:00:28.261527 #10172] INFO -- : AWS Redshift enriched events storage
I, [2019-06-20T16:00:28.271089 #10172] INFO -- : RDB Loader successfully completed following steps: [Discover]
D, [2019-06-20T16:00:28.464189 #10172] DEBUG -- : EMR jobflow j-16QCOF4O410G0 completed successfully.
I, [2019-06-20T16:00:28.472809 #10172] INFO -- : Completed successfully

EmrEtlRunner completes successfully, but no data is sent to RedShift. I never see the message “RDB Loader successfully completed following steps: [Discover, Load, Analyze]”, only “RDB Loader successfully completed following steps: [Discover]”.
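For anyone who wants to check this automatically rather than by eye, here is a minimal Python sketch that pulls the step list out of captured EmrEtlRunner output. It assumes the message format is exactly the one quoted above; the function name `rdb_loader_steps` is made up for illustration and is not part of any Snowplow tool.

```python
import re

# Pattern for the message quoted above; the format is assumed from this
# thread's log output, not taken from the RDB Loader source.
STEPS_RE = re.compile(
    r"RDB Loader successfully completed following steps: \[([^\]]*)\]"
)

def rdb_loader_steps(log_text):
    """Return the steps named in the last matching message, or [] if none."""
    matches = STEPS_RE.findall(log_text)
    if not matches:
        return []
    return [step.strip() for step in matches[-1].split(",") if step.strip()]

sample = "RDB Loader successfully completed following steps: [Discover]"
steps = rdb_loader_steps(sample)
print(steps)            # ['Discover']
print("Load" in steps)  # False: the loader never reached the Load step
```

If "Load" is missing from the returned list, the run finished without loading anything, which is the situation described here.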

My configuration for the EMR ETL Runner:

aws:
  access_key_id: "** redacted **"
  secret_access_key: "** redacted **"
  s3:
    region: "us-west-2"
    buckets:
      assets: s3://snowplow-hosted-assets
      jsonpath_assets:
      log: "s3://snowplow-emr-log"
      encrypted: false
      enriched:
        good: "s3://snowplow-emr-enriched-good"
        archive: "s3://snowplow-emr-enriched-archive"
        stream: "s3://snowplow-stream-enriched"
      shredded:
        good: "s3://snowplow-emr-shredded-good"
        bad: "s3://snowplow-emr-shredded-bad"
        errors:
        archive: "s3://snowplow-emr-shredded-archive"
    consolidate_shredded_output: false
  emr:
    ami_version: 5.9.0
    region: "us-west-2"
    jobflow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole
    placement: "us-west-2b"
    ec2_subnet_id:
    ec2_key_name: "snowplow"
    security_configuration:
    bootstrap: []
    software:
      hbase:
      lingual:
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs:
        volume_size: 50
        volume_type: gp2
        volume_iops: 400
        ebs_optimized: false
      task_instance_count: 0
      task_instance_type: m1.medium
      task_instance_bid: 0.015
    bootstrap_failure_tries: 3
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:
collectors:
  format: "thrift"
enrich:
  versions:
    spark_enrich: 1.18.0
  continue_on_unexpected_error: false
  output_compression: GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.0
    hadoop_elasticsearch: 0.1.0
monitoring:
  tags: {}
  logging:
    level: DEBUG

My targets configuration file:

{
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-1-0",
    "data": {
        "name": "AWS Redshift enriched events storage",
        "host": "** redacted **",
        "database": "sandbox",
        "port": 5439,
        "sslMode": "DISABLE",
        "username": "** redacted **",
        "password": "** redacted **",
        "roleArn": "arn:aws:iam::** redacted **:role/RedshiftLoadRole",
        "schema": "snowplow.events",
        "maxError": 1,
        "compRows": 20000,
        "sshTunnel": null,
        "purpose": "ENRICHED_EVENTS"
    }
}

Thank you in advance for looking at this issue.

@opethian, do you get any shredded data at all? If no shredded data is produced, there’s nothing to load into Redshift.

Yes I’m seeing the shredded data.

  • Staging_Stream_Enrich Step

    • S3 bucket snowplow-stream-enriched has data.

  • Shredding Step

    • S3 bucket snowplow-emr-shredded-good has data in run=2019-06-24-16-45-20/atomic-events.

    • S3 bucket snowplow-emr-shredded-good has data in run=2019-06-24-16-45-20/shredded_types.

    • S3 bucket snowplow-emr-shredded-bad has an empty file, run=2019-06-24-16-45-20/part-00000-REDACTED.txt.

    • S3 bucket snowplow-emr-shredded-bad has an empty _SUCCESS file.

  • RDB_Load

    • RDB Loader runs successfully with this message: “RDB Loader successfully completed following steps: [Discover]”

  • Archive Enriched

    • S3 bucket snowplow-emr-enriched-archive has data in run=2019-06-24-16-45-20.

  • Archive Shredded

    • S3 bucket snowplow-emr-shredded-archive has data in run=2019-06-24-16-45-20/atomic-events.
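The "does this run folder actually contain data?" check from the list above can be sketched in Python. This assumes you already have a listing of object keys and sizes for the shredded-good bucket (for example from `aws s3 ls --recursive` or an S3 client). The helper name and the sample keys are hypothetical, and treating zero-byte files and `_SUCCESS` markers as "no data" is an assumption on my part, not something taken from the loader's source.

```python
def non_empty_data_files(listing, run_id, folder):
    """Return keys under run=<run_id>/<folder>/ that hold at least one byte.

    listing is an iterable of (key, size_in_bytes) pairs.
    Zero-byte objects and _SUCCESS markers are skipped (assumed not loadable).
    """
    prefix = f"run={run_id}/{folder}/"
    return [
        key
        for key, size in listing
        if key.startswith(prefix)
        and size > 0
        and not key.endswith("_SUCCESS")
    ]

# Hypothetical listing; the file names and sizes are invented for illustration.
listing = [
    ("run=2019-06-24-16-45-20/atomic-events/part-00000.txt", 0),
    ("run=2019-06-24-16-45-20/atomic-events/part-00001.txt", 2048),
    ("run=2019-06-24-16-45-20/atomic-events/_SUCCESS", 0),
]
print(non_empty_data_files(listing, "2019-06-24-16-45-20", "atomic-events"))
```

An empty result for the atomic-events folder would mean the run folder exists but holds nothing to load, which is worth ruling out before looking at Redshift itself.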

@opethian, I cannot see anything wrong. The only thing I spotted is the schema for the target configuration. Is your schema really snowplow.events and not just snowplow?
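A small Python sketch of that check, for anyone who wants to lint their target file before a run. It assumes the "schema" field under "data" should hold only the schema name, as the question above suggests, and it uses a simplified rule for what a bare name looks like (letters, digits, underscores). The function name `check_target_schema` is hypothetical.

```python
import json
import re

# Simplified assumption: a bare, unquoted name with no dots.
BARE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def check_target_schema(target_json):
    """Return a warning string if data.schema is not a bare name, else None."""
    schema = json.loads(target_json)["data"]["schema"]
    if BARE_NAME.match(schema):
        return None
    suggestion = schema.split(".")[0]
    return (
        f"'{schema}' does not look like a bare schema name; "
        f"did you mean '{suggestion}'?"
    )

print(check_target_schema('{"data": {"schema": "snowplow.events"}}'))
print(check_target_schema('{"data": {"schema": "snowplow"}}'))  # None
```

The first call flags snowplow.events and suggests snowplow; the second passes.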

@opethian,

Is Redshift able to read data from your S3 bucket (role/credentials)?

@ihor Good catch. My schema isn’t snowplow.events, just snowplow. However, I still don’t see any data being loaded into RedShift.

As for @grzegorzewa’s question about whether RedShift has the proper roles/credentials to read from S3: I’m currently setting up and running everything as AdminAdministrator, which is really bad, until I get everything in Snowplow set up.

My next question is where would I look for errors from AWS when there is a lack of credentials or roles?

@opethian I am running into the same problem you are. Were you able to figure out the solution? It seems like the issue might be that the shredded events are empty, and the docs say the loader will only pick up non-empty shredded events. I don’t know why they are empty, though.