EmrEtlRunner not loading data into Redshift

@opethian, do you get any shredded data at all? If no shredded data is produced, there’s nothing to load into Redshift.

Yes, I’m seeing the shredded data.

  • Staging_Stream_Enrich Step

    • S3 bucket snowplow-stream-enriched has data.

  • Shredding Step

    • S3 bucket snowplow-emr-shredded-good has data in run=2019-06-24-16-45-20/atomic-events.

    • S3 bucket snowplow-emr-shredded-good has data in run=2019-06-24-16-45-20/shredded_types.

    • S3 bucket snowplow-emr-shredded-bad has a file run=2019-06-24-16-45-20/part-00000-REDACTED.txt, which is empty.

    • S3 bucket snowplow-emr-shredded-bad has a _SUCCESS file, which is empty.

  • RDB_Load

    • RDB Loader runs successfully with this message: “RDB Loader successfully completed following steps: [Discover]”

  • Archive Enriched

    • S3 bucket snowplow-emr-enriched-archive has data in run=2019-06-24-16-45-20.

  • Archive Shredded

    • S3 bucket snowplow-emr-shredded-archive has data in run=2019-06-24-16-45-20/atomic-events.

@opethian, I cannot see anything wrong. The only thing I spotted is the schema for the target configuration. Is your schema really snowplow.events and not just snowplow?

@opethian,

Is Redshift able to read data from your S3 bucket (role/credentials)?

@ihor Good catch. My schema isn’t snowplow.events, just snowplow; however, I still don’t see any data being loaded into Redshift.

As for @grzegorzewa’s question about whether Redshift has the proper roles/credentials to read from S3: I’m currently setting up and running everything as AdminAdministrator (which is really bad) until I get everything in Snowplow set up.

My next question is: where would I look for errors from AWS when credentials or roles are missing?
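
For anyone searching later: a missing role or credential usually surfaces as a failed EMR step, and the details land in the step logs rather than in EmrEtlRunner’s own output. A rough sketch of where to look - the cluster ID and log bucket below are placeholders for the ones in your EMR console and EmrEtlRunner config:

    # List the jobflow's steps and their states (failed steps include a failure reason)
    aws emr list-steps --cluster-id j-XXXXXXXXXXXXX

    # Each step's controller/stderr/stdout/syslog files are written to the EMR log bucket
    aws s3 ls --recursive s3://YOUR-EMR-LOG-BUCKET/j-XXXXXXXXXXXXX/steps/

The stderr of the rdb_load step is usually the most informative one for load problems.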

@opethian I am running into the same problem you are… Were you able to figure out the solution? It seems like the issue might be that the shredded events are empty, and the docs say the loader will only pick up non-empty shredded events… I don’t know why they are empty, though.
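
A quick way to test that theory is to list the shredded good output for a run and check the part file sizes. A sketch, using the bucket and run ID mentioned earlier in this thread - substitute your own:

    # Show everything the shred step produced for one run, with sizes
    aws s3 ls --recursive --human-readable \
      s3://snowplow-emr-shredded-good/run=2019-06-24-16-45-20/

If atomic-events/ contains only zero-byte part files, there is indeed nothing for the loader to pick up.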

@darussian No, I’m still experiencing the same problem: the RDB Loader reports “RDB Loader successfully completed following steps: [Discover]”, but no data is loaded into Redshift.

I have gone down the road of checking permissions, which I thought was a possible cause, to the point of complete bafflement.

@ihor based on the schema you mentioned, I created the events table in Redshift under the schema “atomic”.
So should I put “atomic.events” in redshift.json?

thanks!

@AllenWeieiei, no, it is always the events table for core Snowplow data. That is why the target configuration file has a schema property for the schema name only. In your case, it will be just “atomic”.
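
For reference, a minimal sketch of a Redshift target file with that property. The field names follow the redshift_config Iglu schema, but the exact schema version and full set of required fields depend on your RDB Loader release, so check the sample config that ships with it; all values below are placeholders:

    {
      "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/3-0-0",
      "data": {
        "name": "AWS Redshift enriched events storage",
        "host": "your-cluster.redshift.amazonaws.com",
        "database": "snowplow",
        "port": 5439,
        "username": "storageloader",
        "password": "REDACTED",
        "roleArn": "arn:aws:iam::123456789012:role/RedshiftLoadRole",
        "schema": "atomic",
        "maxError": 1,
        "purpose": "ENRICHED_EVENTS"
      }
    }

Note the two different meanings of “schema” here: the top-level one is the Iglu URI describing the config itself, while “schema” inside “data” is the database schema - just “atomic”, not “atomic.events”.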

Hi, @ihor

I think I am facing a similar issue to this one. Without redshift.json, everything looks good: I get files in the shredded bucket and the runner completes successfully.

After adding redshift.json and changing the command, the files from the S3 Loader bucket seem to be shredded fine (I can find them in the S3 shredded bucket under the good folder, and all subfolders are created successfully). But the RDB Loader is not working properly. I get the following error: packet_write_wait: Connection to 18.205.53.235 port 2225: Broken pipe

I lose the SSH connection to the server after the runner has been going for some time. Do you have any advice?

Thanks!

@ihor
Looks like it was fixed after adding a server alive interval.
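
For anyone else hitting the packet_write_wait / Broken pipe issue above: the fix is the standard SSH keep-alive settings on the machine you connect from, e.g. in ~/.ssh/config (the host alias and values here are just examples):

    # ~/.ssh/config on the machine you SSH from
    Host snowplow-runner
        # send a keep-alive every 60 seconds; give up only after 10 missed replies
        ServerAliveInterval 60
        ServerAliveCountMax 10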

I got a new error:
F, [2019-10-25T18:32:00.513819 #12727] FATAL -- :

Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-31G1GGSLCYVQY failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
j-31G1GGSLCYVQY: TERMINATING [STEP_FAILURE] ~ elapsed time n/a [2019-10-25 18:19:06 +0000 - ]

    1. [staging_stream_enrich] s3-dist-cp: Stream Enriched s3://enrich-s3-loader/ -> Enriched Staging S3: COMPLETED ~ 00:02:02 [2019-10-25 18:19:08 +0000 - 2019-10-25 18:21:11 +0000]
    2. [shred] s3-dist-cp: Enriched S3 -> HDFS: COMPLETED ~ 00:01:50 [2019-10-25 18:21:13 +0000 - 2019-10-25 18:23:03 +0000]
    3. [shred] spark: Shred Enriched Events: COMPLETED ~ 00:03:16 [2019-10-25 18:23:05 +0000 - 2019-10-25 18:26:22 +0000]
    4. [shred] s3-dist-cp: Shredded HDFS -> S3: COMPLETED ~ 00:01:38 [2019-10-25 18:26:24 +0000 - 2019-10-25 18:28:02 +0000]
    5. [shred] s3-dist-cp: Shredded HDFS _SUCCESS -> S3: COMPLETED ~ 00:01:24 [2019-10-25 18:28:04 +0000 - 2019-10-25 18:29:28 +0000]
    6. [rdb_load] Load SendGrid-Target Storage Target: FAILED ~ 00:01:09 [2019-10-25 18:29:35 +0000 - 2019-10-25 18:30:44 +0000]
    7. [archive_shredded] s3-dist-cp: Shredded S3 -> Shredded Archive S3: CANCELLED ~ elapsed time n/a [ - ]
    8. [archive_enriched] s3-dist-cp: Enriched S3 -> Enriched Archive S3: CANCELLED ~ elapsed time n/a [ - ]):
      uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:799:in `run'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
      uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:138:in `run'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
      uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:41:in `<main>'
      org/jruby/RubyKernel.java:994:in `load'
      uri:classloader:/META-INF/main.rb:1:in `<main>'
      org/jruby/RubyKernel.java:970:in `require'
      uri:classloader:/META-INF/main.rb:1:in `(root)'
      uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in

Does this mean I need to look at the Hadoop logs?

@AllenWeieiei, as the logs suggest, the failure took place while loading the data into Redshift. You need to investigate it further. I suspect your permissions are not set up correctly. Check the EMR logs and see if more info on the failure can be found.
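
If the EMR logs show the loader got as far as issuing a COPY, Redshift’s stl_load_errors system table is another place to look. A sketch with placeholder connection details (Redshift speaks the Postgres protocol, so psql works):

    # Show the most recent COPY failures recorded by the cluster
    psql "host=your-cluster.redshift.amazonaws.com port=5439 dbname=snowplow user=admin" \
      -c "SELECT starttime, filename, err_reason FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;"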

thank you so much!


Hi @ihor,

When I check the logs as the instructions said, the job shut down because a step failed. The screenshot shows that the loading step failed, and I checked its syslog; the only thing there is a WARN about credentials. Do you have any idea about this? In my redshift.json, the username and password are for the master user of the target cluster, and the roleArn points to a role which has full access to my S3 and Redshift.

The WARN says I need to move a file; could you please provide more info about that?

Much appreciated!

@ihor

The WARN mentions a conf file at a specific location. Is that location created by EmrEtlRunner? Is it configurable on my end?

Thanks!

@AllenWeieiei, that warning is not Snowplow related. It is related to how you use the AWS CLI to run the pipeline. Indeed, the AWS CLI expects your credentials file to be located at ~/.aws/credentials: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html.

Having said that, it is just a warning and the job does run. You need to find out why it fails at the data load step.
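
For completeness, the credentials file the AWS CLI is looking for has this shape (the keys below are placeholders):

    # ~/.aws/credentials
    [default]
    aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
    aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx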

OK, thanks!

@ihor, when I read the article about connecting to Redshift, I had a question. Even if I don’t use SQL Workbench to connect to it, does the server (where I am running the collector, enrich, loader, and runner) require a JDBC driver to be installed?

@ihor, hello.

I got the process working and data loaded into Redshift. I also set up the collector and enrich to run under supervisor, so now I have data in the collector stream and the enriched stream whenever new data comes in. When I checked the instructions, I only found some steps about scheduling the EmrEtlRunner, so I believe the EmrEtlRunner is meant to be scheduled to run periodically.

What about the S3 Loader? Does it need a schedule or some other setting?

Thanks!

@AllenWeieiei, indeed, EmrEtlRunner needs to be scheduled; we just set up cron ourselves, but you can use whatever is more appropriate for you.
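
As an illustration only - the paths and schedule are placeholders, and the run / -c / -r / -t options should be checked against the EmrEtlRunner version you use - a crontab entry might look like:

    # m h dom mon dow  command: kick off the batch pipeline once an hour
    5 * * * * /opt/snowplow/snowplow-emr-etl-runner run -c /opt/snowplow/config/config.yml -r /opt/snowplow/config/iglu_resolver.json -t /opt/snowplow/config/targets >> /var/log/snowplow/emr-etl-runner.log 2>&1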

S3 Loader is a real-time component - a subscriber to the Kinesis stream - and as such it is “always on” once started. You do need to configure when the streamed events are uploaded to S3 for batch processing (shredding and loading into Redshift). That is achieved with the buffer properties in the S3 Loader configuration file, sketched below.
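
The flush behaviour is controlled by the buffer section of that file - events are written to S3 as soon as any one of the limits is hit. A sketch (the key names are as I recall them from the S3 Loader example config, so verify them against your version; the numbers are arbitrary):

    buffer {
      byteLimit = 4500000   # flush once this many bytes are buffered...
      recordLimit = 500     # ...or this many records...
      timeLimit = 60000     # ...or after this many milliseconds
    }

Smaller limits give you fresher files in S3, at the cost of more and smaller objects for the shred job to pick up.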

Hi @ihor,

When I work with the DevOps team, we want to minimize resources, risk, etc… A question about the process we have right now: what do the two enrich steps (Stream Enrich after the collector, and the enrich inside EmrEtlRunner) really do in the process? Are they adding new fields? Increasing the number of records?

Thanks!