Steps Elasticity S3DistCp Step: Raw Staging S3 -> Raw Archive S3

I am facing the error below while executing the Snowplow command. The EMR cluster spins up but fails at a certain step. How do I resolve this issue?
Command executed : ./snowplow-emr-etl-runner run -c /etl_runner/config/snowplow_config.yml -r /etl_runner/config/iglu_resolver.json -t /etl_runner/targets/ --debug --resume-from archive_raw

Logs

D, [2020-01-08T05:57:18.234000 #26642] DEBUG -- : Initializing EMR jobflow
D, [2020-01-08T05:57:22.599000 #26642] DEBUG -- : EMR jobflow j-DjkuhRY3Z started, waiting for jobflow to complete...
I, [2020-01-08T06:05:25.679000 #26642] INFO -- : No RDB Loader logs
F, [2020-01-08T06:05:26.141000 #26642] FATAL -- :

Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-DjkuhRY3Z failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow ETL: TERMINATING [STEP_FAILURE] ~ elapsed time n/a [2020-01-08 06:03:08 UTC - ]

    1. Elasticity S3DistCp Step: Shredded S3 -> S3 Shredded Archive: CANCELLED ~ elapsed time n/a [1970-01-01 00:00:00 UTC - ]
    2. Elasticity S3DistCp Step: Enriched S3 -> S3 Enriched Archive: CANCELLED ~ elapsed time n/a [1970-01-01 00:00:00 UTC - ]
    3. Elasticity Custom Jar Step: Load PostgreSQL enriched events storage Storage Target: CANCELLED ~ elapsed time n/a [1970-01-01 00:00:00 UTC - ]
    4. Elasticity Setup Hadoop Debugging: COMPLETED ~ 00:00:06 [2020-01-08 06:03:08 UTC - 2020-01-08 06:03:14 UTC]
    5. Elasticity S3DistCp Step: Raw Staging S3 -> Raw Archive S3: FAILED ~ 00:00:02 [2020-01-08 06:03:16 UTC - 2020-01-08 06:03:19 UTC]):
      uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:659:in `run'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
      uri:classloader:/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:109:in `run'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_reference.rb:43:in `send_to'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/call_with.rb:76:in `call_with'
      uri:classloader:/gems/contracts-0.11.0/lib/contracts/method_handler.rb:138:in `block in redefine_method'
      uri:classloader:/emr-etl-runner/bin/snowplow-emr-etl-runner:41:in `<main>'
      org/jruby/RubyKernel.java:979:in `load'
      uri:classloader:/META-INF/main.rb:1:in `<main>'
      org/jruby/RubyKernel.java:961:in `require'
      uri:classloader:/META-INF/main.rb:1:in `(root)'
      uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1:in
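
(For anyone hitting a similar failure: the real error is usually in the failed step's own stderr, which you can pull from the EMR console or with the AWS CLI. A sketch, using the cluster ID from the log above:

    # List the failed steps on the cluster, then fetch the details of the failed S3DistCp step.
    aws emr list-steps --cluster-id j-DjkuhRY3Z --step-states FAILED
    aws emr describe-step --cluster-id j-DjkuhRY3Z --step-id <step-id-from-previous-command>
)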

Hi @Vraj,

As far as I know, there are two main reasons for this kind of issue. The first is trivial: your EMR cluster does not have sufficient privileges to manage files on S3. This should be easy to debug (just verify the roles; a quick CLI check is sketched below). However, if EMR was working and suddenly stopped, this is probably not the cause.
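
A minimal way to inspect the roles with the AWS CLI, assuming the default EMR role names (adjust them if your cluster uses custom roles):

    # List the policies attached to the default EMR roles (role names are the AWS defaults).
    aws iam list-attached-role-policies --role-name EMR_EC2_DefaultRole
    aws iam list-attached-role-policies --role-name EMR_DefaultRole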

The other reason is the number and size of files. I was facing a similar issue when the number of events in the real-time pipeline passed 5M per batch. Deep in the EMR logs I was able to find "S3 Throughput exceeded - Slow Down" errors. The reason behind it was too large a number of small files. A temporary workaround was to increase the Stream-to-S3 storage file size. The final solution was to move to Dataflow Runner, where I could control the S3 copying strategy (I started joining files to get something around 30 MB). A positive side effect was a reduction of the shredding process from almost 2.5 hours to less than 20 minutes :wink: A quick check for the "many small files" pattern is sketched below.
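
A rough way to spot that pattern, with a hypothetical raw staging path (substitute your own bucket layout):

    # Count the staged objects and show their total size; thousands of KB-sized
    # files is the pattern that tends to trigger S3 "Slow Down" errors.
    aws s3 ls s3://snowplow-data/raw/processing/ --recursive --summarize --human-readable | tail -2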

Cheers,
GE

Hi @grzegorzewald: EMR has the privileges to manage the S3 files, as the roles use the default AWS managed policy, which includes the required S3 access.
Regarding the second reason you mentioned, about size: the files are currently not large at all - there are only a few of them, each just a few KB. Will this cause the step to fail? I have the same setup in another account working fine as expected.

The issue is fixed. I had to manually add the default EMR role in AWS. An absolute S3 path was also missing in my configuration, which was contributing to the issue.
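
For reference, the default EMR roles can be (re)created with the AWS CLI; this assumes the stock EMR_DefaultRole / EMR_EC2_DefaultRole names:

    # Creates EMR_DefaultRole and EMR_EC2_DefaultRole (and the instance profile) if they are missing.
    aws emr create-default-roles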

That issue is resolved, but a new one has cropped up. I have an RDS instance as a target. The rdb_loader step works fine and the cluster terminates successfully, but the data is not loaded into the database.
On checking the cluster logs for RDB Loader, it says "RDB Loader successfully completed following steps: [Discover, Analyze]".

D, [2020-01-08T10:00:48.705000 #27212] DEBUG -- : Initializing EMR jobflow
D, [2020-01-08T10:00:55.884000 #27212] DEBUG -- : EMR jobflow j-1gfgdgh1R3JZ5B started, waiting for jobflow to complete...
I, [2020-01-08T10:16:59.866000 #27212] INFO -- : RDB Loader logs
D, [2020-01-08T10:16:59.871000 #27212] DEBUG -- : Downloading s3://snowplow-data/etl_logs/rdb-loader/2020-01-08-10-00-48/1e2302c8b2gbnhf7c-2192-4lljjkljyyu62e-8a2 to /tmp/rdbloader20201023530108-27212-12w4n00001475
I, [2020-01-08T10:17:02.428000 #27212] INFO -- : PostgreSQL enriched events storage
I, [2020-01-08T10:17:02.429000 #27212] INFO -- : RDB Loader successfully completed following steps: [Discover, Analyze]
D, [2020-01-08T10:17:02.430000 #27212] DEBUG -- : EMR jobflow j-1gfgdgh1R3JZ5B completed successfully.
I, [2020-01-08T10:17:02.431000 #27212] INFO -- : Completed successfully
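
A minimal way to confirm whether anything actually reached the target, assuming the standard Snowplow atomic.events table (host, user, and database below are placeholders):

    # Count the rows that have landed in the Snowplow events table.
    psql -h <rds-endpoint> -U <user> -d <database> -c "SELECT count(*) FROM atomic.events;"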

Just to be precise:
The size of the files is not the issue - the issue is the number of files. From S3's point of view, it is better to handle a limited number of large files than a preponderance of small ones.

OK, understood @grzegorzewald, thanks.
Can you also help me with the new issue? The command runs successfully, but the data does not get loaded into the Postgres database, and no failure logs are generated.

You need to verify your archive S3 folder then - most likely all the data falls into the bad bucket during enrichment.

I have checked the folders and I can see the files there. The problem is that my Postgres data-loading step executes successfully, but when I check the data in the database, it isn't loaded.
I don't think there is any network issue with the database, otherwise the step itself should have failed.
Any leads on how to check or resolve this issue?
It will be much appreciated if someone can help out here.

The first thing to check is whether the loader reports that it successfully loaded 0 events out of 0 good ones, or 0 events out of some non-zero number of good ones (you can pull the full RDB Loader log from S3 and grep it - see the sketch below).
And yes, you are correct - if there were any network issues, the rdb_load step would fail.
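
One way to pull the full RDB Loader log for the run and see what it reported; the S3 path is the one shown in the DEBUG line earlier, and the grep pattern is just a starting point:

    # Download the RDB Loader log for the run and scan it for load-related lines.
    aws s3 cp s3://snowplow-data/etl_logs/rdb-loader/2020-01-08-10-00-48/1e2302c8b2gbnhf7c-2192-4lljjkljyyu62e-8a2 ./rdb-loader.log
    grep -i "load" ./rdb-loader.log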

Hey @grzegorzewald, where do I check that info about successfully loaded events?

In the archive S3 bucket, you should have good and bad events per run. Verify whether there are any events in good; something like the sketch below will show you.
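
A rough check, with hypothetical archive paths (adjust to your bucket layout and run ID):

    # Compare how much data landed in good vs. bad for a given run.
    aws s3 ls s3://snowplow-data/archive/enriched/good/run=2020-01-08-10-00-48/ --recursive --summarize | tail -2
    aws s3 ls s3://snowplow-data/archive/enriched/bad/run=2020-01-08-10-00-48/ --recursive --summarize | tail -2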

OK, sure. I can see files, but a few are empty, and in the enriched bad folder there are a few files. I am still looking for a solution as to why the data is not being loaded into the database, even though every Snowplow run completes successfully.

Hello @Vraj,

We'll be looking into this issue, but as an important side note: Postgres support in RDB Loader has always been experimental. We do plan to support it properly, but that hasn't happened yet. Even if we manage to resolve this problem, only your events table will be populated - no contexts nor self-describing events will be loaded, and those entities are what 99% of our users need. Therefore I encourage you to consider Redshift (or Snowflake or BigQuery).