Elasticity Spark Step: Enrich Raw Events never ends

I stuck in there for a looong time

on the console I see

root@snowplow:~/emr-etl# ./snowplow-emr-etl-runner --config config/config.yml --targets config/targets/ --resolver resolver.json --enrichments file:./enrichments
D, [2017-08-16T19:19:06.674000 #7598] DEBUG -- : Staging raw logs...
  moving files from s3://elasticbeanstalk-us-east-1-768491813393/resources/environments/logs/publish/e-gepyppfmhg/ to s3://ntech-snoplow-data/processing/
(t0)    MOVE elasticbeanstalk-us-east-1-768491813393/resources/environments/logs/publish/e-gepyppfmhg/i-03c30b5f5f2a0d9e9/_var_log_tomcat8_rotated_localhost_access_log.txt1502910061.gz -> ntech-snoplow-data/processing/var_log_tomcat8_rotated_localhost_access_log.2017-08-16-19.us-east-1.i-03c30b5f5f2a0d9e9.txt.gz
      +-> ntech-snoplow-data/processing/var_log_tomcat8_rotated_localhost_access_log.2017-08-16-19.us-east-1.i-03c30b5f5f2a0d9e9.txt.gz
      x elasticbeanstalk-us-east-1-768491813393/resources/environments/logs/publish/e-gepyppfmhg/i-03c30b5f5f2a0d9e9/_var_log_tomcat8_rotated_localhost_access_log.txt1502910061.gz
D, [2017-08-16T19:19:15.482000 #7598] DEBUG -- : Waiting a minute to allow S3 to settle (eventual consistency)
D, [2017-08-16T19:20:15.487000 #7598] DEBUG -- : Initializing EMR jobflow
D, [2017-08-16T19:20:17.956000 #7598] DEBUG -- : EMR jobflow j-3EYT62SEUUZQN started, waiting for jobflow to complete...
I, [2017-08-16T19:20:17.976000 #7598]  INFO -- : SnowplowTracker::Emitter initialized with endpoint http://collector.namastetech.me:80/i
I, [2017-08-16T19:20:18.342000 #7598]  INFO -- : Attempting to send 1 request
I, [2017-08-16T19:20:18.346000 #7598]  INFO -- : Sending GET request to http://collector.namastetech.me:80/i...
I, [2017-08-16T19:20:18.391000 #7598]  INFO -- : GET request to http://collector.namastetech.me:80/i finished with status code 200

stderr logs shows lines like

17/08/16 19:31:47 INFO Client: Application report for application_1502911438395_0001 (state: RUNNING)
17/08/16 19:31:48 INFO Client: Application report for application_1502911438395_0001 (state: RUNNING)
17/08/16 19:31:50 INFO Client: Application report for application_1502911438395_0001 (state: RUNNING)
17/08/16 19:31:51 INFO Client: Application report for application_1502911438395_0001 (state: RUNNING)
17/08/16 19:31:52 INFO Client: Application report for application_1502911438395_0001 (state: RUNNING)
17/08/16 19:31:53 INFO Client: Application report for application_1502911438395_0001 (state: RUNNING)
17/08/16 19:31:54 INFO Client: Application report for application_1502911438395_0001 (state: RUNNING)
17/08/16 19:31:55 INFO Client: Application report for application_1502911438395_0001 (state: RUNNING)
17/08/16 19:31:56 INFO Client: Application report for application_1502911438395_0001 (state: RUNNING)
17/08/16 19:31:57 INFO Client: Application report for application_1502911438395_0001 (state: RUNNING)
17/08/16 19:31:58 INFO Client: Application report for application_1502911438395_0001 (state: RUNNING)
17/08/16 19:31:59 INFO Client: Application report for application_1502911438395_0001 (state: RUNNING)
17/08/16 19:32:00 INFO Client: Application report for application_1502911438395_0001 (state: RUNNING)

and it never ends

I barely have few tomcat logs to process so not sure what could be happening

any ideas?

thanks in advance

@jpdc

Which instance type are you using?

Saw in the other thread that you’re using m1.medium.

This instance type doesn’t seem to work for Spark most likely because of the number of virtual CPUs (1).

What’s puzzling is that YARN still accepts the request for resources.

Core node instance must be m4.2Xlarge are greater than m1.medium atleast

I’m seeing the same symptom (neverending spark enrichment), and I’m using 1x c3.xlarge to process ~ 1.5GB collector logs. CPU and disk utilization on the EMR instances is nearly zero after a brief burst at the job start. I’m skeptical that it could be an issue with the instance type, but I’m at a loss for other explanations.

@alex / team-snowplow – any suggestions for the above? I’m running EmrEtlRunner r97, and I’ve tried both spark_enrich 1.11.0 and 1.12.0.

This is in furtherance of upgrading from r77, which stopped working a while back due to bitrot of at least one remote dependency jar required by the EMR bootstrap.

@alexc-sigfig,

Not sure that it’s the same. In @jpdc example, I can see he ran EmrEtlRunner without run command which is an indication of the app prior to R91 when it was introduced.

We could also observe an ever running enrichment but it’s rather an exceptional case (happens very rear). Not much could be done but have the job terminated and resumed from the appropriate step.

For 1.5GB compressed logs, we use up to 1 x r4.8xlarge core instance; if that’s the size of uncompressed logs 1 x r4.xlarge would be sufficient.

In either case, a fine-tuned Spark settings are required for improved performance.

I don’t believe what I’m seeing is strictly a performance issue; if c3.xlarge is running out of memory and the job is dying, disappointing that EMR can’t tell. htop shows all processes but itself as suspended (I’ve also strace'd looking for activity without success). I’ll try with a larger instance, though… :crossed_fingers:

r4.8xlarge progressed better, so appears that hang was due to under-resourcing. Unfortunately, it didn’t succeed, but that’s an issue for another thread. Thanks for the tip, @ihor.