EMR running slow on m5.xlarge AMI 6.5.0 Shredder 2.1.0

Hi,

we are facing some problems with the shredder again. We recently updated shredder and rdbloader to 2.1.0 and also bumped the AMI version to 6.5.0, because that was a requirement from our IT security.

The shredder job usually runs for about 20-30 minutes. At some point it starts slowing down considerably (see screenshot).

The influx of data is constant and rather small (less than 1 GB per day).

Our current emr-config looks like this:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data": {
    "name": "com.oneapp",
    "logUri": "LOGURI",
    "region": "eu-west-1",
    "credentials": {
      "accessKeyId": "AWS_ACCESS_KEY_ID",
      "secretAccessKey": "AWS_SECRET_ACCESS_KEY"
    },
    "roles": {
      "jobflow": "EMR_EC2_DefaultRole",
      "service": "EMR_DefaultRole"
    },
    "ec2": {
      "amiVersion": "6.5.0",
      "instances": {
          "core": {
              "count": 1,
              "type": "r5.xlarge"
          },
          "master": {
              "ebsConfiguration": {
                  "ebsBlockDeviceConfigs": [],
                  "ebsOptimized": true
              },
              "type": "m5.xlarge"
          },
          "task": {
              "bid": "0.015",
              "count": 0,
              "type": "m5.xlarge"
          }
      },
      "keyName": "EMR_ECS_KEY_PAIR",
      "location": {
          "vpc": {
              "subnetId": "AWS_SUBNET_PUBLIC_ID"
          }
      }
    },
    "tags": [
      {
        "key": "client",
        "value": "com.oneapp"
      },
      {
        "key": "job",
        "value": "main"
      },
      {
        "key": "GITC-VulnScanTool",
        "value": "tenable_io"
      }
    ],
    "bootstrapActionConfigs": [],
    "configurations": [
      {
        "classification": "spark",
        "configurations": [],
        "properties": {
            "maximizeResourceAllocation": "false"
        }
      },
      {
        "classification": "spark-defaults",
        "configurations": [],
        "properties": {
            "spark.default.parallelism": "8",
            "spark.driver.maxResultSize": "0",
            "spark.driver.cores": "1",
            "spark.driver.memory": "9G",
            "spark.dynamicAllocation.enabled": "false",
            "spark.executor.cores": "1",
            "spark.executor.instances": "2",
            "spark.executor.memory": "9G",
            "spark.yarn.driver.memoryOverhead": "1024",
            "spark.yarn.executor.memoryOverhead": "1024"
        }
      },
      {
        "classification": "yarn-site",
        "configurations": [],
        "properties": {
            "yarn.nodemanager.resource.memory-mb": "24576",
            "yarn.nodemanager.vmem-check-enabled": "false",
            "yarn.scheduler.maximum-allocation-mb": "24576"
        }
      }
    ],
    "applications": [ "Hadoop", "Spark" ]
  }
}

What causes the shredder to slow down so much?

Are you able to share the monitoring I/O? It may be running out of memory or disk.
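
As a rough back-of-the-envelope check against the posted spark-defaults (just my reading of the config, assuming the single r5.xlarge core node and cluster deploy mode shown in your setup):

    driver container:    9216 MB (spark.driver.memory)   + 1024 MB overhead = 10240 MB
    executor container:  9216 MB (spark.executor.memory) + 1024 MB overhead = 10240 MB each
    requested total:     1 driver + 2 executors          = 30720 MB
    node capacity:       yarn.nodemanager.resource.memory-mb = 24576 MB

If that arithmetic is right, the driver plus only one of the two requested executors fits on the core node, so the job may effectively be running on a single executor with a single core.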

Hi Mike,

here is the I/O monitoring, as well as the stderr and controller output:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/emr/emrfs/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/emr/emrfs/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/aws/redshift/jdbc/redshift-jdbc42-1.2.37.1061.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
22/02/16 05:50:03 WARN SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
22/02/16 05:50:03 WARN SparkConf: The configuration key 'spark.yarn.driver.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.driver.memoryOverhead' instead.
22/02/16 05:50:03 WARN SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
22/02/16 05:50:03 WARN SparkConf: The configuration key 'spark.yarn.driver.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.driver.memoryOverhead' instead.
22/02/16 05:50:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/02/16 05:50:04 WARN DependencyUtils: Skip remote jar s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-2.1.0.jar.
22/02/16 05:50:04 INFO RMProxy: Connecting to ResourceManager at ip-11-222-232-22.eu-west-1.compute.internal/11.222.232.22:8032
22/02/16 05:50:04 INFO Client: Requesting a new application from cluster with 1 NodeManagers
22/02/16 05:50:04 INFO Configuration: resource-types.xml not found
22/02/16 05:50:04 INFO ResourceUtils: Unable to find 'resource-types.xml'.
22/02/16 05:50:04 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (24576 MB per container)
22/02/16 05:50:04 INFO Client: Will allocate AM container, with 10240 MB memory including 1024 MB overhead
22/02/16 05:50:04 INFO Client: Setting up container launch context for our AM
22/02/16 05:50:04 INFO Client: Setting up the launch environment for our AM container
22/02/16 05:50:04 INFO Client: Preparing resources for our AM container
22/02/16 05:50:04 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
22/02/16 05:50:07 INFO Client: Uploading resource file:/mnt/tmp/spark-9520bfdb-47a1-426e-b91d-1e94aee590ac/__spark_libs__1394439949027560815.zip -> hdfs://ip-11-222-232-22.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1644986589982_0002/__spark_libs__1394439949027560815.zip
22/02/16 05:50:09 INFO ClientConfigurationFactory: Set initial getObject socket timeout to 2000 ms.
22/02/16 05:50:09 INFO Client: Uploading resource s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-2.1.0.jar -> hdfs://ip-11-222-232-22.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1644986589982_0002/snowplow-rdb-shredder-2.1.0.jar
22/02/16 05:50:10 INFO S3NativeFileSystem: Opening 's3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-2.1.0.jar' for reading
22/02/16 05:50:11 INFO Client: Uploading resource file:/mnt/tmp/spark-9520bfdb-47a1-426e-b91d-1e94aee590ac/__spark_conf__5175405484479654466.zip -> hdfs://ip-11-222-232-22.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1644986589982_0002/__spark_conf__.zip
22/02/16 05:50:12 INFO SecurityManager: Changing view acls to: hadoop
22/02/16 05:50:12 INFO SecurityManager: Changing modify acls to: hadoop
22/02/16 05:50:12 INFO SecurityManager: Changing view acls groups to: 
22/02/16 05:50:12 INFO SecurityManager: Changing modify acls groups to: 
22/02/16 05:50:12 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
22/02/16 05:50:12 INFO Client: Submitting application application_1644986589982_0002 to ResourceManager
22/02/16 05:50:12 INFO YarnClientImpl: Submitted application application_1644986589982_0002
22/02/16 05:50:13 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:13 INFO Client: 
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1644990612187
	 final status: UNDEFINED
	 tracking URL: http://ip-11-222-232-22.eu-west-1.compute.internal:20888/proxy/application_1644986589982_0002/
	 user: hadoop
22/02/16 05:50:14 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:15 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:16 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:17 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:18 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:19 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:20 INFO Client: Application report for application_1644986589982_0002 (state: ACCEPTED)
22/02/16 05:50:21 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:21 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: ip-11-222-232-20.eu-west-1.compute.internal
	 ApplicationMaster RPC port: 43267
	 queue: default
	 start time: 1644990612187
	 final status: UNDEFINED
	 tracking URL: http://ip-11-222-232-22.eu-west-1.compute.internal:20888/proxy/application_1644986589982_0002/
	 user: hadoop
22/02/16 05:50:22 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:23 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:24 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:25 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:26 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:27 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:28 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:29 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:30 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:31 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:32 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:33 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)
22/02/16 05:50:34 INFO Client: Application report for application_1644986589982_0002 (state: RUNNING)

controller:

2022-02-16T05:50:01.974Z INFO Ensure step 2 jar file command-runner.jar
2022-02-16T05:50:01.974Z INFO StepRunner: Created Runner for step 2
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --class com.snowplowanalytics.snowplow.rdbloader.shredder.batch.Main --master yarn --deploy-mode cluster s3://snowplow-hosted-assets-eu-central-1/4-storage/rdb-shredder/snowplow-rdb-shredder-2.1.0.jar --iglu-config ewogICJzY2hlbWEiOiAiaWdsdTpjb20uc25vd3Bsb3dhbmFseXRpY3MuaWdsdS9yZXNvbHZlci1jb25maWcvanNvbnNjaGVtYS8xLTAtMCIsCiAgImRhdGEiOiB7CiAgICAiY2FjaGVTaXplIjogNTAwLAogICAgInJlcG9zaXRvcmllcyI6IFsKICAgICAgewogICAgICAgICJuYW1lIjogIlMzLXNjaGVtYXMtcmVnaXN0cnkiLAogICAgICAgICJwcmlvcml0eSI6IDAsCiAgICAgICAgInZlbmRvclByZWZpeGVzIjogWyJjb20ub25lYXBwIl0sCiAgICAgICAgImNvbm5lY3Rpb24iOiB7CiAgICAgICAgICAiaHR0cCI6IHsKICAgICAgICAgICAgInVyaSI6ICJodHRwczovL2QzMmlwbGFubndzb3pjLmNsb3VkZnJvbnQubmV0IgogICAgICAgICAgfQogICAgICAgIH0KICAgICAgfSwKICAgICAgewogICAgICAgICJuYW1lIjogIklnbHUgQ2VudHJhbCIsCiAgICAgICAgInByaW9yaXR5IjogMSwKICAgICAgICAidmVuZG9yUHJlZml4ZXMiOiBbICJjb20uc25vd3Bsb3dhbmFseXRpY3MiIF0sCiAgICAgICAgImNvbm5lY3Rpb24iOiB7CiAgICAgICAgICAiaHR0cCI6IHsKICAgICAgICAgICAgInVyaSI6ICJodHRwOi8vaWdsdWNlbnRyYWwuY29tIgogICAgICAgICAgfQogICAgICAgIH0KICAgICAgfSwKICAgICAgewogICAgICAgICJuYW1lIjogIklnbHUgQ2VudHJhbCAtIE1pcnJvciAwMSIsCiAgICAgICAgInByaW9yaXR5IjogMiwKICAgICAgICAidmVuZG9yUHJlZml4ZXMiOiBbImNvbS5zbm93cGxvd2FuYWx5dGljcyJdLAogICAgICAgICJjb25uZWN0aW9uIjogewogICAgICAgICAgImh0dHAiOiB7CiAgICAgICAgICAgICJ1cmkiOiAiaHR0cDovL21pcnJvcjAxLmlnbHVjZW50cmFsLmNvbSIKICAgICAgICAgIH0KICAgICAgICB9CiAgICAgIH0KICAgIF0KICB9Cn0K --config ewogICMgUGF0aCB0byBlbnJpY2hlZCBhcmNoaXZlIChtdXN0IGJlIHBvcHVsYXRlZCBzZXBhcmF0ZWx5IHdpdGggcnVuPVlZWVktTU0tREQtaGgtbW0tc3MgZGlyZWN0b3JpZXMpIGZvciBTMyBpbnB1dAogICJpbnB1dCI6ICJzM246Ly9zcC1lbnJpY2hlZC0yLTEtMC1naXRjLWxpdmUvYXJjaGl2ZS8iLAoKICAjIFBhdGggdG8gc2hyZWRkZWQgYXJjaGl2ZQogICJvdXRwdXQiOiB7CiAgICAjIFBhdGggdG8gc2hyZWRkZWQgb3V0cHV0CiAgICAicGF0aCI6ICJzM246Ly9zcC1zaHJlZGRlZC0yLTEtMC1naXRjLWxpdmUvYXJjaGl2ZS8iLAogICAgIyBTaHJlZGRlciBvdXRwdXQgY29tcHJlc3Npb24sIEdaSVAgb3IgTk9ORQogICAgIyBPcHRpb25hbCwgZGVmYXVsdCB2YWx1ZSBHWklQCiAgICAiY29tcHJlc3Npb24iOiAiR1pJUCIsCiAgICAjIFRoaXMgZmllbGQgaXMgb3B0aW9uYWwgaWYgaXQgY2FuIGJlIHJlc29sdmVkIHdpdGggQVdTIHJlZ2lvbiBwcm92aWRlciBjaGFpbi4KICAgICMgSXQgY2hlY2tzIHBsYWNlcyBsaWtlIGVudiB2YXJpYWJsZXMsIHN5c3RlbSBwcm9wZXJ0aWVzLCBBV1MgcHJvZmlsZSBmaWxlLgogICAgIyBodHRwczovL3Nkay5hbWF6b25hd3MuY29tL2phdmEvYXBpL2xhdGVzdC9zb2Z0d2FyZS9hbWF6b24vYXdzc2RrL3JlZ2lvbnMvcHJvdmlkZXJzL0RlZmF1bHRBd3NSZWdpb25Qcm92aWRlckNoYWluLmh0bWwKICAgICJyZWdpb24iOiAiZXUtd2VzdC0xIgogIH0KCiAgIyBRdWV1ZSB1c2VkIHRvIGNvbW11bmljYXRlIHdpdGggTG9hZGVyCiAgInF1ZXVlIjogewogICAgIyBUeXBlIG9mIHRoZSBxdWV1ZS4gSXQgY2FuIGJlIGVpdGhlciBzcXMgb3Igc25zCiAgICJ0eXBlIjogInNxcyIsCiAgICAjIE5hbWUgb2YgdGhlIHNxcyBxdWV1ZQogICAgInF1ZXVlTmFtZSI6ICJzcC1zcXMtcXVldWUtZ2l0Yy1saXZlLmZpZm8iLAogICAgIyBSZWdpb24gb2YgdGhlIFNRUyBxdWV1ZS4KICAgICMgT3B0aW9uYWwgaWYgaXQgY2FuIGJlIHJlc29sdmVkIHdpdGggQVdTIHJlZ2lvbiBwcm92aWRlciBjaGFpbi4KICAgICJyZWdpb24iOiAiZXUtd2VzdC0xIgogIH0KICAjIFNOUyBleGFtcGxlOgogIyAicXVldWUiOiB7CiAgICAjIFR5cGUgb2YgdGhlIHF1ZXVlLiBJdCBjYW4gYmUgZWl0aGVyIHNxcyBvciBzbnMKICAgICMidHlwZSI6ICJzbnMiLAogICAgIyBBUk4gb2YgU05TIHRvcGljCiAgICAjInRvcGljQXJuIjogImFybjphd3M6c25zOmV1LXdlc3QtMToxNDI4OTIyMjgwNjY6c3Atc25zLXRvcGljLWdpdGMtbGl2ZS5maWZvIiwKICAgICMgUmVnaW9uIG9mIHRoZSBTTlMgdG9waWMKICAgICMicmVnaW9uIjogImV1LXdlc3QtMSIKICAjfQoKICAjIENvbmZpZ3VyZSB0aGUgd2F5IGluLWJhdGNoIGRlZHVwbGljYXRpb24gaXMgcGVyZm9ybWVkCiAgImRlZHVwbGljYXRpb24iOiB7CiAgICAjIFN5bmV0aHRldGljIGRlZHVwbGljYXRpb24gcmVhc3NpZ25zIG5ldyBpZHMgdG8gZXZlbnRzIHdpdGggc2FtZSBpZC1maW5nZXJwcmludHQgcGFpcgogICAgIyBEaWZmZXJlbnQgb3B0aW9ucyBjYW
4gYmUgdHJpZWQgaWYgc3ludGhldGljIGRlZHVwbGljYXRpb24gYWZmZWN0cyBwZXJmb3JtYW5jZQogICAgInN5bnRoZXRpYyI6IHsKICAgICAgIyBDYW4gYmUgTk9ORSAoZGlzYWJsZSksIEJST0FEQ0FTVCBhbmQgSk9JTiAoZGlmZmVyZW50IGxvdy1sZXZlbCBpbXBsZW1lbnRhdGlvbnMpCiAgICAgICJ0eXBlIjogIkJST0FEQ0FTVCIKICAgICAgIyBEbyBub3QgZGVkdXBsaWNhdGUgcGFpcnMgd2l0aCBsZXNzLW9yLWVxdWFsIGNhcmRpbmFsaXR5CiAgICAgICJjYXJkaW5hbGl0eSI6IDEKICAgIH0KICB9CgogICMgU2NoZW1hLXNwZWNpZmljIGZvcm1hdCBzZXR0aW5ncyAocmVjb21tZW5kZWQgdG8gbGVhdmUgYWxsIHRocmVlIGdyb3VwcyBlbXB0eSBhbmQgdXNlIFRTViBhcyBkZWZhdWx0KQogICJmb3JtYXRzIjogewogICAgIyBGb3JtYXQgdXNlZCBieSBkZWZhdWx0IChUU1Ygb3IgSlNPTikKICAgICMgT3B0aW9uYWwsIGRlZmF1bHQgdmFsdWUgVFNWCiAgICAiZGVmYXVsdCI6ICJKU09OIiwKICAgICMgU2NoZW1hcyB0byBiZSBzaHJlZGRlZCBhcyBKU09OcywgY29ycmVzcG9uZGluZyBKU09OUGF0aCBmaWxlcyBtdXN0IGJlIHByZXNlbnQuIEF1dG9taWdyYXRpb25zIHdpbGwgYmUgZGlzYWJsZWQKICAgICMgT3B0aW9uYWwsIGRlZmF1bHQgdmFsdWUgW10KICAgICJqc29uIjogWyBdLAogICAgIyBTY2hlbWFzIHRvIGJlIHNocmVkZGVkIGFzIFRTVnMsIHByZXNlbmNlIG9mIHRoZSBzY2hlbWEgb24gSWdsdSBTZXJ2ZXIgaXMgbmVjZXNzYXJ5LiBBdXRvbWlnYXJ0aW9ucyBlbmFibGVkCiAgICAjIE9wdGlvbmFsLCBkZWZhdWx0IHZhbHVlIFtdCiAgICAidHN2IjogWyBdLAogICAgIyBTY2hlbWFzIHRoYXQgd29uJ3QgYmUgbG9hZGVkCiAgICAjIE9wdGlvbmFsLCBkZWZhdWx0IHZhbHVlIFtdCiAgICAic2tpcCI6IFsgXQogIH0sCgogICMgT2JzZXJ2YWJpbGl0eSBhbmQgcmVwb3J0aW5nIG9wdGlvbnMKICAjIm1vbml0b3JpbmciOiB7CiAgICAjIE9wdGlvbmFsLCBmb3IgdHJhY2tpbmcgcnVudGltZSBleGNlcHRpb25zCiAgIyAgInNlbnRyeSI6IHsKICAgIyAgICJkc24iOiAiaHR0cDovL3NlbnRyeS5hY21lLmNvbSIKICAgICN9CiAjIH0KfQo='
INFO Environment:
  PATH=/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/aws/puppet/bin/
  SECURITY_PROPERTIES=/emr/instance-controller/lib/security.properties
  HISTCONTROL=ignoredups
  HISTSIZE=1000
  HADOOP_ROOT_LOGGER=INFO,DRFA
  JAVA_HOME=/etc/alternatives/jre
  AWS_DEFAULT_REGION=eu-west-1
  LANG=en_US.UTF-8
  MAIL=/var/spool/mail/hadoop
  LOGNAME=hadoop
  PWD=/
  HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-2KY1L1IFX4486/tmp
  _=/etc/alternatives/jre/bin/java
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  SHELL=/bin/bash
  QTINC=/usr/lib64/qt-3.3/include
  USER=hadoop
  HADOOP_LOGFILE=syslog
  HOSTNAME=ip-11-222-232-22
  QTDIR=/usr/lib64/qt-3.3
  HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-2KY1L1IFX4486
  EMR_STEP_ID=s-2KY1L1IFX4486
  QTLIB=/usr/lib64/qt-3.3/lib
  HOME=/home/hadoop
  SHLVL=1
  HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-2KY1L1IFX4486/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-2KY1L1IFX4486/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-2KY1L1IFX4486
INFO ProcessRunner started child process 4936
2022-02-16T05:50:01.976Z INFO HadoopJarStepRunner.Runner: startRun() called for s-2KY1L1IFX4486 Child Pid: 4936
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO Process 4936 still running
INFO Process 4936 still running
INFO Process 4936 still running
INFO Process 4936 still running

Hey @mgloel,

One more very useful source of information in newer AMIs is the Spark History Server. You can find it under: Application user interfaces → Spark History UI.

In my experience it’s more reliable than the Monitoring tab, but you’d have to interpret it yourself as there’s quite a lot of detail. What you’re after is a big difference in step duration (between your old and new jobs) and/or failing executors.

Thanks for the tip @anton
The Spark history looks quite consistent. This is just an excerpt, but it keeps repeating in a similar fashion:

{"Event":"SparkListenerTaskEnd","Stage ID":40,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"Success"},"Task Info":{"Task ID":163683,"Index":64713,"Attempt":0,"Launch Time":1645007530247,"Executor ID":"1","Host":"ip-11-222-232-20.eu-west-1.compute.internal","Locality":"RACK_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1645007530333,"Failed":false,"Killed":false,"Accumulables":[{"ID":899,"Name":"internal.metrics.executorDeserializeTime","Update":1,"Value":55836,"Internal":true,"Count Failed Values":true},{"ID":900,"Name":"internal.metrics.executorDeserializeCpuTime","Update":1031126,"Value":67246190184,"Internal":true,"Count Failed Values":true},{"ID":901,"Name":"internal.metrics.executorRunTime","Update":83,"Value":7644180,"Internal":true,"Count Failed Values":true},{"ID":902,"Name":"internal.metrics.executorCpuTime","Update":11263882,"Value":912464174092,"Internal":true,"Count Failed Values":true},{"ID":903,"Name":"internal.metrics.resultSize","Update":1491,"Value":96290301,"Internal":true,"Count Failed Values":true},{"ID":917,"Name":"internal.metrics.shuffle.write.bytesWritten","Update":2707,"Value":298503562,"Internal":true,"Count Failed Values":true},{"ID":918,"Name":"internal.metrics.shuffle.write.recordsWritten","Update":3,"Value":380812,"Internal":true,"Count Failed Values":true},{"ID":919,"Name":"internal.metrics.shuffle.write.writeTime","Update":433450,"Value":32475066535,"Internal":true,"Count Failed Values":true},{"ID":920,"Name":"internal.metrics.input.bytesRead","Update":1208,"Value":116379921,"Internal":true,"Count Failed Values":true},{"ID":921,"Name":"internal.metrics.input.recordsRead","Update":3,"Value":380812,"Internal":true,"Count Failed Values":true}]},"Task Executor Metrics":{"JVMHeapMemory":0,"JVMOffHeapMemory":0,"OnHeapExecutionMemory":0,"OffHeapExecutionMemory":0,"OnHeapStorageMemory":0,"OffHeapStorageMemory":0,"OnHeapUnifiedMemory":0,"OffHeapUnifiedMemory":0,"DirectPoolMemory":0,"MappedPoolMemory":0,"ProcessTreeJVMVMemory":0,"ProcessTreeJVMRSSMemory":0,"ProcessTreePythonVMemory":0,"ProcessTreePythonRSSMemory":0,"ProcessTreeOtherVMemory":0,"ProcessTreeOtherRSSMemory":0,"MinorGCCount":0,"MinorGCTime":0,"MajorGCCount":0,"MajorGCTime":0},"Task Metrics":{"Executor Deserialize Time":1,"Executor Deserialize CPU Time":1031126,"Executor Run Time":83,"Executor CPU Time":11263882,"Peak Execution Memory":0,"Result Size":1491,"JVM GC Time":0,"Result Serialization Time":0,"Memory Bytes Spilled":0,"Disk Bytes Spilled":0,"Shuffle Read Metrics":{"Remote Blocks Fetched":0,"Local Blocks Fetched":0,"Fetch Wait Time":0,"Remote Bytes Read":0,"Remote Bytes Read To Disk":0,"Local Bytes Read":0,"Total Records Read":0},"Shuffle Write Metrics":{"Shuffle Bytes Written":2707,"Shuffle Write Time":433450,"Shuffle Records Written":3},"Input Metrics":{"Bytes Read":1208,"Records Read":3},"Output Metrics":{"Bytes Written":0,"Records Written":0},"Updated Blocks":[]}}
{"Event":"SparkListenerTaskStart","Stage ID":40,"Stage Attempt ID":0,"Task Info":{"Task ID":163685,"Index":64715,"Attempt":0,"Launch Time":1645007530468,"Executor ID":"1","Host":"ip-11-222-232-20.eu-west-1.compute.internal","Locality":"RACK_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Killed":false,"Accumulables":[]}}. .. ````

Would it make sense to switch to a bigger instance type? We ran an r5.24xlarge to clean things up; however, this will get quite expensive.

We opted for the following (see the config excerpt below):

Master:
m4.large
Core:
r5.12xlarge

It runs smoothly and takes only 13 minutes. It might get a bit expensive in the long run though. :smiley:
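
For reference, this is roughly what the instances section of our dataflow-runner config looks like now — a sketch, assuming everything else (counts, EBS settings, task group) stays the same as in the original config above:

      "instances": {
          "core": {
              "count": 1,
              "type": "r5.12xlarge"
          },
          "master": {
              "ebsConfiguration": {
                  "ebsBlockDeviceConfigs": [],
                  "ebsOptimized": true
              },
              "type": "m4.large"
          },
          "task": {
              "bid": "0.015",
              "count": 0,
              "type": "m5.xlarge"
          }
      }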