[shred] spark: Shred Enriched Events - Step does not end

Friends, we need some help.

We have had the snowplow for many years, without any problems.

Since yesterday, the step [shred] spark: Shred Enriched Events does not finish.

It throws no errors, and I can't identify anything wrong in the logs.

Could you help us? It's very urgent.

The logs are below.

Controller

2021-01-28T14:57:27.311Z INFO Ensure step 3 jar file command-runner.jar
2021-01-28T14:57:27.311Z INFO StepRunner: Created Runner for step 3
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --class com.snowplowanalytics.snowplow.storage.spark.ShredJob --master yarn --deploy-mode cluster s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar --iglu-config ewogInNjaGVtYSI6ICJpZ2x1OmNvbS5zbm93cGxvd2FuYWx5dGljcy5pZ2x1L3Jlc29sdmVyLWNvbmZpZy9qc29uc2NoZW1hLzEtMC0xIiwKICJkYXRhIjogewogICJjYWNoZVNpemUiOiA1MDAsCiAgInJlcG9zaXRvcmllcyI6IFsKICAgewogICAgIm5hbWUiOiAiSWdsdSBDZW50cmFsIC0gR0NQIE1pcnJvciIsCiAgICAicHJpb3JpdHkiOiAyLAogICAgInZlbmRvclByZWZpeGVzIjogWwogICAgICJjb20uc25vd3Bsb3dhbmFseXRpY3MiCiAgICBdLAogICAgImNvbm5lY3Rpb24iOiB7CiAgICAgImh0dHAiOiB7CiAgICAgICJ1cmkiOiAiaHR0cDovL21pcnJvcjAxLmlnbHVjZW50cmFsLmNvbSIKICAgICB9CiAgICB9CiAgIH0sCiAgIHsKICAgICJuYW1lIjogIklnbHUgQ2VudHJhbCIsCiAgICAicHJpb3JpdHkiOiAxLAogICAgInZlbmRvclByZWZpeGVzIjogWwogICAgICJjb20uc25vd3Bsb3dhbmFseXRpY3MiCiAgICBdLAogICAgImNvbm5lY3Rpb24iOiB7CiAgICAgImh0dHAiOiB7CiAgICAgICJ1cmkiOiAiaHR0cDovL2lnbHVjZW50cmFsLmNvbSIKICAgICB9CiAgICB9CiAgIH0sCiAgIHsKICAgICJuYW1lIjogIkdldE5pbmphcyIsCiAgICAicHJpb3JpdHkiOiAwLAogICAgInZlbmRvclByZWZpeGVzIjogWwogICAgICJici5jb20uZ2V0bmluamFzIiwKICAgICAiY29tLnNub3dwbG93YW5hbHl0aWNzIgogICAgXSwKICAgICJjb25uZWN0aW9uIjogewogICAgICJodHRwIjogewogICAgICAidXJpIjogImh0dHA6Ly9nbi1pZ2x1LXNlcnZlci5zMy13ZWJzaXRlLXVzLWVhc3QtMS5hbWF6b25hd3MuY29tIgogICAgIH0KICAgIH0KICAgfQogIF0KIH0KfQ== --input-folder hdfs:///local/snowplow/enriched-events/* --output-folder hdfs:///local/snowplow/shredded-events/ --bad-folder s3a://XXXXXXX/shredded/bad/run=2021-01-28-14-52-14/'
INFO Environment:
  PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
  LESS_TERMCAP_md=[01;38;5;208m
  LESS_TERMCAP_me=[0m
  HISTCONTROL=ignoredups
  LESS_TERMCAP_mb=[01;31m
  AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
  UPSTART_JOB=rc
  LESS_TERMCAP_se=[0m
  HISTSIZE=1000
  HADOOP_ROOT_LOGGER=INFO,DRFA
  JAVA_HOME=/etc/alternatives/jre
  AWS_DEFAULT_REGION=us-east-1
  AWS_ELB_HOME=/opt/aws/apitools/elb
  LESS_TERMCAP_us=[04;38;5;111m
  EC2_HOME=/opt/aws/apitools/ec2
  TERM=linux
  XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
  runlevel=3
  LANG=en_US.UTF-8
  AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
  MAIL=/var/spool/mail/hadoop
  LESS_TERMCAP_ue=[0m
  LOGNAME=hadoop
  PWD=/
  LANGSH_SOURCED=1
  HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-3P9BP0OWDYPO7/tmp
  _=/etc/alternatives/jre/bin/java
  CONSOLETYPE=serial
  RUNLEVEL=3
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  previous=N
  UPSTART_EVENTS=runlevel
  AWS_PATH=/opt/aws
  USER=hadoop
  UPSTART_INSTANCE=
  PREVLEVEL=N
  HADOOP_LOGFILE=syslog
  PYTHON_INSTALL_LAYOUT=amzn
  HOSTNAME=ip-10-0-51-115
  NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
  HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-3P9BP0OWDYPO7
  EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
  SHLVL=5
  HOME=/home/hadoop
  HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-3P9BP0OWDYPO7/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-3P9BP0OWDYPO7/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-3P9BP0OWDYPO7
INFO ProcessRunner started child process 10067 :
hadoop   10067  4623  0 14:57 ?        00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --class com.snowplowanalytics.snowplow.storage.spark.ShredJob --master yarn --deploy-mode cluster s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar --iglu-config ewogInNjaGVtYSI6ICJpZ2x1OmNvbS5zbm93cGxvd2FuYWx5dGljcy5pZ2x1L3Jlc29sdmVyLWNvbmZpZy9qc29uc2NoZW1hLzEtMC0xIiwKICJkYXRhIjogewogICJjYWNoZVNpemUiOiA1MDAsCiAgInJlcG9zaXRvcmllcyI6IFsKICAgewogICAgIm5hbWUiOiAiSWdsdSBDZW50cmFsIC0gR0NQIE1pcnJvciIsCiAgICAicHJpb3JpdHkiOiAyLAogICAgInZlbmRvclByZWZpeGVzIjogWwogICAgICJjb20uc25vd3Bsb3dhbmFseXRpY3MiCiAgICBdLAogICAgImNvbm5lY3Rpb24iOiB7CiAgICAgImh0dHAiOiB7CiAgICAgICJ1cmkiOiAiaHR0cDovL21pcnJvcjAxLmlnbHVjZW50cmFsLmNvbSIKICAgICB9CiAgICB9CiAgIH0sCiAgIHsKICAgICJuYW1lIjogIklnbHUgQ2VudHJhbCIsCiAgICAicHJpb3JpdHkiOiAxLAogICAgInZlbmRvclByZWZpeGVzIjogWwogICAgICJjb20uc25vd3Bsb3dhbmFseXRpY3MiCiAgICBdLAogICAgImNvbm5lY3Rpb24iOiB7CiAgICAgImh0dHAiOiB7CiAgICAgICJ1cmkiOiAiaHR0cDovL2lnbHVjZW50cmFsLmNvbSIKICAgICB9CiAgICB9CiAgIH0sCiAgIHsKICAgICJuYW1lIjogIkdldE5pbmphcyIsCiAgICAicHJpb3JpdHkiOiAwLAogICAgInZlbmRvclByZWZpeGVzIjogWwogICAgICJici5jb20uZ2V0bmluamFzIiwKICAgICAiY29tLnNub3dwbG93YW5hbHl0aWNzIgogICAgXSwKICAgICJjb25uZWN0aW9uIjogewogICAgICJodHRwIjogewogICAgICAidXJpIjogImh0dHA6Ly9nbi1pZ2x1LXNlcnZlci5zMy13ZWJzaXRlLXVzLWVhc3QtMS5hbWF6b25hd3MuY29tIgogICAgIH0KICAgIH0KICAgfQogIF0KIH0KfQ== --input-folder hdfs:///local/snowplow/enriched-events/* --output-folder hdfs:///local/snowplow/shredded-events/ --bad-folder s3a://XXXXXXXX/shredded/bad/run=2021-01-28-14-52-14/
2021-01-28T14:57:31.318Z INFO HadoopJarStepRunner.Runner: startRun() called for s-3P9BP0OWDYPO7 Child Pid: 10067
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
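As an aside for anyone reading along: the `--iglu-config` argument in the spark-submit command above is a base64-encoded Iglu resolver JSON. A minimal sketch of how to decode and inspect such a value (the sample string here is a small illustrative config, not the full value from the logs):

```python
import base64
import json

# Illustrative sample: a tiny resolver config encoded the same way as the
# --iglu-config argument above (NOT the full value from the logs).
encoded = base64.b64encode(
    json.dumps({
        "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
        "data": {"cacheSize": 500, "repositories": []},
    }).encode()
).decode()

# Decode back to readable JSON to verify repository URIs and priorities.
decoded = json.loads(base64.b64decode(encoded))
print(json.dumps(decoded, indent=2))
```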

stderr

Warning: Skip remote jar s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar.
21/01/28 14:57:32 INFO RMProxy: Connecting to ResourceManager at ip-10-0-51-115.ec2.internal/10.0.51.115:8032
21/01/28 14:57:32 INFO Client: Requesting a new application from cluster with 27 NodeManagers
21/01/28 14:57:32 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (54272 MB per container)
21/01/28 14:57:32 INFO Client: Will allocate AM container, with 30720 MB memory including 3072 MB overhead
21/01/28 14:57:32 INFO Client: Setting up container launch context for our AM
21/01/28 14:57:32 INFO Client: Setting up the launch environment for our AM container
21/01/28 14:57:32 INFO Client: Preparing resources for our AM container
21/01/28 14:57:33 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
21/01/28 14:57:35 INFO Client: Uploading resource file:/mnt/tmp/spark-e56d0893-d661-4b2d-8052-3c9325fdf007/__spark_libs__8806580616385151933.zip -> hdfs://ip-10-0-51-115.ec2.internal:8020/user/hadoop/.sparkStaging/application_1611845674308_0002/__spark_libs__8806580616385151933.zip
21/01/28 14:57:37 WARN RoleMappings: Found no mappings configured with 'fs.s3.authorization.roleMapping', credentials resolution may not work as expected
21/01/28 14:57:37 INFO Client: Uploading resource s3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar -> hdfs://ip-10-0-51-115.ec2.internal:8020/user/hadoop/.sparkStaging/application_1611845674308_0002/snowplow-rdb-shredder-0.13.1.jar
21/01/28 14:57:37 INFO S3NativeFileSystem: Opening 's3://snowplow-hosted-assets-us-east-1/4-storage/rdb-shredder/snowplow-rdb-shredder-0.13.1.jar' for reading
21/01/28 14:57:38 INFO Client: Uploading resource file:/mnt/tmp/spark-e56d0893-d661-4b2d-8052-3c9325fdf007/__spark_conf__8489487149018878201.zip -> hdfs://ip-10-0-51-115.ec2.internal:8020/user/hadoop/.sparkStaging/application_1611845674308_0002/__spark_conf__.zip
21/01/28 14:57:38 INFO SecurityManager: Changing view acls to: hadoop
21/01/28 14:57:38 INFO SecurityManager: Changing modify acls to: hadoop
21/01/28 14:57:38 INFO SecurityManager: Changing view acls groups to: 
21/01/28 14:57:38 INFO SecurityManager: Changing modify acls groups to: 
21/01/28 14:57:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
21/01/28 14:57:38 INFO Client: Submitting application application_1611845674308_0002 to ResourceManager
21/01/28 14:57:38 INFO YarnClientImpl: Submitted application application_1611845674308_0002
21/01/28 14:57:39 INFO Client: Application report for application_1611845674308_0002 (state: ACCEPTED)
21/01/28 14:57:39 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1611845858784
	 final status: UNDEFINED
	 tracking URL: http://ip-10-0-51-115.ec2.internal:20888/proxy/application_1611845674308_0002/
	 user: hadoop
21/01/28 14:57:40 INFO Client: Application report for application_1611845674308_0002 (state: ACCEPTED)
21/01/28 14:57:41 INFO Client: Application report for application_1611845674308_0002 (state: ACCEPTED)
21/01/28 14:57:42 INFO Client: Application report for application_1611845674308_0002 (state: ACCEPTED)
21/01/28 14:57:43 INFO Client: Application report for application_1611845674308_0002 (state: ACCEPTED)
21/01/28 14:57:44 INFO Client: Application report for application_1611845674308_0002 (state: RUNNING)
21/01/28 14:57:44 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 10.0.51.52
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1611845858784
	 final status: UNDEFINED
	 tracking URL: http://ip-10-0-51-115.ec2.internal:20888/proxy/application_1611845674308_0002/
	 user: hadoop
21/01/28 14:57:45 INFO Client: Application report for application_1611845674308_0002 (state: RUNNING)
21/01/28 14:57:46 INFO Client: Application report for application_1611845674308_0002 (state: RUNNING)
21/01/28 14:57:47 INFO Client: Application report for application_1611845674308_0002 (state: RUNNING)
21/01/28 14:57:48 INFO Client: Application report for application_1611845674308_0002 (state: RUNNING)
21/01/28 14:57:49 INFO Client: Application report for application_1611845674308_0002 (state: RUNNING)
21/01/28 14:57:50 INFO Client: Application report for application_1611845674308_0002 (state: RUNNING)
21/01/28 14:57:51 INFO Client: Application report for application_1611845674308_0002 (state: RUNNING)
21/01/28 14:57:52 INFO Client: Application report for application_1611845674308_0002 (state: RUNNING)
21/01/28 14:57:53 INFO Client: Application report for application_1611845674308_0002 (state: RUNNING)
21/01/28 14:57:54 INFO Client: Application report for application_1611845674308_0002 (state: RUNNING)
21/01/28 14:57:55 INFO Client: Application report for application_1611845674308_0002 (state: RUNNING)
21/01/28 14:57:56 INFO Client: Application report for application_1611845674308_0002 (state: RUNNING)
21/01/28 14:57:57 INFO Client: Application report for application_1611845674308_0002 (state: RUNNING)
21/01/28 14:57:58 INFO Client: Application report for application_1611845674308_0002 (state: RUNNING)

Hi @Mauro_Barletta,

In my experience, if the shred step just hangs, it usually means the cluster is not capable enough. Did you have any traffic spikes? I'd try a beefier cluster, or search the forum for advice on how to optimize your EMR cluster, e.g. Optimizing and reducing shredding/loading costs - #2 by ihor

Hi @Anton,

I also suspected this, as we have a very large volume of events.

The first test was to increase the master instance, from m4.2xlarge to r4.2xlarge.
We have 27 core r4.2xlarge instances.

The second test was to process just one file from the enriched folder.

Our enriched output is 11.1 GB.

Both tests show the same behavior: the job only generates bad files in the shredded folder, and all of them are empty.

That’s a very big cluster, @Mauro_Barletta; I don’t think it’s really justified. I’d recommend using fewer, but beefier, core instances instead. If it’s 11.1 GB of ungzipped data, try the following configuration:

configuration:
      yarn-site:
        yarn.nodemanager.vmem-check-enabled: "false"
        yarn.nodemanager.resource.memory-mb: "256000"
        yarn.scheduler.maximum-allocation-mb: "256000"
      spark:
        maximizeResourceAllocation: "false"
      spark-defaults:
        spark.dynamicAllocation.enabled: "false"
        spark.executor.instances: "9"
        spark.yarn.executor.memoryOverhead: "3072"
        spark.executor.memory: "22G"
        spark.executor.cores: "3"
        spark.yarn.driver.memoryOverhead: "3072"
        spark.driver.memory: "22G"
        spark.driver.cores: "3"
        spark.default.parallelism: "108"

This assumes one r5.8xlarge core instance and an m4.large master.
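For what it's worth, the sizing above can be sanity-checked against the capacity of one r5.8xlarge (32 vCPUs, 256 GB RAM); a rough back-of-the-envelope sketch:

```python
# Rough sanity check of the suggested Spark sizing on one r5.8xlarge
# (32 vCPUs, 256 GB RAM). YARN's usable memory is capped at 256000 MB
# by yarn.nodemanager.resource.memory-mb in the config above.
node_memory_mb = 256000
executors = 9
executor_memory_mb = 22 * 1024    # spark.executor.memory = 22G
overhead_mb = 3072                # spark.yarn.executor.memoryOverhead
executor_cores = 3                # spark.executor.cores

executors_memory_mb = executors * (executor_memory_mb + overhead_mb)
total_cores = executors * executor_cores

# In cluster deploy mode the driver (22G + 3072 MB overhead) also runs
# on YARN; 9 executors + 1 driver = 10 x 25600 MB = exactly 256000 MB.
total_memory_mb = executors_memory_mb + (executor_memory_mb + overhead_mb)

print(executors_memory_mb)  # executor memory demand
print(total_memory_mb)      # fits the 256000 MB YARN cap exactly
print(total_cores)          # 27 cores on 32 vCPUs, leaving headroom
print(total_cores * 4)      # parallelism 108 = ~4 tasks per core
```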

Also, you’re saying your bad bucket is full of files. Likely this is your problem. What data is in there?

@anton!

I will try the configuration you recommended.

The job creates several files in the structure below, as in the example, but they are all empty.

XXXXXXX/teste/shredded/bad/run=2021-01-28-03-42-45/_temporary/0/task_20210128183530_0001_m_000000/
part-00000-db0f786a-5385-4de8-8c33-507105b9688d-c000.txt
emr:
    ami_version: 5.9.0
    region: us-east-1
    jobflow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole
    placement:
    ec2_subnet_id: subnet-c05d1099
    ec2_key_name: snowplow
    bootstrap: []
    software:
      hbase:
      lingual:
    jobflow:
      job_name: Snowplow Enrich and Shred
      master_instance_type: r4.2xlarge
      core_instance_count: 27
      core_instance_type: r4.2xlarge
      core_instance_ebs:      # Optional. Attach an EBS volume to each core instance.
        volume_size: 300      # Gigabytes
        volume_type: "gp2"
        volume_iops:          # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: true   # Optional. Will default to true
      task_instance_count: 0
      task_instance_type: m3.2xlarge
      task_instance_bid: 0.532
    bootstrap_failure_tries: 3
    configuration:
      spark:
        maximizeResourceAllocation: "false"
      spark-defaults:
        spark.executor.instances: "47"
        spark.yarn.executor.memoryOverhead: "3072"
        spark.executor.memory: "27G"
        spark.yarn.driver.memoryOverhead: "3072"
        spark.driver.memory: "27G"
        spark.executor.cores: "3"
        spark.driver.cores: "3"
        spark.default.parallelism: "282"
collectors:
  format: thrift
enrich:
  versions:
    spark_enrich: 1.16.0
  continue_on_unexpected_error: false
  output_compression: GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0
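If anyone wants to confirm whether the bad rows are genuinely empty (as opposed to just `_temporary` placeholders left by uncommitted tasks), one way, assuming the bad output has first been synced to a local directory, is a small helper like this (hypothetical, not part of the pipeline):

```python
from pathlib import Path


def nonempty_bad_files(bad_dir: str) -> list[Path]:
    """Return the part files under bad_dir that actually contain data,
    skipping Hadoop's _temporary working directories."""
    root = Path(bad_dir)
    return [
        p for p in root.rglob("part-*")
        if "_temporary" not in p.parts and p.stat().st_size > 0
    ]


# Hypothetical usage, after e.g. `aws s3 sync s3://<bucket>/shredded/bad ./bad`:
# for f in nonempty_bad_files("./bad"):
#     print(f, f.stat().st_size)
```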

Good night! Luckily, we found the problem. The S3 bucket where the shredded data was written had an instability problem that made the copy very slow: a process that usually takes about 40 minutes took more than 5 hours, which is why we believed it was stuck.
@anton thank you for your attention.

Glad you found the problem, @Mauro_Barletta! Nevertheless, I wanted to point out that you are using a quite outdated version of the pipeline. We strongly recommend that you upgrade: recent versions are more performant, easier to debug, and less fragile.