Snowplow Event Recovery EMR Errors

irufus · February 15, 2019, 7:59pm

Hi,

I’m trying to get Event Recovery working following the release last month.
However, when I follow the steps in the docs, I’m unable to get the step to execute on EMR.
If I supply the MainClass (as shown in the docs) I get the error:
Unexpected argument: com.snowplowanalytics.snowplow.event.recovery.Main
If I don’t supply that, I get the error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf

The cluster was created with the following config:
aws emr create-cluster --release-label emr-5.19.0 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=1,InstanceType=m4.large \
  --use-default-roles \
  --ec2-attributes SubnetIds=,KeyName=snowplow-ec2 \
  --applications Name=Spark Name=Hadoop \
  --name="Snowplow Event Recovery" \
  --log-uri s3://snowplow/logs

Are there any known issues around this or anything obvious I’m likely to have missed?

BenFradet · February 18, 2019, 8:55am

Hey @irufus,

I think the CLI example in the README is wrong. Here’s what we use internally:

jar: command-runner.jar
args:

[
  "spark-submit",
  "--class", "com.snowplowanalytics.snowplow.event.recovery.Main",
  "--master", "yarn",
  "--deploy-mode", "cluster",
  "s3://snowplow-hosted-assets/3-enrich/snowplow-event-recovery/snowplow-event-recovery-spark-0.1.0.jar",
  "--input", "hdfs:///local/to-recover/",
  "--output", "hdfs:///local/recovered/",
  "--config", "..."
]

Just logged an issue to that effect: https://github.com/snowplow-incubator/snowplow-event-recovery/issues/22
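
For anyone wanting to submit the arguments above from the command line, here is a minimal sketch, assuming the step is added to an already running cluster with `aws emr add-steps`. The step name, the `ActionOnFailure` value, the file name `recovery-step.json` and the `CLUSTER_ID` variable are illustrative assumptions, not taken from the thread; the `"..."` config placeholder is kept as posted.

```shell
# Write an EMR step definition that wraps spark-submit in command-runner.jar.
# Name and ActionOnFailure are assumptions; "..." stands for the elided config.
cat > recovery-step.json <<'EOF'
[
  {
    "Type": "CUSTOM_JAR",
    "Name": "Snowplow Event Recovery",
    "ActionOnFailure": "CONTINUE",
    "Jar": "command-runner.jar",
    "Args": [
      "spark-submit",
      "--class", "com.snowplowanalytics.snowplow.event.recovery.Main",
      "--master", "yarn",
      "--deploy-mode", "cluster",
      "s3://snowplow-hosted-assets/3-enrich/snowplow-event-recovery/snowplow-event-recovery-spark-0.1.0.jar",
      "--input", "hdfs:///local/to-recover/",
      "--output", "hdfs:///local/recovered/",
      "--config", "..."
    ]
  }
]
EOF

# Submit to an existing cluster. CLUSTER_ID is a placeholder, so the call is
# left commented out here:
# aws emr add-steps --cluster-id "$CLUSTER_ID" --steps file://recovery-step.json
```

Keeping the step in a JSON file avoids the shell quoting problems that come with passing a long argument list inline.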

irufus · February 19, 2019, 6:48pm

Thanks @BenFradet!
That got things moving - the step is running now. I’ll try and address that issue later on tonight if it’s still open then.

I’m still having issues though if you can help.
I’m seeing the step running but it doesn’t finish. All I can see in the available logs are messages saying that it’s running (in the stderr logs for some reason?):
19/02/19 17:46:17 INFO Client: Application report for application_1550597023778_0001 (state: RUNNING)
And in the controller logs:
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO Process still running

I’m seeing no output at all; the output directory hasn’t even been created.

I’m testing this on a small amount of data, so I wouldn’t expect the job to take more than a few minutes to run. Even with invalid config I’d expect the job to complete with nothing recovered?
I don’t know if it makes a difference, but my input and output are S3, not HDFS. I assumed that was fine, though, given that I received an exception about the output already existing when I created it beforehand.

BenFradet · February 20, 2019, 9:19am

Hey @irufus,

Internally, we only read/write from/to HDFS and we use s3-dist-cp for HDFS <=> S3, so I wouldn’t be able to advise.

However, we’re keen on hearing what you find out!
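
As a rough sketch of the HDFS <=> S3 copy pattern described above, assuming the HDFS paths from the earlier spark-submit example: the two S3 locations below are hypothetical placeholders. The commands are only assembled and printed here; on EMR they would run on the cluster, for example as separate steps before and after the recovery job.

```shell
# Hypothetical S3 locations; replace with your own buckets.
BAD_ROWS_S3="s3://my-bucket/bad-rows/"
RECOVERED_S3="s3://my-bucket/recovered/"

# Copy bad rows from S3 into HDFS before the recovery job runs.
COPY_IN="s3-dist-cp --src ${BAD_ROWS_S3} --dest hdfs:///local/to-recover/"

# Copy recovered events from HDFS back to S3 once the job has finished.
COPY_OUT="s3-dist-cp --src hdfs:///local/recovered/ --dest ${RECOVERED_S3}"

echo "$COPY_IN"
echo "$COPY_OUT"
```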

irufus · February 20, 2019, 3:25pm

It was just the configuration of the EMR cluster. I changed that to match the cluster we use for the ETL job and it completed in about a minute.

Thanks for the help @BenFradet!

Milan_Mathew · January 7, 2021, 4:22pm

@BenFradet Sorry to reopen this case.
I am also facing the same issue of java.lang.ClassNotFoundException: com.snowplowanalytcs.snowplow.event.recovery.Main

Here are the EMR Spark application arguments:
spark-submit --class com.snowplowanalytcs.snowplow.event.recovery.Main --master yarn --deploy-mode cluster s3://snowplow-hosted-assets/3-enrich/snowplow-event-recovery/snowplow-event-recovery-spark-0.1.0.jar --input s3://sp-dev-badevents/ --output s3://sp-dev-badevents/ --config ewogICJzY2hlbWEiOiAiaWdsdTpjb20uc25vd3Bsb3dhbmFseXRpY3Muc25vd3Bsb3cvcmVjb3Zlcmllcy9qc29uc2NoZW1hLzEtMC0wIiwKICAiZGF0YSI6IFt7CiAgICAibmFtZSI6ICJSZXBsYWNlSW5CYXNlNjRGaWVsZEluQm9keSIsCiAgICAiZXJyb3IiOiAiaW5zdGFuY2UgdmFsdWUgKFwib2ZmIHBsYW5cIikgbm90IGZvdW5kIGluIGVudW0gKHBvc3NpYmxlIHZhbHVlczogW1wib2ZmX3BsYW5cIixcImNvbXBsZXRlZFwiLG51bGxdKVxuICBsZXZlbDogXCJlcnJvclwiXG4gc2NoZW1hOiB7XCJsb2FkaW5nVVJJXCI6XCIjXCIsXCJwb2ludGVyXCI6XCIvcHJvcGVydGllcy9jb21wbGV0aW9uX3N0YXR1c1wifVxuIGluc3RhbmNlOiB7XCJwb2ludGVyXCI6XCIvY29tcGxldGlvbl9zdGF0dXNcIn1cbiAgICBkb21haW46IFwidmFsaWRhdGlvblwiXG4gICAga2V5d29yZDogXCJlbnVtXCJcbiAgICB2YWx1ZTogXCJvZmYgcGxhblwiXG4gICAgZW51bTogW1wib2ZmX3BsYW5cIixcImNvbXBsZXRlZFwiLG51bGxdXG4iLAogICAgImJhc2U2NEZpZWxkIjogImN4IiwKICAgICJ0b1JlcGxhY2UiOiAiXCJvZmYgcGxhblwiIiwKICAgICJyZXBsYWNlbWVudCI6ICJcIm9mZiBwbGFuXCI6XCJvZmZfcGxhblwiIgogIH1dCn0=

Can you please guide me?
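
The `--config` value in the command above is base64-encoded JSON. Here is a minimal sketch of encoding a config and decoding it again to check what the job will receive. The JSON used is a shortened, hypothetical config with an empty `data` array, not the one from the post, and `-d` is the decode flag of GNU `base64`.

```shell
# Hypothetical, shortened recovery config for illustration only.
CONFIG_JSON='{"schema":"iglu:com.snowplowanalytics.snowplow/recoveries/jsonschema/1-0-0","data":[]}'

# Encode on a single line so it can be passed as one --config argument.
ENCODED=$(printf '%s' "$CONFIG_JSON" | base64 | tr -d '\n')

# Decode again to confirm the round trip before submitting the step.
DECODED=$(printf '%s' "$ENCODED" | base64 -d)

echo "$ENCODED"
echo "$DECODED"
```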

ihor · January 8, 2021, 9:22pm

@Milan_Mathew, could you check my reply in Snowplow Event recovery?
