Snowplow Event Recovery EMR Errors


#1

Hi,

I’m trying to get Event Recovery working following the release last month.
However, when I follow the steps in the docs, I can't get the step to execute on EMR.
If I supply the MainClass (as shown in the docs) I get the error:
Unexpected argument: com.snowplowanalytics.snowplow.event.recovery.Main
If I don’t supply that, I get the error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf

The cluster was created with the following config:
aws emr create-cluster --release-label emr-5.19.0 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=1,InstanceType=m4.large \
  --use-default-roles \
  --ec2-attributes SubnetIds=,KeyName=snowplow-ec2 \
  --applications Name=Spark Name=Hadoop \
  --name="Snowplow Event Recovery" \
  --log-uri s3://snowplow/logs

Are there any known issues around this or anything obvious I’m likely to have missed?


#3

Hey @irufus,

I think the CLI example in the README is wrong. Here’s what we use internally:

  • jar: command-runner.jar
  • args:
[
  "spark-submit",
  "--class", "com.snowplowanalytics.snowplow.event.recovery.Main",
  "--master", "yarn",
  "--deploy-mode", "cluster",
  "s3://snowplow-hosted-assets/3-enrich/snowplow-event-recovery/snowplow-event-recovery-spark-0.1.0.jar",
  "--input", "hdfs:///local/to-recover/",
  "--output", "hdfs:///local/recovered/",
  "--config", "..."
]

Just logged an issue to that effect: https://github.com/snowplow-incubator/snowplow-event-recovery/issues/22
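
For anyone submitting this from a script rather than the console, here is a minimal sketch of the same step as a Python dict, in the shape that boto3's EMR add_job_flow_steps call takes. The helper name build_recovery_step, the step name and the ActionOnFailure value are my own assumptions for illustration, and the "..." config value is left elided exactly as above.

```python
import json

# Jar location copied from the args listed above.
RECOVERY_JAR = (
    "s3://snowplow-hosted-assets/3-enrich/snowplow-event-recovery/"
    "snowplow-event-recovery-spark-0.1.0.jar"
)


def build_recovery_step(input_path, output_path, config,
                        name="Snowplow Event Recovery"):
    """Build an EMR step that runs spark-submit through command-runner.jar.

    Hypothetical helper: it only assembles the step definition and does
    not talk to AWS.
    """
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # assumption; pick what suits your cluster
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--class", "com.snowplowanalytics.snowplow.event.recovery.Main",
                "--master", "yarn",
                "--deploy-mode", "cluster",
                RECOVERY_JAR,
                "--input", input_path,
                "--output", output_path,
                "--config", config,
            ],
        },
    }


if __name__ == "__main__":
    step = build_recovery_step(
        "hdfs:///local/to-recover/",
        "hdfs:///local/recovered/",
        "...",  # config left elided, as in the post above
    )
    print(json.dumps([step], indent=2))
    # To submit it (needs boto3 and AWS credentials; the cluster id is a placeholder):
    #   import boto3
    #   boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

The point is that the main class goes to spark-submit via --class inside the args, with command-runner.jar as the step jar, rather than being supplied as the step's own main class.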


#4

Thanks @BenFradet!
That got things moving - the step is running now. I’ll try and address that PR later on tonight if it’s still open then.

I’m still having issues though if you can help.
I’m seeing the step running but it doesn’t finish. All I can see in the available logs are messages saying that it’s running (in the stderr logs for some reason?):
19/02/19 17:46:17 INFO Client: Application report for application_1550597023778_0001 (state: RUNNING)
And in the controller logs:
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO Process still running

I’m seeing no output at all; the output directory hasn’t even been created.

I’m testing this on a small amount of data, so I wouldn’t expect the job to take more than a few minutes to run. Even with invalid config I’d expect the job to complete with nothing recovered?
I don’t know if it makes a difference, but my input and output are on S3, not HDFS. I assumed that was fine, though, given that I got an exception about the output already existing when I created it beforehand.


#5

Hey @irufus,

Internally, we only read/write from/to HDFS and we use s3-dist-cp for HDFS <=> S3, so I wouldn’t be able to advise.

However, we’re keen on hearing what you find out!
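
In case it helps, a rough sketch of what the s3-dist-cp copy steps around the recovery job could look like, again as step definitions for boto3's add_job_flow_steps. This is not our exact setup: the bucket and prefixes (s3://my-bucket/...) are placeholders, and the helper name and ActionOnFailure value are assumptions for illustration.

```python
def build_s3_dist_cp_step(src, dest, name):
    """Build an EMR step that copies data with s3-dist-cp via command-runner.jar.

    Hypothetical helper: it only assembles the step definition.
    """
    return {
        "Name": name,
        "ActionOnFailure": "CANCEL_AND_WAIT",  # assumption: stop later steps if a copy fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }


# Placeholder S3 locations; the HDFS paths match the ones used in the recovery step.
copy_in = build_s3_dist_cp_step(
    "s3://my-bucket/to-recover/", "hdfs:///local/to-recover/", "S3 to HDFS"
)
copy_out = build_s3_dist_cp_step(
    "hdfs:///local/recovered/", "s3://my-bucket/recovered/", "HDFS to S3"
)

if __name__ == "__main__":
    for step in (copy_in, copy_out):
        print(step["Name"], step["HadoopJarStep"]["Args"])
```

The recovery step itself would sit between the two copies, so the Spark job only ever reads from and writes to HDFS.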


#6

It was just the configuration of the EMR cluster: I changed it to match the cluster we use for the ETL job and the step completed in about a minute :)

Thanks for the help @BenFradet!