S3DistCp Step: Enriched HDFS -> S3 keeps failing


#1

Hi there,

I am trying to set up the EmrEtlRunner process, but it always fails at the S3DistCp Enriched HDFS -> S3 step.

I’ve checked previous posts on the same error, but none of the solutions apply to my case.

The topology is the Lambda one:

Trackers (JS, PHP, Pixel) -> Collector (Scala Stream) -> Kinesis (good/bad streams)

Kinesis (good/bad streams) -> Scala Stream Enrich -> Kinesis S3 (enriched) (THIS WORKS)
Kinesis (good/bad streams) -> Kinesis S3 (raw) -> EmrEtlRunner (THIS FAILS)

I’ve also tried running with --skip staging. I set the log level to DEBUG, but it didn’t add anything new.

I’ve double-checked the names of all Kinesis streams and S3 buckets; nothing is incorrect.

Any other ideas?

Exception in thread "main" java.lang.RuntimeException: Error running job
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:927)
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:705)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-39-89.us-west-2.compute.internal:8020/tmp/cdf15f73-76c7-40d5-a6cc-861d10048635/files
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:317)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
	at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:352)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:901)
	... 10 more

2017-09-25 04:19:37,296 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Running with args: --src hdfs:///local/snowplow/enriched-events/ --dest s3://XXXXX-events-enriched/good/run=2017-09-25-04-06-52/ --srcPattern .*part-.* --s3Endpoint s3-us-west-2.amazonaws.com
2017-09-25 04:19:38,345 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): S3DistCp args: --src hdfs:///local/snowplow/enriched-events/ --dest s3://XXXXX-events-enriched/good/run=2017-09-25-04-06-52/ --srcPattern .*part-.* --s3Endpoint s3-us-west-2.amazonaws.com
2017-09-25 04:19:38,421 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Using output path 'hdfs:/tmp/cdf15f73-76c7-40d5-a6cc-861d10048635/output'
2017-09-25 04:19:41,233 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Created 0 files to copy 0 files
2017-09-25 04:19:50,810 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Reducer number: 3
2017-09-25 04:19:51,066 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-39-89.us-west-2.compute.internal/172.31.39.89:8032
2017-09-25 04:19:52,511 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): Cleaning up the staging area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1506312682959_0004
2017-09-25 04:19:52,519 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Try to recursively delete hdfs:/tmp/cdf15f73-76c7-40d5-a6cc-861d10048635/tempspace

The log from the Enrich task:

17/09/25 04:17:46 INFO RMProxy: Connecting to ResourceManager at ip-172-31-39-89.us-west-2.compute.internal/172.31.39.89:8032
17/09/25 04:17:46 INFO Client: Requesting a new application from cluster with 2 NodeManagers
17/09/25 04:17:47 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2048 MB per container)
17/09/25 04:17:47 INFO Client: Will allocate AM container, with 2048 MB memory including 384 MB overhead
17/09/25 04:17:47 INFO Client: Setting up container launch context for our AM
17/09/25 04:17:47 INFO Client: Setting up the launch environment for our AM container
17/09/25 04:17:47 INFO Client: Preparing resources for our AM container
17/09/25 04:17:51 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/09/25 04:17:57 INFO Client: Uploading resource file:/mnt/tmp/spark-6a804fb0-92e6-4b13-b1f5-8c422e1c6d25/__spark_libs__1150425498665415086.zip -> hdfs://ip-172-31-39-89.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1506312682959_0003/__spark_libs__1150425498665415086.zip
17/09/25 04:18:12 INFO Client: Uploading resource s3://snowplow-hosted-assets-us-west-2/3-enrich/spark-enrich/snowplow-spark-enrich-1.9.0.jar -> hdfs://ip-172-31-39-89.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1506312682959_0003/snowplow-spark-enrich-1.9.0.jar
17/09/25 04:18:12 INFO S3NativeFileSystem: Opening 's3://snowplow-hosted-assets-us-west-2/3-enrich/spark-enrich/snowplow-spark-enrich-1.9.0.jar' for reading
17/09/25 04:18:18 INFO Client: Uploading resource file:/mnt/tmp/spark-6a804fb0-92e6-4b13-b1f5-8c422e1c6d25/__spark_conf__9044663379866809420.zip -> hdfs://ip-172-31-39-89.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1506312682959_0003/__spark_conf__.zip
17/09/25 04:18:18 INFO SecurityManager: Changing view acls to: hadoop
17/09/25 04:18:18 INFO SecurityManager: Changing modify acls to: hadoop
17/09/25 04:18:18 INFO SecurityManager: Changing view acls groups to:
17/09/25 04:18:18 INFO SecurityManager: Changing modify acls groups to:
17/09/25 04:18:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
17/09/25 04:18:18 INFO Client: Submitting application application_1506312682959_0003 to ResourceManager
17/09/25 04:18:18 INFO YarnClientImpl: Submitted application application_1506312682959_0003
17/09/25 04:18:19 INFO Client: Application report for application_1506312682959_0003 (state: ACCEPTED)
17/09/25 04:18:19 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1506313098722 final status: UNDEFINED tracking URL: http://ip-172-31-39-89.us-west-2.compute.internal:20888/proxy/application_1506312682959_0003/ user: hadoop
[identical ACCEPTED reports repeated every second until 04:18:36]
17/09/25 04:18:37 INFO Client: Application report for application_1506312682959_0003 (state: RUNNING)
17/09/25 04:18:37 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: 172.31.38.225 ApplicationMaster RPC port: 0 queue: default start time: 1506313098722 final status: UNDEFINED tracking URL: http://ip-172-31-39-89.us-west-2.compute.internal:20888/proxy/application_1506312682959_0003/ user: hadoop
[identical RUNNING reports repeated every second until 04:19:32]
17/09/25 04:19:33 INFO Client: Application report for application_1506312682959_0003 (state: FINISHED)
17/09/25 04:19:33 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: 172.31.38.225 ApplicationMaster RPC port: 0 queue: default start time: 1506313098722 final status: SUCCEEDED tracking URL: http://ip-172-31-39-89.us-west-2.compute.internal:20888/proxy/application_1506312682959_0003/ user: hadoop
17/09/25 04:19:33 INFO ShutdownHookManager: Shutdown hook called
17/09/25 04:19:33 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-6a804fb0-92e6-4b13-b1f5-8c422e1c6d25
Command exiting with ret '0'


#2

SOLUTION:

In the Kinesis LZO S3 sink I was using GZIP as the compression format instead of LZO. After changing it to LZO, everything worked fine.
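For anyone hitting the same issue, this is roughly the setting involved. The fragment below is a sketch from memory, not copied from my config, so check the exact key names against your version of the Kinesis S3 sink; the point is only the format value:

```
# Kinesis S3 sink configuration (fragment; other keys omitted)
s3 {
  region = "us-west-2"
  bucket = "XXXXX-events-raw"
  # Must be "lzo" when EmrEtlRunner expects LZO-compressed raw files.
  # Setting this to "gzip" was what caused the S3DistCp failure above.
  format = "lzo"
}
```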

SUGGESTION:

The enrichment process could detect this mismatch and emit an alert on stderr, so users do not waste time trying to figure it out.
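To illustrate the suggestion: the mismatch is detectable up front, because GZIP and LZOP files start with distinct magic bytes. A minimal sketch of such a check (the function name and structure are my own, not part of Snowplow):

```python
# Detect the compression format of a raw file by its magic bytes, so a
# GZIP/LZO mismatch could be reported before the EMR job even starts.
GZIP_MAGIC = b"\x1f\x8b"                      # standard gzip header
LZOP_MAGIC = b"\x89LZO\x00\x0d\x0a\x1a\x0a"   # standard lzop header

def detect_compression(header: bytes) -> str:
    """Return 'gzip', 'lzo', or 'unknown' from the leading bytes of a file."""
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    if header.startswith(LZOP_MAGIC):
        return "lzo"
    return "unknown"
```

A pre-flight step could read the first few bytes of one staged file and abort with a clear error if the result is not "lzo".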


#3

Glad you fixed it @cmartins, thanks for letting us know.