Snowflake Transformer failing on long job


#1

I’m running the Snowflake transformer on a large backlog of data, so the job is running for 6+ hours. It’s just failed with the following message:

Failure Message
18/07/02 04:30:09 INFO Client: Application report for application_1530484047344_0001 (state: FINISHED)
18/07/02 04:30:09 INFO Client: 
	 client token: N/A
	 diagnostics: User class threw exception: shadeaws.services.dynamodbv2.model.AmazonDynamoDBException: The security token included in the request is expired (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ExpiredTokenException; Request ID: 68RCJVDDVAOET7N9VGO6GJPRMFVV4KQNSO5AEMVJF66Q9ASUAAJG)
	 ApplicationMaster host: 172.31.40.159
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1530484225027
	 final status: FAILED
	 tracking URL: http://ip-172-31-43-69.eu-west-1.compute.internal:20888/proxy/application_1530484047344_0001/
	 user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1530484047344_0001 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/07/02 04:30:09 INFO ShutdownHookManager: Shutdown hook called
18/07/02 04:30:09 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-18b2a698-11d7-4a93-a965-0d5c38c68f3f
Command exiting with ret '1'

I’m assuming I can just re-run to carry on where it left off? Is there anything I can do to avoid this error in future?

Thanks!

Iain


#2

Hi @iain,

The actually useful message should be in the YARN logs, somewhere in the EMR logs:

[jobflow-id]/containers/application_1530484047344_0001/stderr.gz

I assume you’ll find there that your DynamoDB token has expired and the Transformer just couldn’t write back to the table. The problem is that the Transformer acquires a token once at the beginning, and that token can no longer be used after several hours.
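
To illustrate the failure mode, here is a minimal sketch of the usual alternative to fetching a token once: a source that fetches a new token shortly before the cached one expires. This is not the Transformer's code. `Token`, `RefreshingTokenSource`, `fetch` and `margin` are hypothetical names used only for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Token:
    value: str
    expires_at: float  # Unix timestamp in seconds


class RefreshingTokenSource:
    """Hypothetical helper: returns a cached token, refetching it near expiry."""

    def __init__(
        self,
        fetch: Callable[[], Token],
        margin: float = 300.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch    # function that obtains a fresh token
        self._margin = margin  # refresh this many seconds before expiry
        self._clock = clock    # injectable clock, handy for testing
        self._token = fetch()  # initial token, acquired at start-up

    def get(self) -> Token:
        # A job that only ever uses the start-up token fails once it expires;
        # checking the expiry on every use avoids that.
        if self._clock() >= self._token.expires_at - self._margin:
            self._token = self._fetch()
        return self._token
```

A long-running job would call `get()` before each request instead of holding on to the token it obtained at start-up.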

You cannot simply restart the pipeline, because in that case the Transformer (and Loader) will simply skip the folder that was not marked as “processed”. So you need to manually fix the manifest table (and probably S3).

If the Transformer did indeed process multiple folders and just happened to get stuck on DynamoDB after the Nth one, you can just delete the S3 folder from the Snowflake stageUrl (not from enriched.archive!) and the corresponding record from the manifest.

If the Transformer processed only a single folder and it took 6 hours, then it will likely fail again, so you will also have to bump the EMR cluster.

It is also possible to mark the folder as “processed” manually to avoid processing it again (which can be appealing in the case of a very big folder), but mutating the manifest is dangerous and you can easily end up with an inconsistent state, so we advise just deleting the DynamoDB record and the S3 folder and starting over.
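
The "delete and start over" cleanup could be scripted roughly as below. This is only a sketch under assumptions: the functions expect boto3-style clients (`boto3.client("s3")` and `boto3.client("dynamodb")`), and the bucket, prefix, table name and key attribute name are placeholders that you must check against your own stageUrl and manifest table before running anything.

```python
from typing import Any


def delete_s3_prefix(s3: Any, bucket: str, prefix: str) -> int:
    """Delete every object under `prefix`; return the number of keys deleted."""
    deleted = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no objects
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # each page holds at most 1000 keys, the delete_objects limit
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted


def delete_manifest_record(
    dynamodb: Any, table: str, key_name: str, run_id: str
) -> None:
    """Delete one manifest record; `key_name` is the table's key attribute
    (an assumption here, so verify it against your table)."""
    dynamodb.delete_item(TableName=table, Key={key_name: {"S": run_id}})
```

Point the prefix at the folder of the failed run under the Snowflake stageUrl, never at enriched.archive, and pass the same run's identifier as `run_id`.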


#3

Thanks Anton, it was a multiple run job so I have deleted the DynamoDB record and staged data and run again.