... Continuing the discussion from ETL Shred step taking longer and longer:
Didn't want to hijack the thread linked above, so here's a new one. My pipeline is dead in the water. The error logs are cryptic and make very little sense to me. What I observe: enrich finishes rather quickly, but shred takes abnormally long. At some point it stalls, drops a few core nodes, resizes, and then exits with errors. Console screenshot attached; it may be a massive coincidence, but it always breaks in the same place.
The errors I see (I sampled from multiple repeated lines):
2017-03-28 04:37:45,247 WARN org.apache.hadoop.hdfs.DFSClient (IPC Server handler 2 on 10020): Failed to connect to /172.30.0.144:50010 for block, add to deadNodes and continue. java.net.NoRouteToHostException: No route to host
2017-03-28 03:43:16,271 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl (AsyncDispatcher event handler): Updating application attempt appattempt_1490669380627_0008_000001 with final state: FAILED, and exit status: -100
2017-03-28 03:43:16,272 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl (AsyncDispatcher event handler): appattempt_1490669380627_0008_000001 State change from FINAL_SAVING to FAILED
2017-03-28 03:43:16,272 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl (AsyncDispatcher event handler): The number of failed attempts is 0. The max attempts is 2
2017-03-28 04:35:14,286 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger (AsyncDispatcher event handler): USER=hadoop OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1490669380627_0009 failed 2 times due to AM Container for appattempt_1490669380627_0009_000003 exited with exitCode: -1000
Failing this attempt. Failing the application. APPID=application_1490669380627_0009
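When the logs get this noisy, it can help to pull the dead-node addresses and exit codes out programmatically instead of eyeballing them. A minimal stdlib-only sketch over the sample lines above (the regexes are my assumptions; adjust them to your actual log format):

```python
import re

# Abbreviated copies of the log lines sampled above
sample = """\
2017-03-28 04:37:45,247 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /172.30.0.144:50010 for block, add to deadNodes and continue. java.net.NoRouteToHostException: No route to host
2017-03-28 03:43:16,271 INFO RMAppAttemptImpl: Updating application attempt appattempt_1490669380627_0008_000001 with final state: FAILED, and exit status: -100
2017-03-28 04:35:14,286 WARN RMAuditLogger: AM Container for appattempt_1490669380627_0009_000003 exited with exitCode: -1000
"""

# Datanodes that HDFS gave up on (NoRouteToHostException usually means the
# node is gone or a security group / firewall is blocking port 50010)
dead_nodes = set(re.findall(r"Failed to connect to /([\d.]+):\d+", sample))

# YARN exit codes: -100 is typically a container lost with its node,
# -1000 a localization failure before the container even started
exit_codes = re.findall(r"exit(?:Code|\s+status):\s*(-?\d+)", sample)

print("dead nodes:", dead_nodes)
print("exit codes:", exit_codes)
```

Counting distinct dead nodes versus distinct exit codes across a full run makes it easier to see whether one node dying cascades into the rest of the failures.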
I doubled the compute nodes and added more juice to the master node (memory and disk usage seemed to be creeping dangerously close to the redline), but it didn't make the slightest dent.
I'm at my wit's end with this one. Every attempt to re-run produces a different class of errors. Some suggest the master node is losing its mind (missing HDFS blocks); others are as cryptic as the samples above.
* Any ideas?
* I've tried to use scala common enrich/shred as a dependency for a realtime Kinesis (enriched) -> [hypothetical service] -> S3 (shredded) -> Redshift last-resort development effort, but I can't figure out how to use the library from Java. My Scala authoring skills are non-existent. Has anyone managed to develop a streaming shredder?
* Any pointers on setting up a Spark beta pipeline?
* We've recently added a few custom unstructured events, but only test events made it into the pipeline; no significant volume to speak of. I've checked that the assets (jsonpaths) are in the right place on S3 and the schemas are happily congregating in the Iglu scala server. Maybe I've missed something, new-event-wise?
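On the last point, one quick sanity check for a new unstructured event is that every `$.data.*` entry in its jsonpaths file has a matching property in the schema (a mismatch can break the load rather than enrich or shred, but it's cheap to rule out). A stdlib-only sketch with hypothetical inline contents standing in for the real S3 asset and Iglu schema:

```python
import json

# Hypothetical jsonpaths file for a custom event (illustrative names,
# not the actual asset on S3)
jsonpaths = json.loads("""
{"jsonpaths": ["$.schema.vendor", "$.schema.name",
               "$.data.user_id", "$.data.plan"]}
""")

# Hypothetical matching fragment of the Iglu schema's "properties"
schema = json.loads("""
{"properties": {"user_id": {"type": "string"},
                "plan":    {"type": "string"}}}
""")

# Keep only the data columns and strip them down to the property name
data_paths = [p.split(".")[-1]
              for p in jsonpaths["jsonpaths"] if p.startswith("$.data.")]

# Any column the jsonpaths file expects but the schema doesn't define
missing = [p for p in data_paths if p not in schema["properties"]]
print("missing from schema:", missing)
```

An empty `missing` list for each new event at least rules out a simple jsonpaths/schema drift; nested properties would need a deeper walk than this sketch does.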