I’d like to reprocess a bunch of bad rows that I collected with the Clojure collector.
I read the JSON for every bad row that I want to replay and extracted the line value. I modified the faulty content in the line value and wrote every repaired row to a new file, in another bucket.
NB: I’m working in Python.
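To make the step above concrete, here is a minimal sketch of what my reprocessing script does. The repair logic itself is a placeholder (shown here as a trivial string substitution), and the input is shown as in-memory strings rather than files in a bucket:

```python
import json

def repair(line):
    # Placeholder for the actual fix applied to the faulty content;
    # the real repair depends on what was wrong in each row.
    return line.replace("bad", "good")

def reprocess(bad_rows):
    # Each bad row is a JSON document; extract its "line" value,
    # repair it, and collect the repaired raw lines for re-staging.
    repaired = []
    for raw in bad_rows:
        row = json.loads(raw)
        repaired.append(repair(row["line"]))
    return repaired

bad = ['{"line": "bad payload", "errors": []}']
repaired_lines = reprocess(bad)
```

In the real script, the repaired lines are then written out to a file in the new bucket, one line per row.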
So I now have a new bucket containing a reprocessing file. I changed the config-repro.yml file so that the raw "in" bucket points to my reprocessing files.
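For reference, the part of config-repro.yml I changed is roughly the following (bucket names are placeholders, and the exact layout may differ slightly between EmrEtlRunner versions):

```yaml
aws:
  s3:
    buckets:
      raw:
        in:
          - s3://my-reprocessing-bucket        # placeholder: bucket with repaired rows
        processing: s3://my-etl-bucket/processing
```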
At first, the staging step was silently failing. After I changed the log format to cloudfront, staging succeeded.
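If I recall the config layout correctly, that setting lives in the etl section; a minimal fragment (key name assumed from my setup) would be:

```yaml
etl:
  collector_format: cloudfront   # staging failed silently until this matched my files
```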
But now, I encounter another error, during the EMR flow:
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-17-210.ec2.internal:8020/tmp/27690cd4-f2f6-47cf-aa16-b623325f4bd3/files
From what I've read, this is most often caused by the in/processing bucket being empty. But mine is not: the staging was successful and I can see the file in the processing “folder”.
I’ve noticed that the Clojure collector’s log format (syntax) is quite different from the line value in the bad rows’ JSON. I’m wondering whether this is the cause of my issues and how I should handle it. Should I try to rebuild a log file with the same syntax as the Clojure collector’s?
Otherwise, does anyone have any idea what’s happening?