Event recovery with Scala Stream Collector



I’m trying to reprocess events in the enriched/bad buckets following this tutorial.

The Hadoop event recovery job runs fine, but when I launch the EmrEtlRunner job against my recovered-events buckets, the EMR job fails at the “Elasticity S3DistCp Step: Enriched HDFS -> S3” step:
Error running job ... Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-20-57-52.us-east-2.compute.internal:8020/tmp/847d0f19-e2e7-4a86-8f22-2e2db445a993/files

I’m guessing this is because the enrich step expects files in LZO format, whereas the recovered events are base64-encoded lines in .txt files.
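One way to sanity-check that guess before rerunning EMR would be to download one of the recovered files and see whether its lines decode as base64. The snippet below is only a rough sketch: the function names and the sample lines are hypothetical and not part of Snowplow. It is also only a heuristic, since a short plain-text token can happen to be valid base64.

```python
import base64
import binascii


def looks_like_base64(line: str) -> bool:
    """Return True if the line is non-empty and decodes as strict base64."""
    stripped = line.strip()
    if not stripped:
        return False
    try:
        # validate=True rejects characters outside the base64 alphabet
        # (tabs, spaces, punctuation) instead of silently skipping them.
        base64.b64decode(stripped, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


def fraction_base64(lines) -> float:
    """Share of non-empty lines that decode as strict base64."""
    non_empty = [line for line in lines if line.strip()]
    if not non_empty:
        return 0.0
    return sum(looks_like_base64(line) for line in non_empty) / len(non_empty)


if __name__ == "__main__":
    # Hypothetical sample: one base64 line and one tab-separated line.
    sample = ["aGVsbG8=", "field1\tfield2"]
    print(fraction_base64(sample))
```

If nearly every line of a recovered file passes this check, the files are base64 text rather than the LZO input I suspect the enrich step wants.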

Is there a way to make the event recovery job work with events collected using the Scala Stream Collector?