Disable shredding on EMR


#1

In my workflow I use neither Postgres nor Redshift as a storage target. I just download the enriched events from S3 and use the Python SDK to work with them.

So it would be perfectly fine if emr-etl-runner stopped right after enrichment and skipped shredding and everything after it. Is this possible? Running it with the --skip shred option results in the error: No run folders in [s3n://splw-company-out/shredded/good/] found


#2

Interesting… I’ve never tried skipping shredding since I started using it.

At which point are you receiving the error?

Perhaps the shred folders are just required as placeholders. Have you got placeholders set?


#3

@pocin, skipping just shred is not sufficient. Instead, you need to skip shred,rdb_load,archive_shredded.
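
If you launch the runner from a script, here is a minimal sketch of how the skip list above could be appended to an existing command line. Only the --skip option and the three step names come from this thread; the helper name with_skip, the binary path and the config.yml argument are placeholders I made up for illustration, so substitute whatever you already pass to your version of the runner.

```python
import shlex

# Steps named in this thread; skipping only "shred" was not enough.
STEPS_TO_SKIP = ["shred", "rdb_load", "archive_shredded"]


def with_skip(base_command, steps=STEPS_TO_SKIP):
    """Return a copy of base_command with a --skip option appended.

    The steps are joined with commas and no spaces, matching the
    form used in this thread (shred,rdb_load,archive_shredded).
    """
    return list(base_command) + ["--skip", ",".join(steps)]


# Assumption: placeholder invocation, replace with your own path and arguments.
base = ["./snowplow-emr-etl-runner", "--config", "config.yml"]
command = with_skip(base)

# Print the full command line so it can be checked before running it.
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run as-is; printing it first makes it easy to confirm that the skip list is a single comma-separated argument.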


#4

@robkingston the error occurred even before the EMR cluster started in AWS, so I guess it comes from something like a pre-flight check.

@ihor Aha, that makes sense!
For my future reference, this page would have helped the past me: https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps

Thanks a lot :beers:, there is so much to wrap my head around :slight_smile: