Disable shredding on EMR

pocin · September 4, 2018, 10:14am

In my workflow I use neither Postgres nor Redshift as a storage target. I just download the enriched events from S3 and use the Python SDK to work with them.

So it would be perfectly fine if emr-etl-runner ended right after enrichment and skipped shredding and everything after it. Is this possible? Running it with the --skip shred option results in the error: No run folders in [s3n://splw-company-out/shredded/good/] found

robkingston · September 4, 2018, 11:52am

Interesting… I’ve never tried skipping shredding (after I started using it).

At which point are you receiving the error?

Perhaps the shred folders are just required as placeholders. Have you got placeholders set?

ihor · September 4, 2018, 4:23pm

@pocin, skipping shred alone is not sufficient. You need to skip all three steps: shred,rdb_load,archive_shredded.
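
For anyone scripting this, here is a minimal sketch of how the skip list above could be assembled into a runner invocation. The step names and the --skip option come from this thread; the binary path, the run subcommand, the --config and --resolver options, and the file names are assumptions that depend on your emr-etl-runner release, so check them against your own setup. The helper build_runner_command is hypothetical and only builds the argument list, it does not start anything.

```python
import shlex

# Steps to skip so the run ends after enrichment (step names from this thread).
SKIP_STEPS = ["shred", "rdb_load", "archive_shredded"]


def build_runner_command(config_path, resolver_path, skip_steps=SKIP_STEPS,
                         binary="./snowplow-emr-etl-runner"):
    """Return the argument list for an emr-etl-runner invocation.

    Assumption: the binary path, the `run` subcommand and the
    --config / --resolver options match your installed release.
    """
    return [
        binary, "run",
        "--config", config_path,
        "--resolver", resolver_path,
        # --skip takes one comma-separated value with no spaces.
        "--skip", ",".join(skip_steps),
    ]


if __name__ == "__main__":
    # Hypothetical file names; print the command instead of running it.
    cmd = build_runner_command("config.yml", "iglu_resolver.json")
    print(" ".join(shlex.quote(part) for part in cmd))
```

The list form can be passed to subprocess.run once the options are confirmed for your version.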

pocin · September 5, 2018, 8:24am

@robkingston the error occurred even before the EMR cluster started in AWS, so I guess it comes from something like a pre-flight check.

@ihor Aha, that makes sense!
For future reference, this page would have helped the past me: https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps

Thanks a lot, there is so much to wrap my head around.
