I have had the batch pipeline (CloudFront + EmrEtlRunner + Redshift) running successfully for a couple of months now, and recently tried out the --use-persistent-jobflow option in the hope of speeding up how often I can load into Redshift. The option, however, seems to break the run by not creating either the [archive_enriched] or the [archive_shredded] EMR step.
Recreating the error looks something like this:
- Run snowplow-emr with --use-persistent-jobflow for the first time, creating the cluster --> runs all steps and completes successfully.
- Run snowplow-emr again with --use-persistent-jobflow --> re-uses the cluster, runs all steps up to [rdb_load], then runs [archive_shredded], then finishes. It does not create the [archive_enriched] step.
- Run snowplow-emr a 3rd time with --use-persistent-jobflow --> error “There seems to be an ongoing run of EmrEtlRunner: Cannot safely add enrichment step to jobflow, s3://snowplow-emr-etl/enriched/good/ is not empty”
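For anyone trying to reproduce this, here is a minimal sketch of how one might confirm that the enriched location still holds files after the second run, which is what the error on the third run complains about. It is not part of EmrEtlRunner: the helper name `prefix_is_empty` is hypothetical, and the sketch assumes boto3 is installed and AWS credentials with list access to the bucket are configured.

```python
from urllib.parse import urlparse


def prefix_is_empty(s3_client, s3_uri):
    """Return True if no object keys exist under the given s3:// URI.

    Hypothetical helper for illustration; `s3_client` is expected to be
    a boto3 S3 client (or anything exposing list_objects_v2).
    """
    parsed = urlparse(s3_uri)
    bucket = parsed.netloc             # e.g. "snowplow-emr-etl"
    prefix = parsed.path.lstrip("/")   # e.g. "enriched/good/"
    # One key is enough to prove the prefix is not empty.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) == 0


# Example usage (assumes boto3 and valid credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   print(prefix_is_empty(s3, "s3://snowplow-emr-etl/enriched/good/"))
```

If this returns False after the second run has finished, the [archive_enriched] step was most likely never executed, which would match the behaviour described above.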
This has happened consistently over the couple of hours I have spent debugging it: EMR runs that reuse an existing cluster skip either [archive_enriched] or [archive_shredded].
Is this a known issue? Is --use-persistent-jobflow compatible with the CloudFront collector setup? I am on EmrEtlRunner version 0.34.1. Happy to provide any additional information about the issue.