Dataflow-runner - EMR cluster not terminated after completion

Hey guys,

Sorry I’ve been smashing the forums over the last couple of weeks. Glad to say that we’ve managed to get our Snowplow pipeline working, with enriched events flowing into Snowflake.

We are ready to schedule the emr-etl-runner job and the dataflow-runner job.

There is an issue when the dataflow-runner job completes: the EMR cluster does not terminate itself for some reason and stays in the status “Waiting - Cluster Ready”.

Could I please get your help to understand why this is happening and how to make sure the cluster terminates after successful completion?

Have a great weekend!

Ryan

Hey @Ryan_Newsome,

Don’t apologise! We’re happy to see you’re engaged and asking for help!

The EMR cluster should shut down if it’s created as a transient cluster. Otherwise, it’ll be persistent and will just wait for the next job to run.

At volume, it can be more efficient to run the load jobs on a persistent cluster and load in a micro-batch style (i.e. kicking off a new job on the same cluster as soon as the last job finishes), since there’s a cost to the time the cluster takes to spin up and down again.

If you don’t need that, I believe you’ll just need to make a change in the config which creates the EMR cluster (if memory serves it’s part of your dataflow-runner configuration).

Here’s a similar thread on the topic, which might help: EMR ETL stream_enrich mode

Best,

Yep - you should be able to use the run-transient command in Dataflow runner for this.
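For example (flags from memory, so double-check them against your Dataflow Runner version’s `--help` output):

```shell
# run-transient spins up the cluster, runs the playbook and terminates
# the cluster automatically once the last step completes:
./dataflow-runner run-transient --emr-config ./cluster.json --emr-playbook ./playbook.json

# whereas the separate commands leave the cluster running between jobs
# (persistent mode, useful for micro-batching):
./dataflow-runner up --emr-config ./cluster.json            # prints a cluster ID
./dataflow-runner run --emr-playbook ./playbook.json --emr-cluster j-XXXXXXXXXXXX
./dataflow-runner down --emr-config ./cluster.json --emr-cluster j-XXXXXXXXXXXX
```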

As Colm has mentioned, there’s a small cost associated with spinning up (bootstrapping) an EMR cluster, so micro-batching is quite often cheaper and makes it easier to load more frequent batches into Snowflake. I’m not sure about your data volume, but it looks like the loader is only taking 2 minutes while the transformer is taking 30 minutes, so you could well see some performance improvements by changing the node types.

Depending on your volume of data you could likely use a smaller master node and upgrade both nodes from m2 to m4 or m5 to see some performance and cost improvements. An m5.xlarge will give you twice as many vCPUs (4) as the equivalent m2.xlarge for approximately the same cost, as well as faster networking, which will speed up copy operations between S3 and the EMR cluster.
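For reference, the instance types live in the EC2 section of cluster.json. A sketch of that fragment (field names as per the Dataflow Runner ClusterConfig schema, so check them against your own file):

```json
"ec2": {
   "amiVersion": "5.9.0",
   "instances": {
      "master": { "type": "m4.large" },
      "core":   { "type": "m5.xlarge", "count": 1 },
      "task":   { "type": "m5.xlarge", "count": 0, "bid": "0.015" }
   }
}
```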


Thanks for the info @Colm and @mike. We’ll assess having the cluster as persistent instead a bit later on. I’ll also test performance when upgrading nodes to m4/m5 (I think I tried this before and it failed for some reason, but I’ll try again).

For now we’d like to have the enrich and storage process scheduled to run a few times a day which is why we want to run it in transient mode.

The following command is being used when executing dataflow-runner; however, the EMR cluster is not terminating:

./dataflow-runner run-transient --emr-config ./config/cluster.json --emr-playbook ./config/playbook.json

Is there something else we need to be doing?

Still can’t get the cluster to terminate automatically. dataflow-runner always creates it as “After last step completes: Cluster waits”.

Do I need to pass through an additional argument in playbook.json or something?

Here is the playbook being used:

{
   "schema":"iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
   "data":{
      "region":"ap-southeast-2",
      "credentials":{
         "accessKeyId":"xxxxxxxxxx",
         "secretAccessKey":"xxxxxxxxxx"
      },
      "steps":[
         {
            "type":"CUSTOM_JAR",
            "name":"Snowflake Transformer",
            "actionOnFailure":"CANCEL_AND_WAIT",
            "jar":"command-runner.jar",
            "arguments":[
               "spark-submit",
               "--conf",
               "spark.hadoop.mapreduce.job.outputformat.class=com.snowplowanalytics.snowflake.transformer.S3OutputFormat",
               "--deploy-mode",
               "cluster",
               "--class",
               "com.snowplowanalytics.snowflake.transformer.Main",
               "s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-transformer-0.6.0.jar",
               "--config",
               "{{base64File "./config/self-describing-config.json"}}",
               "--resolver",
               "{{base64File "./config/iglu_resolver.json"}}"
            ]
         },

         {
            "type":"CUSTOM_JAR",
            "name":"Snowflake Loader",
            "actionOnFailure":"CANCEL_AND_WAIT",
            "jar":"s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.6.0.jar",
            "arguments":[
               "load",
               "--base64",
               "--config",
               "{{base64File "./config/self-describing-config.json"}}",
               "--resolver",
               "{{base64File "./config/iglu_resolver.json"}}"
            ]
         }
      ],
      "tags":[ ]
   }
}

@mike I also tried using the m5 instance types but they aren’t compatible with emr-5.9.0. Do you simply update the “amiVersion” value in the cluster.json file to resolve this?

Hey guys, just wondering if you possibly had any other ideas on things I could check to try and get the EMR cluster to terminate after a successful run of dataflow-runner? I couldn’t find anything in any of the config JSON files relating to the process.

@mike @Colm

Is dataflow-runner generating any output / logs? (If not, run with --log-level debug.) You should see some logs indicating that it is attempting to terminate the cluster (assuming no intermediate errors).
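i.e. the same command you were running, with the log level raised:

```shell
./dataflow-runner run-transient \
  --emr-config ./config/cluster.json \
  --emr-playbook ./config/playbook.json \
  --log-level debug
```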

Hi @mike,

Well, this is quite bizarre. I just re-ran the emr-etl-runner and dataflow-runner process about 5 more times, and every single time the dataflow-runner EMR cluster correctly terminated itself!

When I was testing this before and having issues, it was transforming/loading a much larger amount of enriched events (30+ GB). This is the only difference compared to the recent runs that worked (same config files, same execution command).

There are log files generated in the S3 bucket for all previous dataflow-runner jobs, including those that did not auto-terminate the EMR cluster. I looked through the logs relating to the jobs that did not auto-terminate the cluster and couldn’t see anything that stood out. Is there a certain part of a certain file I should look at? I’d love to find out why the issue was occurring previously in case it helps others.

Cheers,
Ryan