Error: Directory Already Exists when running Snowflake transformer



When I run the dataflow-runner to run the Snowflake Pipeline I receive the following error:

8/03/20 13:55:36 ERROR ApplicationMaster: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory <name of directory> already exists

Is there something in the configuration files that must be set?


Hi @llabe027 - can you talk us through how you have set this up?


I am running a Clojure Collector with Google tags to track JavaScript events. We are trying to create a Snowplow + Snowflake pipeline. I have an S3 bucket for this that contains folders for each piece of the pipeline. The snowplow-emr-etl-runner works for enriching the data. However, when I run dataflow-runner for the Snowflake transformer and loader I get:
Output directory s3a://<bucketName>/<snowflakeFolder>/data/run=2018-03-16-10-13-56 already exists

In order for the transformer to run properly, I have to delete all the folders that are currently in s3a://<bucketName>/<snowflakeFolder>/data/


What are you using for your Dataflow Runner playbook?


Is this what you are referring to?

            "name":"Snowflake Transformer",
               "{{base64File "./loader.json"}}",
               "{{base64File "./iglu_resolver.json"}}"

            "name":"Snowflake Loader",
               "{{base64File "./loader.json"}}",
               "{{base64File "./iglu_resolver.json"}}"
      "tags":[ ]


If this makes a difference, I noticed that the loader and transformer were both 0.3.0 in the playbook.json file when I was running dataflow-runner. I just switched the version before I sent it to you.


Hey @llabe027,

This isn’t something we’ve seen before. Also your playbook looks correct and switching to 0.3.1 shouldn’t have any unexpected effect.

However, I’m wondering if you’re trying to use a persistent cluster? I.e. the common Snowflake Loader architecture assumes that for each run a new cluster is bootstrapped and then destroyed after finishing its job.

Also, what’s the directory behind <name of directory>? Is it your archive on S3 or (presumably) an HDFS path?


The <name of directory> is the S3 bucket where the Snowflake data is stored. I have two separate folders in the bucket, one for the Snowplow data and the other for Snowflake. In the snowplow/data/archive directory, it appears that directories for each run are created and remain until deleted. The same is true of the snowflake/data directory: I have to delete the directories created after each run.


Does that mean you have already processed data before, and this problem has only just appeared?

Also, is <name of directory> something like s3://mybucket/snowflake/data/archive/ or s3://mybucket/snowflake/data/archive/run=2018-03-22-10-00-00/? It seems the problem is simply that the folder really does exist.


<name of directory> is like s3://mybucket/snowflake/data/archive/run=2018-03-22-10-00-00/. And yes, it seems to be because the folder exists. Should I be configuring the bucket to delete old runs?


Technically, you can add an aws s3 rm statement to your launch script after dataflow-runner. However, I’m now more confused about why the Transformer tries to re-process it. Could you please share what the DynamoDB manifest record for this s3://mybucket/snowflake/data/archive/run=2018-03-22-10-00-00/ looks like?
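For anyone who wants to script that cleanup, here is a minimal sketch of a launch wrapper that runs dataflow-runner and then removes one run folder with `aws s3 rm --recursive`. The function names, the config and playbook file names, and the guard that only accepts `run=...` folders are all made up for illustration; it also assumes both the `aws` and `dataflow-runner` binaries are on the PATH. By default it passes `--dryrun`, so the AWS CLI only lists what it would delete.

```python
import subprocess
from typing import List


def s3_rm_command(run_folder: str, dry_run: bool = True) -> List[str]:
    """Build an `aws s3 rm` call that deletes a single run folder recursively."""
    # The Hadoop error reports s3a:// URIs, but the AWS CLI expects s3://.
    if run_folder.startswith("s3a://"):
        run_folder = "s3://" + run_folder[len("s3a://"):]
    last_part = run_folder.rstrip("/").rsplit("/", 1)[-1]
    if not run_folder.startswith("s3://") or not last_part.startswith("run="):
        # Hypothetical safety check: refuse anything that is not one run=...
        # folder, so a typo cannot wipe the whole data directory.
        raise ValueError(f"not a run folder: {run_folder}")
    cmd = ["aws", "s3", "rm", run_folder, "--recursive"]
    if dry_run:
        cmd.append("--dryrun")  # list the objects instead of deleting them
    return cmd


def launch(run_folder: str, emr_config: str, playbook: str,
           dry_run: bool = True) -> None:
    """Run the playbook on a transient cluster, then clean up the run folder."""
    runner = ["dataflow-runner", "run-transient",
              "--emr-config", emr_config,
              "--emr-playbook", playbook]
    # check=True raises if dataflow-runner fails, so the cleanup is skipped.
    subprocess.run(runner, check=True)
    subprocess.run(s3_rm_command(run_folder, dry_run), check=True)
```

A call would look something like `launch("s3a://<bucketName>/<snowflakeFolder>/data/run=2018-03-16-10-13-56/", "cluster.json", "playbook.json")`, with the placeholders filled in. Switch to `dry_run=False` only once the dry-run listing shows exactly the objects you expect, since the deletion cannot be undone.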


Is this what you are referring to?

    AddedAt (Number):	1521554135
    AddedBy (String):	0.3.0
    RunId (String):	snowplow/data/archive/enriched/run=2018-03-16-10-13-56/
    ToSkip (Boolean): false


Yep, that’s what I’m referring to. Thanks. I’ll try to figure out how it is possible that the transformer is trying to overwrite the directory.

Right now, you can safely delete the existing directory - from the manifest record I can tell that it was not loaded yet.
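To make that reading of the manifest record concrete, here is a rough sketch that takes the item pasted above as a plain dict (reading the attributes as AddedAt, AddedBy, RunId and ToSkip, with the DynamoDB type names dropped) and checks it for later-stage markers. The `ProcessedAt` and `LoadedAt` attribute names are assumptions about what the transformer and loader would add, not something shown in this thread, and treating `AddedAt` as Unix epoch seconds is also an assumption.

```python
from datetime import datetime, timezone

# The manifest item from this thread, copied into a plain dict.
record = {
    "RunId": "snowplow/data/archive/enriched/run=2018-03-16-10-13-56/",
    "AddedAt": 1521554135,
    "AddedBy": "0.3.0",
    "ToSkip": False,
}

# ASSUMPTION: attribute names that would mark the later pipeline stages.
PROCESSED_KEY = "ProcessedAt"
LOADED_KEY = "LoadedAt"


def added_at_utc(item: dict) -> datetime:
    """Interpret AddedAt as Unix epoch seconds (an assumption) in UTC."""
    return datetime.fromtimestamp(item["AddedAt"], tz=timezone.utc)


def run_state(item: dict) -> str:
    """Classify a manifest item by which stage markers are present."""
    if LOADED_KEY in item:
        return "loaded"
    if PROCESSED_KEY in item:
        return "processed"
    return "added"


def safe_to_delete_output(item: dict) -> bool:
    """A run that was never loaded can have its output folder removed."""
    return run_state(item) != "loaded"


print(added_at_utc(record).isoformat())  # 2018-03-20T13:55:35+00:00
print(run_state(record))                 # added
print(safe_to_delete_output(record))     # True
```

The decoded `AddedAt` falls one second before the 13:55:36 timestamp on the error line in the first post, provided that log is in UTC. That would be consistent with the record having been written during the failing run, though that is only a guess from the timestamps.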