Configuring the playbook.json for loading Snowflake


#1

Hello,

I am trying to setup snowplow and snowflake. I am following the guide but I dont understand how to properly setup the playbook.json. Is there a specific config and resolver that needs to be created for the transformer and loader? Are they different files for each?

Thank you


#2

Hi @llabe027,

Transformer and loader should use single central config file. They use different properties from it, but nevertheless values should pass validation.

Just to make sure - am I right that you already have a successfully running enrich job? I’m asking because setting up main snowplow pipeline (without shredding and Redshift loading) would be a first logical step, Snowflake can catch-up later without problems. Also, having running pipeline you’ll get rough idea what should be in snowflake configuration JSON (e.g. what are stageUrl and input).

Please, let us know if you have any questions about particular properties or ideas how we can improve guide.

UPD: resolver config also should be same for both applications. You can use a standard one.


#3

Hello,

What do you mean “have a successfully running enrich job”? Are you referring to having been able to run the snowplow-emr-elt-ruuner? If so, yes, I have had that run correctly. To stop the program from running the shreddering process, do I just remove “shredded” section from the config.json?


#4

Hi @llabe027,

No, the configuration file will remain the same. You need to specify --skip option with all steps which you want to exclude.

If it’s a shred step, it will look so:

--skip shred,rdb_load,archive_shredded

All available options are described here.

Note, if you decide to load the data into Redshift later, you will need to run the shred step on enriched data separately.

Hope this helps.


#5

Just a heads-up, @llabe027, you technically don’t need to stop shredding process - but it is useless if you want to load data only in Snowflake.

Snowplow pipeline consists of three (actually more, but this is irrelevant now) steps: Enrich, Transform, Load. Where Enrich is always primary and common step, no matter what you’re going to do with your data: load to particular DB, process with Spark or Athena or anything else.

EmrEtlRunner by default launches Enrich and then RDB Shredder (Transform step) and RDB Loader (Load step, obviously) to load data to Redshift. But if you’re planning to load only to Snowflake, you can get away without RDB Shredder/Loader as @egor mentioned and instead launch Snowflake Transformer and Snowflake Loader, which will use same enriched data. Difference is that Snowflake loading cannot be launched via EmrEtlRunner, but instead dataflow-runner should be used, which is more generic (nothing snowplow-specific) tool to orchestrate jobs.

Also, am I right that you use Setup Guide from our wiki?


#6

@anton, yes you are correct. That is what I am using that guide. I am currently on the sub-section Setting up Snowflake. I am receiving the following when I am run: java -jar snowplow-snowflake-loader-0.3.1.jar
setup
–config loader.json
–resolver iglu_resolver.json

Exception in thread “main” net.snowflake.client.jdbc.SnowflakeSQLException: Cannot perform CREATE SCHEMA. This session does not have a current database. Call ‘USE DATABASE’, or use a qualified name.

Going through the logs to determine the problem


#7

@llabe027, just to cite the guide:

Two things you need to create manually are Snowflake database and DynamoDB table. After database is created, you can run setup.

So, setup complains that it cannot create a schema, because database that you specified in config does not exist itself. You need to create a database first through Snowplow web console. And then setup will be able to deploy all other necessary entities.


#8

It seemed to have been a mix of a permissions and schema name problem. in the “schema” I had to put “<NAME_OF_DB>.atomic”


#9

@anton is there somewhere to manually download the transformer jar? This is because when I run the dataloader, the following appears at the in the logs:

Warning: Skip remote jar s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-transformer-0.3.1.jar.


#10

Well, all our EMR assets are available at snowplow-hosted-assets S3 bucket (s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.3.1-rc1.jar), but I’m not sure what you can to do with it - it needs to be downloaded by EMR cluster, you cannot run it on local machine. Is there anything else in logs? This message doesn’t look like critical error for me, but rather some warning.


#11

Apologies for the delay.

The Snowflake transformer fails before completing the task. Here is the output.

stderr.txt output:

18/02/26 13:55:35 INFO Client: 
 client token: N/A
 diagnostics: Application application_1519652783564_0001 failed 1 times due to AM Container for appattempt_1519652783564_0001_000001 exited with  exitCode: 1
Diagnostics: Exception from container-launch.
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
	at org.apache.hadoop.util.Shell.run(Shell.java:479)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1519653324466
	 final status: FAILED
	 user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1519652783564_0001 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/02/26 13:55:35 INFO ShutdownHookManager: Shutdown hook called
18/02/26 13:55:35 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-691cdfa8-757c-4906-801b-24d05a1f2c45
Command exiting with ret '1'

stdout.txt

Illegal base64 character 2e


#12

I think problem is that you’re trying to pass one of configurations (resolver.json or config.json) as non-base64-encoded strings. Dataflow Runner’s base64File function makes sure of that. Though it seems our config file from documentation misses that. Transformer step should look like:

         {
            "type":"CUSTOM_JAR",
            "name":"Snowflake Transformer",
            "actionOnFailure":"CANCEL_AND_WAIT",
            "jar":"command-runner.jar",
            "arguments":[
               "spark-submit",
               "--deploy-mode",
               "cluster",
               "--class",
               "com.snowplowanalytics.snowflake.transformer.Main",

               "s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-transformer-0.3.1.jar",

               "--config",
               "{{base64File "./config.json"}}",
               "--resolver",
               "{{base64File "./resolver.json"}}"
            ]
         },

Fixed in docs.


#13

Does it also need the “–base64”, argument?


#14

Nope. Transformer should always be launched on EMR and therefore arguments can be passed only as base64-decoded strings. Whereas Loader can also be launched on local machine with plain file paths.