Snowflake Loader Documentation - Version incompatibilities and manifest references before definition

Hi!

We’re back with renewed energy in our effort to upgrade our Snowplow system. We think the collector part is solved and have now moved on to the Snowflake Loader. While reading the official documentation, I’ve noticed a couple of things that lead to confusion unless you take the time to understand the code.

  1. The documentation recommends using "amiVersion": "5.9.0" together with version 0.8.2 of s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.8.2.jar. However, this combination seems to give the error "java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)".

This is also the case for the setup instructions: "wget https://github.com/snowplow-incubator/snowplow-snowflake-loader/releases/download/0.8.1/snowplow-snowflake-loader-0.8.2.jar", which mixes 0.8.1 and 0.8.2.

I noticed, however, that there is a version 0.9.0 that implicitly recommends AMI version 6.4.0 and seems to resolve that error (see the sketch after this list), although I haven’t been able to run it end to end yet.

  2. The playbook.json references events_manifest.json; however, this is not introduced until you read the Cross-batch deduplication page. That page also does not explicitly state that you have to create this table manually, but I guess this is the case?
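For what it’s worth, here is roughly what I expect the playbook’s loader step to point at once everything is on 0.9.0. This is just a sketch: the "jar" field name is assumed from the usual Dataflow Runner playbook layout, and the path is simply the 0.8.2 one quoted above with the version bumped.

         "jar": "s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.9.0.jar",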

Hi @medicinal-matt,

Thanks for the report - I’ll bump versions and make the purpose of events_manifest.json clearer.

But to answer your question - you’re right, events_manifest.json is optional and used only for cross-batch deduplication, so you can omit it.
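Concretely, that means you can drop the pair of arguments in the transformer step that pass it - something like the following sketch (I’m assuming it is passed the same base64File way as the config; the path is illustrative):

               "--events-manifest",
               "{{base64File "./events_manifest.json"}}",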

Also, AMI 6.4.0 is recommended for 0.9.0 apps (which means you don’t need the --conf option - I’ll fix it as well): Snowplow Snowflake Loader 0.9.0 released


Looks better already!

> which means you don’t need the --conf option - I’ll fix it as well

What is meant by this?

If you leave out the EMR cluster config (cluster.json), you get "--emr-config needs to be specified".

If you leave it out of the transformer step in playbook.json, you get

Missing expected flag --config!

Usage: snowplow-snowflake-transformer --config --resolver [--inbatch-deduplication] [--events-manifest ] [--s3a]

Not sure if you have finished your planned changes yet, but the cluster config still says

      "ec2":{
         "amiVersion":"5.9.0",

Another thing I noticed: in playbook.json it is called

               "--config",
               "{{base64File "./config.json"}}",

but in an earlier step it is called /path/to/self-describing-config.json \

Maybe it would be clearer if it had the same name in both places?
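For example, the playbook could reference the same filename as the earlier step - just a sketch, and the exact path is whatever you saved the self-describing config as:

               "--config",
               "{{base64File "./self-describing-config.json"}}",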


And another thing, as mentioned by this guy:

--s3Endpoint set to s3.amazonaws.com defaults to region us-east-1, causing the error

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'us-west-2'

You need to manually change the s3Endpoint for S3DistCp in playbook.json to your own region - in our case, s3-eu-west-1.amazonaws.com.
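So in the S3DistCp step arguments of playbook.json the endpoint pair ends up looking like this (a sketch for eu-west-1; the rest of the step stays as it is):

               "--s3Endpoint",
               "s3-eu-west-1.amazonaws.com",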

@anton: Maybe the combination of AMI 6.4.0 and 0.9.0 isn’t quite there either. Now I get

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.snowplowanalytics.snowflake.transformer.S3OutputFormat not found

in my containers/application_1642171377915_0002/container_1642171377915_0002_01_000001/stderr

The steps/s-2RJYVHBLPA2JA/stderr file says

Exception in thread "main" org.apache.spark.SparkException: Application application_1642171377915_0002 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1253)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1645)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:959)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1047)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:

@medicinal-matt, yup, that’s what I meant by the --conf parameter.

There’s this part in your playbook:

               "--conf",
               "spark.hadoop.mapreduce.job.outputformat.class=com.snowplowanalytics.snowflake.transformer.S3OutputFormat",

And it was necessary only pre-0.9.0.
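In other words, for 0.9.0 you can simply delete those two lines from the transformer step arguments and keep only the flags you actually need - roughly like this (a sketch; I’m assuming the resolver is passed the same base64File way as the config):

               "--config",
               "{{base64File "./config.json"}}",
               "--resolver",
               "{{base64File "./resolver.json"}}",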


Nice! That seems to fix the issues!

Now I have some "Error assuming AWS_ROLE", but I’ll see if I can solve that - otherwise it is a topic for another thread.
