Snowflake Loader Documentation - Version incompatibilities and manifest references before definition

Hi!

We’re back with renewed energy in our effort to upgrade our Snowplow system. We think the collector part is solved and have now moved on to the Snowflake Loader. While reading the official documentation, I’ve noticed a couple of things that lead to confusion unless you take the time to understand the code.

  1. The documentation recommends using "amiVersion": "5.9.0" together with version 0.8.2 of s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.8.2.jar. However, this combination seems to give the error "java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)".

This is also the case for the setup instructions: "wget https://github.com/snowplow-incubator/snowplow-snowflake-loader/releases/download/0.8.1/snowplow-snowflake-loader-0.8.2.jar", which mixes 0.8.1 and 0.8.2.

I noticed, however, that there is a version 0.9.0 that implicitly recommends AMI version 6.4.0 and seems to resolve that error (see the sketch after this list), although I haven’t been able to run it end to end yet.

  2. The playbook.json references events_manifest.json; however, this is not introduced until you read the Cross-batch deduplication page. That page also does not explicitly state that you have to create this table manually, but I guess this is the case?
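For what it’s worth, here is roughly what I expect the playbook’s loader step to point at once everything is on 0.9.0. This is just a sketch: the "jar" field name is assumed from the usual Dataflow Runner playbook layout, and the path is simply the 0.8.2 one quoted above with the version bumped.

         "jar": "s3://snowplow-hosted-assets/4-storage/snowflake-loader/snowplow-snowflake-loader-0.9.0.jar",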

Hi @medicinal-matt,

Thanks for the report - I’ll bump versions and make the purpose of events_manifest.json clearer.

But to answer your question - you’re right, events_manifest.json is optional and used only for cross-batch deduplication, so you can omit it.
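Concretely, that means you can drop the pair of arguments in the transformer step that pass it - something like the following sketch (I’m assuming it is passed the same base64File way as the config; the path is illustrative):

               "--events-manifest",
               "{{base64File "./events_manifest.json"}}",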

Also, AMI 6.4.0 is recommended for 0.9.0 apps (which means you don’t need the --conf option - I’ll fix it as well): Snowplow Snowflake Loader 0.9.0 released


Looks better already!

> which means you don’t need the --conf option - I’ll fix it as well

What is meant by this?

If you leave out the EMR cluster config (cluster.json), you get "--emr-config needs to be specified".

If you leave it out of the transformer step in playbook.json, you get

Missing expected flag --config!

Usage: snowplow-snowflake-transformer --config --resolver [--inbatch-deduplication] [--events-manifest ] [--s3a]

Not sure if you have finished your planned changes yet, but the cluster config still says

      "ec2":{
         "amiVersion":"5.9.0",

Another thing I noticed: in playbook.json it is called

               "--config",
               "{{base64File "./config.json"}}",

but in an earlier step it is called /path/to/self-describing-config.json \

Maybe it would be clearer if it had the same name in both places?
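For example, the playbook could reference the same filename as the earlier step - just a sketch, and the exact path is whatever you saved the self-describing config as:

               "--config",
               "{{base64File "./self-describing-config.json"}}",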


And another thing, as mentioned by this guy:

--s3Endpoint set to s3.amazonaws.com defaults to region us-east-1, causing the error

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'us-west-2'

You need to manually change the s3Endpoint for S3DistCp in playbook.json to your own region - in our case, s3-eu-west-1.amazonaws.com.
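So in the S3DistCp step arguments of playbook.json the endpoint pair ends up looking like this (a sketch for eu-west-1; the rest of the step stays as it is):

               "--s3Endpoint",
               "s3-eu-west-1.amazonaws.com",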

@anton: Maybe the combination of AMI 6.4.0 and 0.9.0 isn’t quite there either. Now I get

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.snowplowanalytics.snowflake.transformer.S3OutputFormat not found

in my containers/application_1642171377915_0002/container_1642171377915_0002_01_000001/stderr

The steps/s-2RJYVHBLPA2JA/stderr file says

Exception in thread "main" org.apache.spark.SparkException: Application application_1642171377915_0002 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1253)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1645)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:959)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1047)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:

@medicinal-matt, yup, that’s what I meant by the --conf parameter.

There’s this part in your playbook:

               "--conf",
               "spark.hadoop.mapreduce.job.outputformat.class=com.snowplowanalytics.snowflake.transformer.S3OutputFormat",

And it was necessary only pre-0.9.0.
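In other words, for 0.9.0 you can simply delete those two lines from the transformer step arguments and keep only the flags you actually need - roughly like this (a sketch; I’m assuming the resolver is passed the same base64File way as the config):

               "--config",
               "{{base64File "./config.json"}}",
               "--resolver",
               "{{base64File "./resolver.json"}}",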


Nice! That seems to fix the issues!

Now I have some "Error assuming AWS_ROLE", but I’ll see if I can solve that - otherwise it is a topic for another thread.
