RDB Loader fails after upgrading Snowplow Enrich to version 2.0.0

Hi, I'm fairly new to Snowplow.

While trying to upgrade our pipeline I've bumped snowplow-enrich to version 2.0.0 (the newest) and also upgraded snowplow-s3-loader.

What version of the RDB Loader should I use? 1.0.0? 1.1.0? 1.2.0?

Currently I have very old versions, rdb_shredder 0.13.0 and rdb_loader 0.14.0, and the EMR job fails (not a huge surprise).

I would very much appreciate any general guidelines and best practices for setting up the loading of events into Redshift:

  • Which RDB Loader version to use?
  • Should I run the process using EmrEtlRunner (or is it deprecated)?
  • How to monitor the loading process?

Hi @avi_eshel_ct,

Which version of Enrich/S3 Loader(s) did you use before?

Generally speaking, the shredder/loader versions you mention (rdb_shredder 0.13.0 and rdb_loader 0.14.0) should be compatible with the latest components. Given that, it is important to understand what error you are getting (for this you might need to dive into the EMR logs).
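
If you want a quick way to get at those logs, something like this might help. It's an untested sketch using boto3; the region and cluster id are placeholders, and it assumes your EMR cluster ships its logs to S3:

    import boto3

    # Placeholders: use your own region and the cluster id from the failed EmrEtlRunner run.
    emr = boto3.client("emr", region_name="eu-west-1")
    cluster_id = "j-XXXXXXXXXXXXX"

    # LogUri is only present if the cluster was configured to ship logs to S3.
    log_uri = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["LogUri"]

    # The failed steps are usually the quickest way to the relevant stderr.
    for step in emr.list_steps(ClusterId=cluster_id, StepStates=["FAILED"])["Steps"]:
        print(step["Name"], step["Status"]["State"])
        # Step logs normally land under <LogUri><cluster-id>/steps/<step-id>/stderr.gz
        print(f"{log_uri.rstrip('/')}/{cluster_id}/steps/{step['Id']}/stderr.gz")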

Ideally, of course, we would recommend using the latest recommended versions, which are listed in the version compatibility matrix. However, this will require some effort on your side, especially for RDB Loader: you will need to go up to 0.18.2 first, which is the last version that runs with EmrEtlRunner (EER), and then move to v1, where shredding and loading are separated. For the upgrade guides, see Snowplow RDB Loader - Snowplow Docs.

Best,

Hi, thank you for your reply.

The S3 Loader was actually the first component in the pipeline that I upgraded, from version 0.18 to 2.0.0-rc2. Without giving too much thought to compatibility, I pushed the change to our dev, staging and production environments without problems.

Recently I upgraded Stream Enrich in dev and staging, and although it's deployed and running, the downstream RDB Loader (which only runs in staging) is failing with this error:
INFO Client: Deleted staging directory hdfs://ip-10-5-215-178.ec2.internal:8020/user/hadoop/.sparkStaging/application_1631198412915_0003
Exception in thread "main" org.apache.spark.SparkException: Application application_1631198412915_0003 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1104)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/09/09 15:56:57 INFO ShutdownHookManager: Shutdown hook called

I'm not too familiar with debugging EMR, so I'm not sure whether you can make out the error from the above log.

[EDITED]
Found this error in the EMR output:
Data loading error [Amazon](500310) Invalid operation: Cannot COPY into nonexistent table nl_basjes_yauaa_context_1;
ERROR: Data loading error [Amazon](500310) Invalid operation: Cannot COPY into nonexistent table nl_basjes_yauaa_context_1;
Following steps completed: [Discover]
I guess this error explains what's missing in my Redshift schema, but I'm not sure how to create the missing table. Also, how can I make sure that all my other enrichments are supported by my Redshift cluster?

Regarding the upgrade steps (going to 0.18.2 and then to v1), can't I simply deploy the newest version “alongside” the old (current) process and then just change the S3 Loader output bucket (to the new RDB Loader input bucket)?

Hello @avi_eshel_ct,

I guess this error explains what's missing in my Redshift schema, but I'm not sure how to create the missing table. Also, how can I make sure that all my other enrichments are supported by my Redshift cluster?

You can find the DDL for this table in our iglu-central repository: https://github.com/snowplow/iglu-central/blob/master/sql/nl.basjes/yauaa_context_1.sql. Once the table is created, you will need to resume your EER job from the rdb_load step.
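
If you prefer to script the table creation, a rough sketch could look like the following (untested; the connection details are placeholders, and the raw URL is simply the raw-file form of the repository path above):

    import urllib.request

    import psycopg2  # any PostgreSQL driver works against Redshift

    # Raw-file form of the iglu-central path linked above.
    DDL_URL = (
        "https://raw.githubusercontent.com/snowplow/iglu-central/"
        "master/sql/nl.basjes/yauaa_context_1.sql"
    )
    ddl = urllib.request.urlopen(DDL_URL).read().decode("utf-8")

    # Placeholders: point this at your own Redshift cluster.
    conn = psycopg2.connect(
        host="my-cluster.xxxxxx.eu-west-1.redshift.amazonaws.com",
        port=5439,
        dbname="snowplow",
        user="storageloader",
        password="...",
    )
    with conn, conn.cursor() as cur:
        cur.execute(ddl)  # creates atomic.nl_basjes_yauaa_context_1
    conn.close()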

From RDB Loader R32 (rdb_shredder 0.16.0 and rdb_loader 0.17.0) onwards, new tables are created automatically by the loader. The same applies to changes in existing tables.
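
As for making sure your other enrichments are covered: until you are on R32+ you can simply compare the tables that already exist in the atomic schema against the contexts your enrichments emit. A minimal sketch (same placeholder connection details as above):

    import psycopg2

    # Placeholders: point this at your own Redshift cluster.
    conn = psycopg2.connect(
        host="my-cluster.xxxxxx.eu-west-1.redshift.amazonaws.com",
        port=5439,
        dbname="snowplow",
        user="storageloader",
        password="...",
    )
    with conn, conn.cursor() as cur:
        # List the shredded tables that already exist in the atomic schema.
        cur.execute(
            "SELECT tablename FROM pg_tables "
            "WHERE schemaname = 'atomic' ORDER BY tablename"
        )
        for (table,) in cur.fetchall():
            print(table)
    conn.close()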

Regarding the upgrade steps (going to 0.18.2 and then to v1), can't I simply deploy the newest version “alongside” the old (current) process and then just change the S3 Loader output bucket (to the new RDB Loader input bucket)?

In theory you can, but in practice it might be hard to put all the changes together at once. You are currently on R28, and the best path would be:

  1. Upgrade to R30 as described in this post: https://snowplowanalytics.com/blog/2018/08/28/snowplow-rdb-loader-r30-released-with-stability-improvements/#upgrading
    1.1 Bump shredder and loader
    1.2 Update target file
  2. Upgrade to R32 as described in https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/r32-upgrade-guide/
    2.1 Bump shredder, loader and AMI
    2.2 Update target file
    2.3 Deploy Iglu Server (if you don’t have one) and add it to your iglu_resolver file (see the sketch after this list)
    2.4 Find out which schemas should go into blacklistTabular, or fix the corresponding tables so they can be auto-migrated in future
  3. And then you can consider deploying v1
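
For step 2.3, the Iglu Server entry just needs to be added to the repositories array of your iglu_resolver.json. A rough sketch of doing that programmatically (the server URI, API key and vendor prefix below are placeholders for your own setup):

    import json

    with open("iglu_resolver.json") as f:
        resolver = json.load(f)

    # Placeholder Iglu Server repository entry.
    resolver["data"]["repositories"].append({
        "name": "My Iglu Server",
        "priority": 1,
        "vendorPrefixes": ["com.mycompany"],
        "connection": {
            "http": {
                "uri": "https://iglu.mycompany.com/api",
                "apikey": "00000000-0000-0000-0000-000000000000"
            }
        }
    })

    with open("iglu_resolver.json", "w") as f:
        json.dump(resolver, f, indent=2)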

Best,
