[IMPORTANT ALERT] R90-R91 bug may result in shredded types not loading into Redshift after recovery


#1

Summary

In R90 and R91, if you pass --skip shred to EmrEtlRunner, for example during a recovery after a failed database load, then RDB Loader loads only atomic.events, skipping tables for self-describing events and contexts.

The relevant bug report in snowplow/snowplow is issue #3403; it was introduced in R90 and is also found in R91.

Who is affected

If you believe you could have resumed a failed pipeline in this way, the first step is to check your atomic.manifest table:

SELECT * FROM atomic.manifest WHERE etl_tstamp BETWEEN '2017-07-27’ AND '2017-08-24’ AND shredded_cardinality = 0;

Assuming 2017-07-27 was the date your pipeline was upgraded to R90+. This query returns all loads done by RDB Loader that don’t have shredded types.

If this query returns no rows, then you have not been affected by this bug - but read on to avoid being caught by this bug before you upgrade to R92+.

If this query does return rows, if having zero shredded types is uncommon for your data-pipeline and if you have you done a pipeline recovery since the upgrade, then it’s likely that you have been affected by this bug.

How to recover

The easiest way to recover is using the new RDB Loader, version 0.13.0-rc4. Using this version, you can launch a “dry run” for a specific folder in your shredded archive, which generates all of the COPY SQL statements that would be executed for a regular load into Redshift.

First, you need to download the RDB Loader to your local machine:

$  aws s3 cp s3://snowplow-hosted-assets/4-storage/rdb-loader/snowplow-rdb-loader-0.13.0-rc4.jar .

Then, the following script should allow you to launch RDB Loader from local machine:

#!/bin/bash

set -e

# Fail-fast with error-message
function die() {
    echo "$@" 1>&2 ; exit 1;
}

# Checking what uuid generator is available
if command -v uuid >/dev/null 2>&1; then
  UUID_APP=uuid
elif command -v uuidgen >/dev/null 2>&1; then
  UUID_APP=uuidgen
else
  die "Either uuid or uuidgen must be installed"
fi

JARFILE="$(pwd)/snowplow-rdb-loader-0.13.0-rc4.jar"

[ "$#" -eq 6 ] || die "Please specify paths to: config.yml, redshift.json, resolver.json, s3://archive-shredded-path, s3://log-path, run=id"

config_yml=${1}
redshift_json=${2}
resolver_json=${3}
folder_path=${4}/${6}
log_path=${5}/rdb-loader/${6}

[ ! -f $config_yml ] && die "Cannot find EmrEtlRunner config file ${config_yml}"
[ ! -f $redshift_json ] && die "Cannot find Redshift target file ${redshift_json}"
[ ! -f $resolver_json ] && die "Cannot find Iglu resolver config file ${resolver_json}"

config_yml_b64=$(cat ${config_yml} | base64 -w 0)
redshift_json_b64=$(cat ${redshift_json} | base64 -w 0)
resolver_json_b64=$(cat ${resolver_json} | base64 -w 0)

UUID=$($UUID_APP)

logkey=$log_path/$UUID

java -jar $JARFILE \
  --dry-run \
  --skip consistency_check \
  --config $config_yml_b64 \
  --target $redshift_json_b64 \
  --folder $folder_path \
  --logkey $logkey \
  --resolver $resolver_json_b64

This script passes arguments to RDB Loader as EmrEtlRunner would do it, but it’s completely safe to invoke it - it does not perform any actions with Redshift and just prints load statements to local stdout. It accepts 6 required positional arguments:

  1. Path to config.yml
  2. Path to Redshift storage target configuration. It won’t try to connect to Redshift, but will complain if configuration is invalid.
  3. Path to Iglu Resolver configuration JSON
  4. Path to shredded archive. This can be extracted from config.yml as aws.s3.buckets.shredded.archive
  5. Path to log folder. This can be extracted from config.yml as aws.s3.buckets.log, but also any available bucket can be used
  6. Run id (folder name) of flawed load, such as run=2017-08-13-14-01-11

For example:

$ shredded-recovery.sh snowplow/config.yml snowplow/targets/redshift.json snowplow/resolver.json s3://acme-snowplow/archive/shredded/good s3://acme-snowplow/logs run=2017-08-13-14-01-11

Run this and you will see listed the “Performed SQL Queries” (don’t worry, nothing really performed thanks to the dry run). Now:

  • Copy this SQL into your editor
  • Delete everything but the COPY statements for the shredded types - i.e. omit the SQL statements for atomic.events and atomic.manifest (otherwise you will double-load the events into atomic.events)

Assuming your load had only two shredded types (should be much more in real-world scenario), statements should be look like following:

BEGIN;

COPY atomic.com_snowplowanalytics_snowplow_submit_form_1 FROM 's3://acme-snowplow/archive/shredded/good/run=2017-08-13-14-01-11/shredded-types/vendor=com.snowplowanalytics.snowplow/name=submit_form/format=jsonschema/version=1-'
 CREDENTIALS 'aws_iam_role=arn:aws:iam::123454468890:role/RedshiftLoadRole' JSON AS 's3://snowplow-hosted-assets-us-east-1/4-storage/redshift-storage/jsonpaths/com.snowplowanalytics.snowplow/submit_form_1.json'
 REGION AS 'us-east-1'
 MAXERROR 1 TRUNCATECOLUMNS TIMEFORMAT 'auto'
 ACCEPTINVCHARS ;

COPY atomic.com_snowplowanalytics_snowplow_change_form_1 FROM 's3://acme-snowplow/archive/shredded/good/run=2017-08-13-14-01-11/shredded-types/vendor=com.snowplowanalytics.snowplow/name=change_form/format=jsonschema/version=1-'
 CREDENTIALS 'aws_iam_role=arn:aws:iam::123454468890' JSON AS 's3://snowplow-hosted-assets-us-east-1/4-storage/redshift-storage/jsonpaths/com.snowplowanalytics.snowplow/change_form_1.json'
 REGION AS 'us-east-1'
 MAXERROR 1 TRUNCATECOLUMNS TIMEFORMAT 'auto'
 ACCEPTINVCHARS ;

COMMIT;

Also, it’s recommended to update manifest table to reflect the new state by updating the shredded_cardinality column for the affected run:

UPDATE atomic.manifest SET shredded_cardinality = 2 WHERE etl_tstamp = '2017-07-03 06:58:25.891';

Now you have all statements RDB Loader had to perform for to load data into Redshift. You can save them into file shred-recovery.sql and load manually via psql:

$ psql -h acme-cluster.abcdefgh01xnr.us-east-1.redshift.amazonaws.com -p 5439 snowplow -U analyst -f shred-recovery-sql

After this load completed, Redshift should have a correct state with all shredded types. If you discovered more than one load with missing shredded types - it’s safe to repeat above steps while all folders are not loaded.

How to avoid this issue

If you are running R90 or R91, do not attempt to run the database load with an EmrEtlRunner command which:

  • Either explicitly includes --skip shred, or
  • Implicitly skips shred, by resuming from the rdb_load or archive_raw steps

This bug will be addressed in the upcoming R92 Maiden Castle release.


Snowplow R92 Maiden Castle released
Serializable isolation violation on table
#2

#3

Hey @anton!

I’m trying to run your bash script, but get the following error. Any ideas how to solve this?

Exception in thread "main" java.lang.NoClassDefFoundError: java/util/Base64
	at com.snowplowanalytics.snowplow.rdbloader.config.CliConfig$.base64decode(CliConfig.scala:158)
	at com.snowplowanalytics.snowplow.rdbloader.config.CliConfig$.transform(CliConfig.scala:138)
	at com.snowplowanalytics.snowplow.rdbloader.config.CliConfig$$anonfun$parse$1.apply(CliConfig.scala:99)
	at com.snowplowanalytics.snowplow.rdbloader.config.CliConfig$$anonfun$parse$1.apply(CliConfig.scala:99)
	at scala.Option.map(Option.scala:146)
	at com.snowplowanalytics.snowplow.rdbloader.config.CliConfig$.parse(CliConfig.scala:99)
	at com.snowplowanalytics.snowplow.rdbloader.Main$.main(Main.scala:33)
	at com.snowplowanalytics.snowplow.rdbloader.Main.main(Main.scala)
Caused by: java.lang.ClassNotFoundException: java.util.Base64
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	... 8 more

EDIT: Nevermind. I was using Java 1.7, and Base64 is only included in Java 1.8+