RDB Loader can hang for many hours


#1

Hello,
I’m trying to set up a test Snowplow pipeline with JSTracker -> scala collector -> kinesis -> s3 -> redshift, with emr-etl on version r92 and RDB Loader 0.13.0.
I have some (8) custom self-describing event types. I used igluctl to create the tables in Redshift and also the jsonpath files (I had to rename the jsonpath files, because my events contain a dash, but all good now). My buffer limits are very low for testing purposes, but I have fewer than 100 files under 50KB in both the enriched/good/* and shredded/good/* folders.
RDB Loader finally no longer throws an error (I’m running with the flag -f rdb_load), but it has been running for 2 hours now and still hasn’t finished.

I found this issue mentioned here: https://github.com/snowplow/snowplow-rdb-loader/issues/26

I wanted to ask whether I could be doing something wrong, or whether this is normal. Does the run time increase only marginally as the number of events grows? What’s the average run time of this step for you guys? Are there any recommendations on how to reduce it? (It shouldn’t take that long for ~1000 events.)

Thank you,
Cheers
Filip


#2

Hello @filipgerat,

Does it hang before or after the load (after, as in the ticket you refer to)? You can check this with the following steps:

  1. Download RDB Loader to your local machine: aws s3 cp s3://snowplow-hosted-assets/4-storage/rdb-loader/snowplow-rdb-loader-0.13.0.jar .
  2. Check the actual command of the RDB Loader step in the EMR console (EMR console -> expand job id -> look into “arguments” of the “RDB Loader” step). It should be a lengthy command with your base64-encoded config (be aware: it can contain sensitive data).
  3. Run java -jar snowplow-rdb-loader-0.13.0.jar with the same arguments from your local machine, but with the --dry-run option added.
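
Steps 2 and 3 above could be scripted roughly as follows. This is only a sketch: the helper names (decode_if_base64, dry_run_command), the "--config" argument name and the sample config text are hypothetical placeholders, not taken from RDB Loader itself. Copy the real arguments from the EMR console. The decoding helper is just a way to see what a base64-encoded argument contains before you paste it anywhere.

```python
import base64
import binascii


def decode_if_base64(value):
    """Return the decoded text if value is valid base64-encoded UTF-8, else None."""
    try:
        return base64.b64decode(value, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return None


def dry_run_command(step_args, jar="snowplow-rdb-loader-0.13.0.jar"):
    """Build the local command: same arguments as the EMR step, plus --dry-run."""
    cmd = ["java", "-jar", jar] + list(step_args)
    if "--dry-run" not in cmd:
        cmd.append("--dry-run")
    return cmd


# Hypothetical arguments for illustration only; use the ones shown
# under "arguments" of the "RDB Loader" step in the EMR console.
example_args = [
    "--config",
    base64.b64encode(b"region: us-east-1").decode("ascii"),
]

# Inspect what an encoded argument actually contains (may be sensitive).
print(decode_if_base64(example_args[1]))

# The command to run locally, e.g. via subprocess.run(cmd, check=True).
print(dry_run_command(example_args))
```

The command is only printed here rather than executed, so you can review it before running it against your own setup.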

With a dry run, RDB Loader discovers all the data you have on S3 and generates load statements for it without any real DB IO. This helps narrow down the possible scope of the bug. If it doesn’t hang on a dry run, then the bug is likely in the tracker inside RDB Loader (do you use Snowplow monitoring, by the way?).


#3

Hello @anton,
Thank you very much for your response, you’ve helped me a lot! :slight_smile:
It was my fault: I had not committed the creation of the com_snowplowanalytics_snowplow_ua_parser_context_1 table. Now it works, my events land in Redshift, and the RDB load took 1 minute.

No, I don’t have monitoring set up, as I haven’t gotten that far yet. I will probably set it up only when we plan to implement Snowplow tracking in production.

Thanks again,
Cheers,
Filip


#4

Glad you figured it out @filipgerat! Hopefully RDB Loader will be able to create tables for you soon.