Loading from S3's enriched events to PostgreSQL

#1

Hi, how exactly do we load snowplow’s data from S3’s enriched folder to PostgreSQL?

Based on the Setup Guide, loading to PostgreSQL is automated when running snowplow-emr-etl-runner by setting the --target argument. This works for the raw data that are going to be enriched and eventually loaded to PostgreSQL.

But how about enriched data which already exists in S3? Can we do just the load-to-DB step only? I read something about using storage-loader-runner but this is deprecated method.

#2

Hi @aditya,

In order to load enriched data into a Postgres/Redshift, data needs to be shredded first (prepared to be loaded into this specific storage target). This is done by a component called RDB Shredder, orchestrated by EmrEtlRunner. Here’s a diagram of the process: https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps

I think what should work for you is following process:

  1. Copy all enriched folders you want to load from enriched.archive to enriched.good
  2. Run EmrEtlRunner, starting from RDB Shredder step

Bear in mind however that Postgres support is very-very limited and considered experimental at the moment. It does not support any shredded entities (e.g. contexts and self-describing events), but only atomic data. It also uses fairly inefficient process to load data with copying it to a local machine first, so you need to make sure none of your enriched folders exceed EMR master’s free disk space. I’d restrain from using PostgreSQL at the moment and go with Redshift instead.

Good news though is that we have proper Postgres support on mind. No ETA yet, but my hope is that it will be available in 2019.

#3

Thanks @anton for responding. This is very helpful.

  1. If I am understanding it right, do we run the EmrEtlRunner with the option --target postgresql.json --resume-from-shred?
  2. Although it is deprecated, can we still use storage-loader-runner?
  3. I don’t actually know much about shredding. I probably skipping this step but I still have my data in tsv-format in enriched.archive. May I know what this step is for?

Thanks for the advice however Redshift’s cost is beyond justification at this stage so we opt for PostgreSQL. As long as the event data (which I assume data in the format defined here https://github.com/snowplow/snowplow/wiki/canonical-event-model which I also see in the enriched.archive) can be stored properly in PostgreSQL, I am good.

I am so so glad that Snowplow is going to have a more proper support of PostgreSQL in mind.

#4

Hi, if I re-run using the resume-from shred, I got error at archive_raw step. This is basically the error:

Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-20-168.ap-southeast-1.compute.internal:8020/tmp/2502e778-27bc-4379-bee5-79d931acbcd5/files

Any idea?

#5

A reminder to myself and whoever may encounter this problem in future.

Here is what I did to make it work:

  • resume from RDB shredder (step 7 in Batch pipeline wiki) by setting switch --resume-from shred.
  • Then the EMR will fail at archive_raw steps because it is trying to move files from raw:processing (which is non-existent since I am re-moving data I have processed) to raw:archive.
  • Resume from rdb load by setting targets switch and --resume-from rdb_load switch properly. to continue the loading to postgresql.
  • Doing --skip archive_raw did not work because --skip and --resume-from switch options are mutually exclusive.

Kindly let me know though if this is the best practice to do a load of already-processed data to relational database.