Migrating from PostgreSQL to Redshift data store

Hi, I am moving from PostgreSQL to Redshift as my data store and would like to ask for suggestions on how best to do this.

I am planning to do one of the following:

  1. Dump all PostgreSQL DB data (pg_dump dbname > dumpfile) and restore it to Redshift (psql dbname < dumpfile).
  2. Move all the runs in the shredded:archive folder to shredded:good and re-run the storage loading process in EmrEtlRunner using --resume-from rdb_load (or step 13 in the Batch Pipeline Diagram). Do I have to move runs from enriched:archive to enriched:good too if I use this method?

I would like to know which method is best to use, as well as the advantages and disadvantages of both.

Thank you!

@aditya, I haven’t been involved in this kind of migration and thus cannot comment on the best option. I would just like to comment on the 2nd option.

All the archived shredded files would need to be in a single run=... folder when resuming from shred. Thus you either move one run folder at a time or combine the whole archive into a single run folder (if feasible, depending on the total volume). Make sure the folder structure inside the run folder is maintained.
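To make the "move one run folder at a time, keeping its structure" point concrete, here is a minimal local sketch using a temporary directory. All paths and run IDs are made up for illustration; on S3 you would do the equivalent with something like `aws s3 mv --recursive s3://bucket/shredded/archive/run=.../ s3://bucket/shredded/good/run=.../` instead of `mv`:

```shell
# Simulate the archive layout locally (paths/run IDs are hypothetical)
workdir=$(mktemp -d)
mkdir -p "$workdir/shredded/archive/run=2019-01-01-00-00-00/atomic-events"
touch "$workdir/shredded/archive/run=2019-01-01-00-00-00/atomic-events/part-00000"
mkdir -p "$workdir/shredded/good"

# Move the whole run folder in one go; the inner structure
# (atomic-events/... etc.) is preserved automatically
mv "$workdir/shredded/archive/run=2019-01-01-00-00-00" "$workdir/shredded/good/"

ls "$workdir/shredded/good/run=2019-01-01-00-00-00/atomic-events"
```

Moving the run folder as a unit (rather than individual files) is what keeps the nested structure intact.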

You do not have to move the archived enriched data, but you might want to skip archive_enriched, as the job might fail at that step if no files are found in the enriched good bucket.

Hi @ihor, thanks for the suggestion. I also realized I still have run=... folders in my shredded:good bucket, each with a folderName_$folder$ file which I assume flags that the particular folder has been processed. Should I keep them for the run=... folders that I want to process, or should I delete these folders first?

Also, I am wondering if you can let me know what these folderName_$folder$ files actually do.

@aditya, the folderName_$folder$ files are left-overs from the S3DistCp AWS utility used for moving files, which is orchestrated from within the EMR cluster. Those files are beyond our control and you can safely delete them: https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/.
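If it helps, here is a small local sketch of cleaning those markers out while leaving the real data files alone. Paths are hypothetical; on S3 the rough equivalent would be `aws s3 rm` with `--exclude "*" --include "*_\$folder\$"`:

```shell
# Simulate a shredded:good bucket locally (names are illustrative)
workdir=$(mktemp -d)
mkdir -p "$workdir/shredded/good/run=2019-01-01-00-00-00"
touch "$workdir/shredded/good/run=2019-01-01-00-00-00/part-00000"      # real data
touch "$workdir/shredded/good/run=2019-01-01-00-00-00_\$folder\$"      # S3DistCp marker

# Delete only the zero-byte *_$folder$ markers
find "$workdir" -name '*_$folder$' -delete

ls "$workdir/shredded/good"
```

Note the quoting: the literal `$folder$` in the file name has to be protected from shell expansion.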


Thank you @ihor for your explanation. I will have to try re-injecting these run=... folders one at a time first, I guess, since that feels cleaner, albeit more tedious, than aggregating all runs into a single folder. Plus, about 5 GB+ of data has accumulated since I started, so it seems too large to process in a single EMR run.

Hi, do you know if emr-etl-runner supports processing multiple run=... folders in one go? I am trying to copy all the shredded:archive run folders to the shredded:good bucket and process them all in one EMR job.

@aditya, I already replied to that question: you cannot do it.


Thanks @ihor, maybe my understanding was wrong. I thought we had to move them one at a time, but that they could all be processed together.

Sorry @ihor, I have more questions. Do I have to pay any special attention if I want to copy all shredded files into a single run=... folder? I realize that the shredded files are divided into part-0000, part-0001, part-0002, etc. If I combine them all, there will be multiple part-0000s, part-0001s, etc.

I am not sure whether this might affect the EMR ETL job somehow?

And I tried to use --resume-from rdb_load and --skip archive_enriched, but they don't work: it says --resume-from and --skip are mutually exclusive. Any idea?

@aditya, the last file moved with a conflicting filename will overwrite the previous file in the bucket. You can move all the files into a single run folder, but you need to be sure the names do not conflict, or you are going to lose data in that bucket (files will be overwritten if names match).
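Here is a minimal local sketch of that collision and one possible workaround: prefixing each part file with its run ID before combining, so names cannot clash. All run IDs and paths here are made up, and the renaming trick is only an illustration; you would want to verify that the loader accepts arbitrary file names before relying on it:

```shell
# Two archived runs, each with its own part-00000 (a guaranteed name clash)
workdir=$(mktemp -d)
for run in run=2019-01-01-00-00-00 run=2019-01-02-00-00-00; do
  mkdir -p "$workdir/archive/$run"
  echo "data from $run" > "$workdir/archive/$run/part-00000"
done

# Combine into a single run folder, prefixing each file with its run ID
# so the two part-00000 files cannot overwrite each other
mkdir -p "$workdir/combined/run=2019-01-03-00-00-00"
for run in "$workdir"/archive/run=*; do
  runid=$(basename "$run")
  for f in "$run"/part-*; do
    cp "$f" "$workdir/combined/run=2019-01-03-00-00-00/${runid}_$(basename "$f")"
  done
done

ls "$workdir/combined/run=2019-01-03-00-00-00"   # both files survive the merge
```

A plain `cp`/`mv` of both folders' contents into one directory would have silently kept only the last part-00000, which is exactly the data-loss case described above.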

Indeed, you cannot use --resume-from and --skip at the same time. Since you need to skip the archive_enriched step, you have to use the --skip option, providing the list of all the steps to skip. The (comma-separated) list of steps can be inferred from the appropriate dataflow diagram (depending on the mode you run EmrEtlRunner in; see the step names printed vertically next to the step numbers). Those steps are also listed in this wiki.
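For what it's worth, an invocation along these lines is what that advice amounts to. The step names before archive_enriched below are my guesses and will vary by pipeline release and run mode, so take them from the dataflow diagram / wiki for your version rather than from here; only --skip, rdb_load, and archive_enriched are confirmed in this thread:

```shell
# Illustrative only: skip every step before the load, plus archive_enriched,
# instead of combining --resume-from rdb_load with --skip (which is rejected).
# Step names other than archive_enriched are assumptions; check your version's docs.
./snowplow-emr-etl-runner run \
  --config config.yml \
  --resolver resolver.json \
  --skip staging,enrich,shred,elasticsearch,archive_raw,archive_enriched
```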

Thanks for the tips. Just for the record, I actually tried both methods: moving all archived data into a single run folder, and moving individual run folders into shredded/good.

I managed to run the load job successfully using both methods. However, the manifest table turns out erratic. With the first method, only the latest run is recorded in the manifest table.
With the second method, most of the runs are recorded, but some have duplicate etl_timestamp and event_count values, although the actual load job ran just fine with no data missing and no duplicates.

I guess you are right; it's better to run this one folder at a time unless there is some explanation for the behavior above.

@aditya, you can find insights into how the manifest table is used here: https://snowplowanalytics.com/blog/2018/08/29/snowplow-rdb-loader-r30-released-with-stability-improvements/#load-manifest. Note the section describing “Historical loading”.
