Rerun storage loader from archived files

fwahlqvist · January 28, 2017, 12:11pm

Hey all,
I'm a big supporter of Snowplow Analytics and have made sure many of my clients use it. I normally don't deploy it myself, as I have great people to support me on that, but I'm trying to learn more and have had my own setup running with CloudFront for a couple of months.
I have also decided to enable Redshift to learn more about this area of deployments.

Up to now I have been running the ETL ad hoc with --skip download,load and accessing the data directly from S3. Now I would like to load all of the data into Redshift, and I'm looking for a pointer in the right direction.

My buckets are

```
enriched:
  good: s3://xxx/enriched/good
  bad: s3://xxx/enriched/bad
  errors: s3://xxx/enriched/errors
  archive: s3://xxx/enriched/archive
shredded:
  good: s3://xxx/shredded/good
  bad: s3://xxx/shredded/bad
  errors: s3://xxx/shredded/errors
  archive: s3://xxx/shredded/archive
```

So far I have only loaded the most recent run into Redshift, to make sure it works.

Can anyone give me a pointer on how to rerun the StorageLoader so that my Redshift gets fully populated? (A basic question, hopefully.)

Many thanks

alex · January 28, 2017, 12:45pm

Hey @fwahlqvist - at the moment this is unfortunately quite difficult. The reason is that the enrich-and-load process is a top-down stop-the-world process - the different components aren’t (yet) sufficiently decoupled to be able to “rewind” the StorageLoader to the start of your archive and perform all those loads.

The good news is that you at least have all the data in the Redshift-specific format, in your shredded archive. So we can probably gaffer tape something together.

I haven’t tried this, but I think you want to write a script, pseudo-code as follows:

  for each run=xxx in s3://xxx/shredded/archive:
    copy the run=xxx to s3://xxx/shredded/good/run=xxx
    run the StorageLoader

This should load each run one-by-one. Make sure to pause regular operation of the pipeline while you run this script, and once it has completed, re-enable the load step in StorageLoader.
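As a rough illustration, the pseudo-code above could be sketched in Python along these lines. This is untested against a real pipeline and rests on several assumptions: the AWS CLI is installed and configured, run folders are named so that sorting them alphabetically gives chronological order, and `loader_cmd` is whatever command you normally use to start the StorageLoader. The function names `plan_reload` and `execute`, the example run folder names, and the placeholder loader command are all hypothetical.

```python
import subprocess
from typing import Callable, Iterable, List, Sequence

# Bucket paths as given in the question above.
ARCHIVE = "s3://xxx/shredded/archive"
GOOD = "s3://xxx/shredded/good"


def plan_reload(runs: Iterable[str], loader_cmd: Sequence[str],
                archive: str = ARCHIVE, good: str = GOOD) -> List[List[str]]:
    """Build the commands in order: for each run folder, one copy then one load."""
    commands: List[List[str]] = []
    for run in sorted(runs):  # oldest first, assuming timestamped folder names
        commands.append(["aws", "s3", "cp", "--recursive",
                         f"{archive}/{run}/", f"{good}/{run}/"])
        commands.append(list(loader_cmd))
    return commands


def execute(commands: Iterable[Sequence[str]],
            runner: Callable[[Sequence[str]], object] = subprocess.check_call) -> None:
    """Run each command in turn; check_call raises on the first non-zero exit."""
    for cmd in commands:
        runner(cmd)


if __name__ == "__main__":
    # Hypothetical run folder names and loader command, for illustration only.
    runs = ["run=2017-01-02-00-00-00", "run=2017-01-01-00-00-00"]
    loader = ["echo", "replace with your StorageLoader invocation"]
    for cmd in plan_reload(runs, loader):
        print(" ".join(cmd))  # dry run: print the plan instead of executing it
```

Printing the plan first makes it easy to check the copy paths before anything touches S3. Passing the result to `execute` would then run the steps for real, stopping at the first command that fails so a broken load does not get buried under later ones.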

Let us know how you get on!

fwahlqvist · January 28, 2017, 5:57pm

Hey @Alex,
Thank you for the quick answer. I did a quick test with a simple bash script.

removed bash script because of formatting errors

Before I ran the script I truncated the relevant tables, and on initial inspection the StorageLoader populated them fine.
My current understanding, from a quick review of the data, is that the enrichment process runs first and then the shredding process, so I currently don't see a need to do anything with the enriched folders.
Do you have any more documentation on the different steps in the StorageLoader, or is the code the best place to look?

Many thanks

alex · January 29, 2017, 12:27pm

Hi @fwahlqvist - this is the definitive page on the batch pipeline steps:
