Rerun storage loader from archived files


#1

Hey all,
I'm a great supporter of Snowplow Analytics and have made sure many of my clients are using it. I normally don't deploy it myself, as I have great people to support me on this, but I'm trying to learn more and have had my own setup running with CloudFront for a couple of months.
I have also decided to enable Redshift to learn more about this area of deployments.

Up to now I have been running the ETL ad hoc with --skip download,load and accessing the data directly from S3. Now I would like to load all the data into Redshift, and I'm looking for a pointer in the right direction.

My buckets are:

  enriched:
    good: s3://xxx/enriched/good
    bad: s3://xxx/enriched/bad
    errors: s3://xxx/enriched/errors
    archive: s3://xxx/enriched/archive
  shredded:
    good: s3://xxx/shredded/good
    bad: s3://xxx/shredded/bad
    errors: s3://xxx/shredded/errors
    archive: s3://xxx/shredded/archive

So far I have only loaded the most recent run into Redshift, to make sure it works.

Can anyone give me a pointer on how to rerun the StorageLoader so that my Redshift gets fully populated? (A basic question, hopefully.)

Many thanks


#2

Hey @fwahlqvist - at the moment this is unfortunately quite difficult. The reason is that the enrich-and-load process is a top-down stop-the-world process - the different components aren’t (yet) sufficiently decoupled to be able to “rewind” the StorageLoader to the start of your archive and perform all those loads.

The good news is that you at least have all the data in the Redshift-specific format, in your shredded archive. So we can probably gaffer tape something together.

I haven’t tried this, but I think you want to write a script, pseudo-code as follows:

  for each run=xxx in s3://xxx/shredded/archive:
    copy the run=xxx to s3://xxx/shredded/good/run=xxx
    run the StorageLoader

This should load each run one-by-one. Make sure to pause regular operation of the pipeline while you run this script, and once it has completed, re-enable the load step in StorageLoader.
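The pseudo-code above could be sketched in Python roughly as follows. This is an untested sketch, not an official Snowplow tool: it assumes the AWS CLI is installed and configured, that run folders are named `run=<timestamp>` so that sorting them by name gives chronological order, and that `LOADER_CMD` (a hypothetical placeholder) is replaced with however you normally invoke the StorageLoader. The function names are made up for illustration.

```python
import subprocess
from typing import List

ARCHIVE = "s3://xxx/shredded/archive"
GOOD = "s3://xxx/shredded/good"

# Hypothetical placeholder: replace with the command you normally use
# to start the StorageLoader in your own setup.
LOADER_CMD = ["./snowplow-storage-loader", "--config", "config.yml"]


def parse_runs(ls_output: str) -> List[str]:
    """Extract run=... folder names from `aws s3 ls` output, sorted by name."""
    runs = []
    for line in ls_output.splitlines():
        parts = line.split()
        # Folders (prefixes) are listed as: "PRE run=.../"
        if len(parts) == 2 and parts[0] == "PRE" and parts[1].startswith("run="):
            runs.append(parts[1].rstrip("/"))
    return sorted(runs)


def reload_commands(run: str) -> List[List[str]]:
    """Commands for one run: copy it back to shredded/good, then load it."""
    return [
        ["aws", "s3", "cp", "--recursive", f"{ARCHIVE}/{run}/", f"{GOOD}/{run}/"],
        LOADER_CMD,
    ]


def reload_all() -> None:
    """List the archived runs and reload them one by one, stopping on error."""
    listing = subprocess.run(
        ["aws", "s3", "ls", f"{ARCHIVE}/"],
        check=True, capture_output=True, text=True,
    ).stdout
    for run in parse_runs(listing):
        for cmd in reload_commands(run):
            subprocess.run(cmd, check=True)
```

Calling `reload_all()` would then process the archive one run at a time. Because `check=True` raises on the first failing command, a failed copy or load stops the loop rather than carrying on to later runs.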

Let us know how you get on!


#3

Hey @Alex,
Thank you for the quick answer. I did a quick test with a simple bash script.

(Bash script removed because of formatting errors.)

Before I ran the script I truncated the relevant tables, and on initial inspection the StorageLoader populated them fine.
My current understanding, based on a quick review of the data, is that the enrichment process runs first and then the shredding process, so I currently don't see a need to do anything with the enriched folders.
Do you have any more documentation on the different steps in the StorageLoader, or is the code the best place to look?

Many thanks


#4

Hi @fwahlqvist - this is the definitive page on the batch pipeline steps: