Redshift Spectrum instead of loading via COPY

We are using a Redshift dc2.large instance and are quickly reaching its storage limit. It makes sense for us to offload the atomic schema into Redshift Spectrum, as we don't query this data often and mainly use it for data modelling once a day.

Is there an existing ETL process that can transform the data into Spectrum-ready files?


Hey @trung,

You can use your archive files in Spectrum directly, or convert them to Parquet first (using Glue). Have a look at this article for some ideas: https://snowplowanalytics.com/blog/2019/04/04/use-glue-and-athena-with-snowplow-data/
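If you go the Glue route, a job along these lines could do the conversion. This is only a minimal sketch, assuming the enriched archive has already been crawled into a Glue database; the database, table, bucket path and partition key below are placeholders for whatever your setup uses.

```python
# Minimal AWS Glue job sketch (PySpark): convert Snowplow archive files from
# S3 into Parquet, partitioned by run. Database/table/bucket names are
# placeholders, not real resources.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the archive as a dynamic frame; "snowplow_db"/"enriched_archive"
# would come from a crawler you point at the archive bucket.
source = glue_context.create_dynamic_frame.from_catalog(
    database="snowplow_db",
    table_name="enriched_archive",
)

# Write it back out as Parquet, partitioned so Spectrum/Athena can prune.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={
        "path": "s3://my-snowplow-archive/parquet/events/",
        "partitionKeys": ["run"],
    },
    format="parquet",
)

job.commit()
```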

This works for a one-off load, but I guess one has to write a scheduled job to add new partitions as the data comes in.

Also, is it possible to not run the Redshift COPY job at all and instead replace it with a job that creates Spectrum partitions? Would we still get the deduplication benefits that the storage loading process provides?

Yes, and AWS Glue makes that rather easy.
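For example, a small scheduled job (cron, or a daily Lambda) could simply kick off a Glue crawler so that newly landed run folders get registered as partitions. A minimal sketch, assuming you have already pointed a crawler at the Parquet archive; the crawler name is a placeholder:

```python
# Sketch of a daily partition-refresh job. Starting the crawler re-scans the
# S3 prefix and adds any new partitions to the Glue Data Catalog, which
# Spectrum and Athena both read from. "snowplow-parquet-crawler" is a
# placeholder name.
import boto3

glue = boto3.client("glue")

def add_new_partitions(event=None, context=None):
    glue.start_crawler(Name="snowplow-parquet-crawler")
```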

In theory you could, but Spectrum will not be as performant, I would think, as having the data locally in Redshift's native format. What I would do in your case is keep only the most recent data, up to what your storage allows, and keep adding partitions for Spectrum.
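As a rough sketch of that pattern: the connection details, external schema, retention window and IAM role ARN below are all assumptions about your setup, not anything Snowplow-specific.

```python
# "Recent data local, history in Spectrum": expose the Glue catalog as an
# external schema, then trim atomic.events to a retention window so older
# data is served from the Parquet partitions instead.
import os

import psycopg2

conn = psycopg2.connect(
    host=os.environ["REDSHIFT_HOST"],
    port=5439,
    dbname="snowplow",
    user=os.environ["REDSHIFT_USER"],
    password=os.environ["REDSHIFT_PASSWORD"],
)
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()

# One-off: make the Glue database queryable from Redshift as spectrum.<table>.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG
    DATABASE 'snowplow_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
""")

# Daily: keep only the most recent 90 days in atomic.events (placeholder
# window); anything older stays available through the Spectrum partitions.
cur.execute("""
    DELETE FROM atomic.events
    WHERE collector_tstamp < dateadd(day, -90, current_date)
""")
cur.execute("VACUUM DELETE ONLY atomic.events")

cur.close()
conn.close()
```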

I would really need to look closely at the loader's deduplication to tell you whether that would be included, but I would expect it would be if you are reading from the shredded archive.