Doing additional ETL processing outside of Redshift/Postgres?

iain · April 27, 2017, 4:48pm

We’re looking at moving the majority of our modeling out of Redshift into something like Spark. Although building modeling scripts in SQL is good for rapid development, we’re missing the features of a ‘proper’ programming language.

We’re currently using the batch pipeline. Has anyone tried doing this? Would the best approach be just to load the Redshift outputted TSVs into a table in something like HBase? Or to use a connector to a Redshift cluster to pull the data out, process it and then push it back in again? Or something else entirely?

EDIT - The modeling we’re doing is based on clickpaths and collated user activity, so it’s not something we can do with the existing configurable enrichments API afaik.

anton · April 27, 2017, 5:41pm

Hello @iain,

Unfortunately, I don’t have an example of working solution for your problem, but I can suggest that run manifests we introduced in Python Analytics SDK 0.2.0 could be helpful for solving similar problems. Run manifests are lightweight mechanism that allows you to track what runs (directories inside enriched/archive) were processed by Spark (or other engines) inside your code with Python/Scala Analytics SDKs.

So, you could end up with somewhat similar to following pseudo-code:

run_ids = list_runids()
for run in run_ids:
    aggregated_data = process_enriched_data(run_id)
    store_to_postgres(aggregated_data)

End result will of course highly depend on what you want to do with your data, but at least this is some approach we’re currently exploring. I hope you’ll find it useful.

Cheers,
Anton

mike · April 28, 2017, 12:20am

You’ve got quite a few options here depending on the volume of data you are looking at modelling.

This isn’t an exhaustive list but some things to think about.

Use the Redshift JDBC driver inside of Spark to connect to Redshift directly. This might take a bit of tweaking but you can adjust how many rows are written/read to Redshift in batch. Most likely suitable for smaller volume sizes.
Load the data from enriched/shredded TSVs. This is likely a good option for small-moderate data sizes. As you are loading data from TSV you’ll take a little bit of a performance hit from parsing the TSV format and getting Spark to infer/cast to data types.
Convert the enriched/shredded TSVs to Parquet or a similar columnar (or strictly typed) format before loading into Spark (and optionally compress these files). You have a great upfront cost of converting these files from TSV to a columnar based format but you’ll take less of a hit on the performance side if you plan to process this data multiple times.

You may want to combine some of these options and also perform some data modelling in advance on Redshift.

Hopefully that gives you some ideas!

iain · April 28, 2017, 9:52am

Thanks - that’s given me some ideas to work with!

Topic		Replies	Views
Migrating from PostgreSQL to Redshift data store	11	1561	September 29, 2019
Rerun storage loader from archived files Storage targets	3	1463	January 29, 2017
Repopulate a single table? AWS batch pipeline (Legacy)	5	1265	January 16, 2018
Loading from S3's enriched events to PostgreSQL	4	1265	May 20, 2019
Redshift Spectrum instead of loading via COPY For engineers	3	1776	July 1, 2019

Doing additional ETL processing outside of Redshift/Postgres?

Related Topics