Doing additional ETL processing outside of Redshift/Postgres?

We’re looking at moving the majority of our modeling out of Redshift into something like Spark. Although building modeling scripts in SQL is good for rapid development, we’re missing the features of a ‘proper’ programming language.

We’re currently using the batch pipeline. Has anyone tried doing this? Would the best approach be to load the TSVs the pipeline outputs for Redshift into a table in something like HBase? Or to use a connector to pull the data out of the Redshift cluster, process it, and then push it back in again? Or something else entirely? 🙂

EDIT - The modeling we’re doing is based on clickpaths and collated user activity, so it’s not something we can do with the existing configurable enrichments API afaik.

Hello @iain,

Unfortunately, I don’t have an example of a working solution for your problem, but I can suggest that the run manifests we introduced in Python Analytics SDK 0.2.0 could be helpful for solving problems like this. Run manifests are a lightweight mechanism that lets you track which runs (directories inside enriched/archive) have already been processed by Spark (or another engine), from your own code via the Python or Scala Analytics SDKs.

So you could end up with something similar to the following pseudo-code:

# List the run directories, process each one, store the results
run_ids = list_runids()
for run_id in run_ids:
    aggregated_data = process_enriched_data(run_id)
    store_to_postgres(aggregated_data)

The end result will of course depend heavily on what you want to do with your data, but this is at least one approach we’re currently exploring. I hope you find it useful.
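
To make that slightly more concrete, here is a rough sketch of how the run-manifest helpers from the Python Analytics SDK could be wired in. The DynamoDB table name and S3 path are placeholders, process_enriched_data / store_to_postgres are the stand-in functions from the pseudo-code above, and it’s worth checking the exact helper names against the SDK docs for your version:

import boto3

# Run-manifest helpers from the Python Analytics SDK (0.2.0+)
from snowplow_analytics_sdk.run_manifests import RunManifests, list_runids

DYNAMODB_TABLE = "snowplow-run-manifests"              # placeholder table name
ENRICHED_ARCHIVE = "s3://my-bucket/enriched/archive/"  # placeholder S3 path

s3 = boto3.client("s3")
dynamodb = boto3.client("dynamodb")

run_manifests = RunManifests(dynamodb, DYNAMODB_TABLE)
run_manifests.create()  # one-off setup of the manifest table

# Only process run directories we haven't seen before
for run_id in list_runids(s3, ENRICHED_ARCHIVE):
    if not run_manifests.contains(run_id):
        aggregated_data = process_enriched_data(run_id)  # your Spark job / modelling step
        store_to_postgres(aggregated_data)               # or write back to Redshift
        run_manifests.add(run_id)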

Cheers,
Anton


You’ve got quite a few options here depending on the volume of data you are looking at modelling.

This isn’t an exhaustive list, but here are some things to think about:

  • Use the Redshift JDBC driver inside Spark to connect to Redshift directly. This might take a bit of tweaking, but you can adjust how many rows are read from/written to Redshift in each batch. Most likely suitable for smaller data volumes.

  • Load the data from the enriched/shredded TSVs. This is likely a good option for small to moderate data sizes. As you’re loading data from TSV, you’ll take a bit of a performance hit from parsing the TSV format and getting Spark to infer or cast the data types.

  • Convert the enriched/shredded TSVs to Parquet or a similar columnar (or strictly typed) format before loading into Spark, optionally compressing the files as well. There’s a greater upfront cost in converting the files from TSV to a columnar format, but you’ll take less of a performance hit if you plan to process the data multiple times. A rough PySpark sketch of the first and third options follows below.
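
For reference, here’s a rough PySpark sketch of what the first and third options could look like. The connection details, S3 paths and table names are placeholders, and the Redshift JDBC driver would need to be available on the Spark classpath (e.g. via --jars):

from pyspark.sql import SparkSession

# Placeholder connection/location details
REDSHIFT_JDBC_URL = "jdbc:redshift://example-cluster:5439/snowplow"
REDSHIFT_USER = "spark_user"
REDSHIFT_PASSWORD = "..."
ENRICHED_TSV_PATH = "s3://my-bucket/enriched/archive/"
PARQUET_PATH = "s3://my-bucket/enriched-parquet/"

spark = SparkSession.builder.appName("snowplow-modelling").getOrCreate()

# Option 1: read straight from Redshift over JDBC.
# fetchsize / batchsize control how many rows move per round trip.
events = (spark.read
    .format("jdbc")
    .option("url", REDSHIFT_JDBC_URL)
    .option("dbtable", "atomic.events")
    .option("user", REDSHIFT_USER)
    .option("password", REDSHIFT_PASSWORD)
    .option("fetchsize", 10000)
    .load())

# Option 3: parse the enriched TSVs once, then persist them as compressed
# Parquet so repeated modelling runs skip the TSV-parsing cost.
raw = (spark.read
    .option("sep", "\t")
    .option("inferSchema", True)
    .csv(ENRICHED_TSV_PATH))

raw.write.mode("overwrite").option("compression", "snappy").parquet(PARQUET_PATH)

# Later jobs read the columnar copy instead of re-parsing the TSVs
modelled_input = spark.read.parquet(PARQUET_PATH)

# Modelled output can go back into Redshift over the same JDBC connection
(modelled_input.write
    .format("jdbc")
    .option("url", REDSHIFT_JDBC_URL)
    .option("dbtable", "derived.user_activity")  # hypothetical target table
    .option("user", REDSHIFT_USER)
    .option("password", REDSHIFT_PASSWORD)
    .option("batchsize", 10000)
    .mode("append")
    .save())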

You may want to combine some of these options and also perform some data modelling in advance on Redshift.

Hopefully that gives you some ideas!


Thanks - that’s given me some ideas to work with!