We’re looking at moving the majority of our modeling out of Redshift into something like Spark. Although building modeling scripts in SQL is good for rapid development, we’re missing the features of a ‘proper’ programming language.
We’re currently using the batch pipeline. Has anyone tried doing this? Would the best approach be simply to load the TSVs that Redshift outputs into a table in something like HBase? Or to use a connector to pull the data out of the Redshift cluster, process it, and push it back in again? Or something else entirely?
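For the connector option, something like the following is roughly what I have in mind: a minimal sketch using Spark's generic JDBC data source. The cluster URL, credentials, and table names below are placeholders for illustration, not our actual setup, and this obviously needs a live Redshift cluster to run against:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-modeling").getOrCreate()

# Placeholder connection details -- not a real cluster.
jdbc_url = "jdbc:redshift://example-cluster:5439/warehouse"

# Pull a modeling input table out of Redshift over JDBC.
events = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", "atomic.events")       # placeholder table name
          .option("user", "spark_user")
          .option("password", "...")
          .load())

# ... do the clickpath / user-activity modeling in Spark here ...
modeled = events  # placeholder for the real transformations

# Push the derived table back into Redshift.
(modeled.write.format("jdbc")
 .option("url", jdbc_url)
 .option("dbtable", "derived.user_activity")        # placeholder table name
 .option("user", "spark_user")
 .option("password", "...")
 .mode("overwrite")
 .save())
```

The plain JDBC source works but round-trips everything through the cluster's leader node; a dedicated connector that unloads to S3 and reads from there would presumably scale better, which is part of what I'm asking about.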
EDIT - The modeling we’re doing is based on clickpaths and collated user activity, so as far as I know it’s not something we can do with the existing configurable enrichments API.
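To illustrate the kind of thing I mean by clickpath modeling, here is a toy sketch in plain Python (not our actual code): group each user's page views into sessions whenever the gap between consecutive events exceeds 30 minutes. It's this sort of stateful, per-user logic that's pushing us away from pure SQL:

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(events):
    """Group one user's (timestamp, page) events into sessions.

    A new session starts whenever the gap between consecutive
    events exceeds SESSION_TIMEOUT.
    """
    events = sorted(events)
    sessions = []
    for ts, page in events:
        if sessions and ts - sessions[-1][-1][0] <= SESSION_TIMEOUT:
            sessions[-1].append((ts, page))
        else:
            sessions.append([(ts, page)])
    return sessions

clicks = [
    (datetime(2016, 1, 1, 9, 0), "/home"),
    (datetime(2016, 1, 1, 9, 5), "/pricing"),
    (datetime(2016, 1, 1, 13, 0), "/home"),  # >30 min gap: new session
]
print([len(s) for s in sessionize(clicks)])  # → [2, 1]
```

The real versions of this involve joining activity across several event types per user, which is where SQL gets unwieldy for us.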