Recalculating big data models - how to do it with better performance?


#1

Hi, everyone!

We are having some trouble recalculating a big data model. We already use incremental queries so the data model doesn't have to be recalculated in full each day, but when we add new rows or change the data model in some way, we have to recalculate from scratch, which in this case takes 7-8 hours and sometimes never finishes.
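To illustrate the kind of incremental pattern I mean (a simplified sketch with made-up table, column, and connection names, not our actual model), each daily run only processes events newer than a watermark kept in a manifest table:

```python
import psycopg2

INCREMENTAL_SQL = """
-- Only aggregate events that arrived after the last successful run,
-- tracked in a small manifest table (all names here are hypothetical).
-- A real model would also reconcile sessions that span two runs.
INSERT INTO derived.sessions
SELECT
    domain_sessionid,
    MIN(collector_tstamp) AS session_start,
    MAX(collector_tstamp) AS session_end,
    COUNT(*)              AS event_count
FROM atomic.events
WHERE collector_tstamp > (SELECT MAX(processed_until) FROM derived.manifest)
GROUP BY domain_sessionid;

-- Record how far this run got so the next run can pick up from there.
INSERT INTO derived.manifest (processed_until)
SELECT MAX(collector_tstamp) FROM atomic.events;
"""

def run_incremental_update(dsn: str) -> None:
    """Run one incremental pass instead of rebuilding the whole model."""
    with psycopg2.connect(dsn) as conn:      # commits on successful exit
        with conn.cursor() as cur:
            cur.execute(INCREMENTAL_SQL)

if __name__ == "__main__":
    run_incremental_update("dbname=warehouse user=loader host=example.internal")
```

This works fine for the daily runs; the problem is what to do when the model definition itself changes.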

Is anyone else having this issue too? How can I solve it? Any recommendations?


#3

Hi @danielparedes,

but when we add new rows […] we have to recalculate from scratch

I’m not sure I follow. If the model is incremental, it shouldn’t have to recalculate from scratch if new rows are added?

On the broader point, that's indeed a limitation of this kind of model: it's hard to make large changes without fully recomputing from scratch. It might be possible to develop a hybrid approach, where the most expensive steps are done incrementally while others are recomputed from scratch each time, striking a balance between efficiency and flexibility.
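To make the hybrid idea concrete, here is a rough sketch (in PySpark, with made-up paths and column names, not a drop-in implementation): the expensive event-level aggregation runs incrementally from a watermark, while the cheaper downstream rollup is rebuilt in full on every run.

```python
# Rough sketch of a hybrid model run in PySpark. All paths, table and
# column names are hypothetical; error handling, schema management and
# the first-run (empty manifest) case are omitted for brevity.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hybrid-model-run").getOrCreate()

# --- Incremental step (expensive): only aggregate events newer than the
# --- watermark recorded by the previous run.
watermark = (
    spark.read.parquet("s3://example-bucket/model/manifest")
    .agg(F.max("processed_until"))
    .first()[0]
)

new_sessions = (
    spark.read.parquet("s3://example-bucket/events")
    .where(F.col("collector_tstamp") > F.lit(watermark))
    .groupBy("domain_sessionid")
    .agg(
        F.min("collector_tstamp").alias("session_start"),
        F.max("collector_tstamp").alias("session_end"),
        F.count("*").alias("event_count"),
    )
)
new_sessions.write.mode("append").parquet("s3://example-bucket/model/sessions")

# Advance the watermark so the next run starts from here.
new_watermark = (
    spark.read.parquet("s3://example-bucket/events")
    .agg(F.max("collector_tstamp").alias("processed_until"))
)
new_watermark.write.mode("append").parquet("s3://example-bucket/model/manifest")

# --- Full-recompute step (cheap): the daily rollup is rebuilt from the
# --- sessions table every run, so its logic can change freely without
# --- ever touching the raw events again.
sessions = spark.read.parquet("s3://example-bucket/model/sessions")
daily_summary = (
    sessions.groupBy(F.to_date("session_start").alias("session_date"))
    .agg(
        F.count("*").alias("sessions"),
        F.sum("event_count").alias("events"),
    )
)
daily_summary.write.mode("overwrite").parquet("s3://example-bucket/model/daily_summary")
```

The point of the split is that changes to the rollup layer only require re-running the cheap step; only changes to the session definition itself would force reprocessing the full event history.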

As a sidenote, we’re bullish on Spark as a more scalable and reliable alternative to Redshift data modeling: Replacing Amazon Redshift with Apache Spark for event data modeling [tutorial]