Data modeling using Map Reduce

Hey @shashankn91,

Good question!

In Snowplow, data modeling is done by running queries in the database that join events and compute new tables, which are then loaded back into the database.

That’s indeed how most of our users do it. They load their Shredded Events into Redshift and run a SQL data modeling process within Redshift that outputs a set of derived tables.
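To make that concrete, here's a minimal sketch of what such a SQL modeling step can look like, driven from Python. The `derived.sessions` table, the connection details and the choice of aggregations are illustrative assumptions on my part, not a canonical model; the `atomic.events` columns follow the standard Snowplow atomic schema.

```python
# A minimal sketch of a full-recompute SQL modeling step run against Redshift.
# Connection details and the derived.sessions table are placeholders.
import psycopg2

MODEL_SQL = """
CREATE SCHEMA IF NOT EXISTS derived;

DROP TABLE IF EXISTS derived.sessions;

CREATE TABLE derived.sessions AS
SELECT
    domain_sessionid    AS session_id,
    domain_userid       AS user_id,
    MIN(derived_tstamp) AS session_start,
    MAX(derived_tstamp) AS session_end,
    COUNT(*)            AS event_count
FROM atomic.events
WHERE domain_sessionid IS NOT NULL
GROUP BY 1, 2;
"""

conn = psycopg2.connect(
    host="your-cluster.redshift.amazonaws.com",
    port=5439,
    dbname="snowplow",
    user="modeler",
    password="...",
)
with conn.cursor() as cur:
    cur.execute(MODEL_SQL)  # Redshift runs the statements sequentially
conn.commit()
conn.close()
```

In practice most users schedule this kind of step with a tool like SQL Runner rather than hand-rolled Python, but the shape of the model is the same.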

The benefit of this approach is that it's fairly easy to get a SQL data model up and running and to iterate on it: the cluster is running 24/7 regardless, and the model simply recomputes over the full dataset on each run.

I usually recommend going with a SQL data modeling process if you don’t expect to quickly hit a billion total events. Above that threshold, I usually see performance or Redshift costs become an issue, and SQL also isn’t a great language for writing scalable and robust data modeling processes. You can still use SQL if you expect to collect more data than that, but I’d use it to prototype the data model before porting it to something like Spark. An alternative is to write an incremental model (tutorial); a rough sketch of that idea follows below.
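This is only a sketch of the general incremental pattern, not the code from the tutorial; the `derived.sessions` table and columns carry over from the example above, and it would be run against Redshift in the same way.

```python
# Rough sketch of an incremental SQL model: find the sessions touched by
# newly loaded events, recompute just those sessions over their full event
# history, and swap them into the derived table.
INCREMENTAL_SQL = """
-- Sessions that received events since the last run
CREATE TEMP TABLE touched_sessions AS
SELECT DISTINCT domain_sessionid AS session_id
FROM atomic.events
WHERE derived_tstamp > (
    SELECT COALESCE(MAX(session_end), '1970-01-01'::timestamp)
    FROM derived.sessions
);

-- Recompute only those sessions
CREATE TEMP TABLE recomputed_sessions AS
SELECT
    domain_sessionid    AS session_id,
    domain_userid       AS user_id,
    MIN(derived_tstamp) AS session_start,
    MAX(derived_tstamp) AS session_end,
    COUNT(*)            AS event_count
FROM atomic.events
WHERE domain_sessionid IN (SELECT session_id FROM touched_sessions)
GROUP BY 1, 2;

-- Replace the stale rows with the recomputed ones
DELETE FROM derived.sessions
WHERE session_id IN (SELECT session_id FROM recomputed_sessions);

INSERT INTO derived.sessions
SELECT * FROM recomputed_sessions;
"""
```

A production version would typically track which event batches have already been processed (for example in a manifest table) rather than relying on timestamps alone, but the overall shape is the same.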

An alternative approach, and one we’re very bullish on, is to run the data modeling process on the Enriched Events, upstream of Redshift. We’re particularly excited about Spark. My colleague Anton wrote an excellent post on the topic: Replacing Amazon Redshift with Apache Spark for event data modeling [tutorial]
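To give a flavour of what this looks like, here's a minimal PySpark sketch that reads enriched events from S3 and computes the same sessions table as the SQL examples above, entirely upstream of Redshift. The S3 paths and the column positions are illustrative assumptions; the enriched event TSV has well over a hundred fields, and in a real job you'd map them with a proper schema (or the Snowplow Analytics SDK) rather than by index.

```python
# A minimal sketch of running the data model on enriched events with Spark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-data-model").getOrCreate()

# Enriched events are stored in S3 as tab-separated files, one event per line
enriched = (
    spark.read
    .option("sep", "\t")
    .csv("s3://your-bucket/enriched/good/run=*/")
)

# The indexes below are placeholders; look up the actual positions of
# domain_userid, domain_sessionid and collector_tstamp in the canonical
# event model for your pipeline version.
events = enriched.select(
    F.col("_c15").alias("domain_userid"),
    F.col("_c123").alias("domain_sessionid"),
    F.col("_c3").cast("timestamp").alias("collector_tstamp"),
)

# Same sessions model as the SQL version, expressed on DataFrames
sessions = (
    events
    .where(F.col("domain_sessionid").isNotNull())
    .groupBy("domain_sessionid", "domain_userid")
    .agg(
        F.min("collector_tstamp").alias("session_start"),
        F.max("collector_tstamp").alias("session_end"),
        F.count(F.lit(1)).alias("event_count"),
    )
)

# Write the derived table somewhere it can be loaded or queried directly
sessions.write.mode("overwrite").parquet("s3://your-bucket/derived/sessions/")
```

From there you can COPY the derived tables into Redshift, or query them directly and skip Redshift for the modeling step altogether, which is the direction Anton's post explores.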
