Data modeling using Map Reduce

Hey @shashankn91,

Good question!

In Snowplow, data modeling is done by running queries in the database that join events and compute new tables, which are then loaded back into the database.

That’s indeed how most of our users do it. They load their Shredded Events into Redshift and run a SQL data modeling process within Redshift that outputs a set of derived tables.
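To make that concrete, here's a minimal sketch of what such a SQL modeling step can look like, driven from Python. The `derived.sessions` table, the connection details and the choice of aggregations are illustrative assumptions on my part, not a canonical model; the `atomic.events` columns follow the standard Snowplow atomic schema.

```python
# A minimal sketch of a full-recompute SQL modeling step run against Redshift.
# Connection details and the derived.sessions table are placeholders.
import psycopg2

MODEL_SQL = """
CREATE SCHEMA IF NOT EXISTS derived;

DROP TABLE IF EXISTS derived.sessions;

CREATE TABLE derived.sessions AS
SELECT
    domain_sessionid    AS session_id,
    domain_userid       AS user_id,
    MIN(derived_tstamp) AS session_start,
    MAX(derived_tstamp) AS session_end,
    COUNT(*)            AS event_count
FROM atomic.events
WHERE domain_sessionid IS NOT NULL
GROUP BY 1, 2;
"""

conn = psycopg2.connect(
    host="your-cluster.redshift.amazonaws.com",
    port=5439,
    dbname="snowplow",
    user="modeler",
    password="...",
)
with conn.cursor() as cur:
    cur.execute(MODEL_SQL)  # Redshift runs the statements sequentially
conn.commit()
conn.close()
```

In practice most users schedule this kind of step with a tool like SQL Runner rather than hand-rolled Python, but the shape of the model is the same.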

The benefit of this approach is that it's fairly easy to get a SQL data model up and running and to iterate on it: the cluster is running 24/7 regardless, and the model simply recomputes over the full dataset on each run.

I usually recommend going with a SQL data modeling process if you don’t expect to quickly hit a billion total events. Above that threshold, I usually see performance or Redshift costs become an issue, and SQL also isn’t a great language for writing scalable and robust data modeling processes. You can still use SQL if you expect to collect more data than that, but I’d use it to prototype the data model before porting it to something like Spark. An alternative is to write an incremental model (tutorial); a rough sketch of that idea follows below.
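This is only a sketch of the general incremental pattern, not the code from the tutorial; the `derived.sessions` table and columns carry over from the example above, and it would be run against Redshift in the same way.

```python
# Rough sketch of an incremental SQL model: find the sessions touched by
# newly loaded events, recompute just those sessions over their full event
# history, and swap them into the derived table.
INCREMENTAL_SQL = """
-- Sessions that received events since the last run
CREATE TEMP TABLE touched_sessions AS
SELECT DISTINCT domain_sessionid AS session_id
FROM atomic.events
WHERE derived_tstamp > (
    SELECT COALESCE(MAX(session_end), '1970-01-01'::timestamp)
    FROM derived.sessions
);

-- Recompute only those sessions
CREATE TEMP TABLE recomputed_sessions AS
SELECT
    domain_sessionid    AS session_id,
    domain_userid       AS user_id,
    MIN(derived_tstamp) AS session_start,
    MAX(derived_tstamp) AS session_end,
    COUNT(*)            AS event_count
FROM atomic.events
WHERE domain_sessionid IN (SELECT session_id FROM touched_sessions)
GROUP BY 1, 2;

-- Replace the stale rows with the recomputed ones
DELETE FROM derived.sessions
WHERE session_id IN (SELECT session_id FROM recomputed_sessions);

INSERT INTO derived.sessions
SELECT * FROM recomputed_sessions;
"""
```

A production version would typically track which event batches have already been processed (for example in a manifest table) rather than relying on timestamps alone, but the overall shape is the same.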

An alternative approach, and one we’re very bullish on, is to run the data modeling process on the Enriched Events, upstream of Redshift. We’re particularly excited about Spark. My colleague Anton wrote an excellent post on the topic: Replacing Amazon Redshift with Apache Spark for event data modeling [tutorial]
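To give a flavour of what this looks like, here's a minimal PySpark sketch that reads enriched events from S3 and computes the same sessions table as the SQL examples above, entirely upstream of Redshift. The S3 paths and the column positions are illustrative assumptions; the enriched event TSV has well over a hundred fields, and in a real job you'd map them with a proper schema (or the Snowplow Analytics SDK) rather than by index.

```python
# A minimal sketch of running the data model on enriched events with Spark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-data-model").getOrCreate()

# Enriched events are stored in S3 as tab-separated files, one event per line
enriched = (
    spark.read
    .option("sep", "\t")
    .csv("s3://your-bucket/enriched/good/run=*/")
)

# The indexes below are placeholders; look up the actual positions of
# domain_userid, domain_sessionid and collector_tstamp in the canonical
# event model for your pipeline version.
events = enriched.select(
    F.col("_c15").alias("domain_userid"),
    F.col("_c123").alias("domain_sessionid"),
    F.col("_c3").cast("timestamp").alias("collector_tstamp"),
)

# Same sessions model as the SQL version, expressed on DataFrames
sessions = (
    events
    .where(F.col("domain_sessionid").isNotNull())
    .groupBy("domain_sessionid", "domain_userid")
    .agg(
        F.min("collector_tstamp").alias("session_start"),
        F.max("collector_tstamp").alias("session_end"),
        F.count(F.lit(1)).alias("event_count"),
    )
)

# Write the derived table somewhere it can be loaded or queried directly
sessions.write.mode("overwrite").parquet("s3://your-bucket/derived/sessions/")
```

From there you can COPY the derived tables into Redshift, or query them directly and skip Redshift for the modeling step altogether, which is the direction Anton's post explores.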
