Modeling, Deduplication and Architectures


#1

I am new to Snowplow and my team told me they were having performance issues. I started data profiling and noted over 484,000,000 rows collected in four days. I also noted that in many tables such as a brand table, the rows were duplicated. From a logical modeling standpoint, there really is only one brand. That brand can have many events. Joining data with this level of duplication would impact performance.

Why was the decision to model event as the parent table? Was this to make it simpler for people to add new tables to the model (user defined contexts)?

If it is not possible to change this duplication of data in Snowplow, what are the common processes, tools used by Snowplow users on Redshift to move data from Snowplow to a dimensional model? How much of the Snowplow user base uses Apache Hadoop, Apache Spark and Kafka for stream processing?

Thanks,
Paula


#3

Hi Paula,

let me try to tackle your questions:

Why was the decision to model event as the parent table? Was this to make it simpler for people to add new tables to the model (user defined contexts)?

  • Yes you are right this is for defining custom context, as things stand right now I don’t think there is any way to extend base event table apart from using the context tables which have to be joined back to the main event on Redshift at least. There’s another option for storing some custom data with struct_events but it only has 5 columns (prefixed se_) that can be used and usually is accompanied with extra contexts as well.

Joining data with this level of duplication would impact performance.

  • can you elaborate on which data is duplicated exactly? Do you see many events that generate the same event_id?

move data from Snowplow to a dimensional model?

  • I’m not sure how you envision this model, but from my guess you would like to have some sort of more cleaner version of the clickstream event data flattened out in 1 table? I think you would need to run additional batch transforms that would produce another unified_log kind of table that have these extra dimensions. Maybe its possible to extend standard snowplow batch pipeline to include this extra step.

How much of the Snowplow user base uses Apache Hadoop, Apache Spark and Kafka for stream processing?

  • We’re using Kafka atm for as a store target for the stream collector and then pushed to BigQuery. I’m pretty sure there are many uses that use similar or other alternative configurations.