I am new to Snowplow, and my team told me they were having performance issues. I started data profiling and noted over 484,000,000 rows collected in four days. I also noted that in many tables, such as a brand table, the rows were duplicated. From a logical modeling standpoint there is really only one brand, and that brand can have many events. Joining data with this level of duplication would impact performance.
Why was the decision made to model the event as the parent table? Was this to make it simpler for people to add new tables (user-defined contexts) to the model?
If it is not possible to change this duplication of data in Snowplow, what are the common processes and tools that Snowplow users on Redshift use to move data from Snowplow into a dimensional model? How much of the Snowplow user base uses Apache Hadoop, Apache Spark, and Kafka for stream processing?
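To make the question concrete, below is the kind of transformation step I imagine people scripting, sketched with PySpark since Spark came up above. It is only a rough sketch: the table, column, and S3 path names (brand_context, brand_id, brand_name, the bucket locations) are placeholders I made up, not our real schema.

```python
from pyspark.sql import SparkSession

# Minimal sketch: collapse a per-event brand context into a one-row-per-brand
# dimension table. All paths and column names below are placeholders.
spark = SparkSession.builder.appName("brand-dimension-sketch").getOrCreate()

# Hypothetical unload of the brand context table (e.g. from Redshift to S3 as Parquet).
brand_context = spark.read.parquet("s3://example-bucket/snowplow/brand_context/")

# Keep one row per brand instead of one row per event.
dim_brand = (
    brand_context
    .select("brand_id", "brand_name")
    .dropDuplicates(["brand_id"])
)

# Write the deduplicated dimension back out for loading into the warehouse.
dim_brand.write.mode("overwrite").parquet("s3://example-bucket/warehouse/dim_brand/")

spark.stop()
```

Is something along these lines what most teams do, or is there a more standard toolchain for this?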
Thanks,
Paula