We have a homegrown AB test script that assigns users to variants for experiments running on the site. Right now, we:
- Create a custom context to keep track of the experiment/variant groups a user has been selected into, and attach it to trackPageView so that all page-view and page-ping events contain that context (see the sketch after this list).
- Build a set of modeled tables, based on the example SQL files provided by @christophe, that all business reporting runs off. One of these tables maps the web_page_id to the experiment/variant selections for the user on a given page view; all other business reporting tables can join to it when we want to segment by AB test groups.
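For reference, the tracking side looks roughly like this (the schema name and fields here are illustrative placeholders, not our real schema):

```javascript
// Current setup (sketch): attach the AB test selections as a custom
// context on trackPageView. The context array passed to trackPageView
// is also attached to every page-ping event generated by activity
// tracking, which is why the context table grows so fast.
var abTestContext = {
  schema: 'iglu:com.acme/ab_test_selection/jsonschema/1-0-0', // placeholder
  data: {
    experimentId: 'homepage-hero', // placeholder
    variant: 'B'
  }
};

window.snowplow('enableActivityTracking', 10, 10); // heartbeat page pings
window.snowplow('trackPageView', null, [abTestContext]);
```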
The issue we are having is that, even with a moderate amount of data, the data modeling step that builds the temp tables to track test variants is the longest-running step (over half an hour), and we've noticed that the custom context table is growing faster than even the atomic events table (i.e. for 54 million rows in the events table, there are currently about 86 million rows just in that custom context table).
That makes me worried about how long the modeling scripts will take to run as our data scales up (this is only about a month's worth of data so far). Given that for our data modeling we really only need to be able to join the web_page_id (from the web page context) to the AB test selections, it seems wasteful to also store those selections for every page ping. Is there a recommended alternative way to do this? I'm wondering if I can have Snowplow fire that context only on the initial page-view event and not on all of the subsequent page-ping events. Or should I model a separate event type altogether that captures the user's AB test variant selections but only fires once per page view? Or are there other suggestions for how to handle this?
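If the separate event type is the way to go, I'm imagining something along these lines (again, the schema name and fields are placeholders; on older versions of the JavaScript tracker trackSelfDescribingEvent is called trackUnstructEvent):

```javascript
// Alternative (sketch): keep trackPageView free of the AB test context
// and instead fire one self-describing event per page view that carries
// the variant selections, so page pings no longer repeat them.
window.snowplow('trackPageView');

window.snowplow('trackSelfDescribingEvent', {
  schema: 'iglu:com.acme/ab_test_selection/jsonschema/1-0-0', // placeholder
  data: {
    experimentId: 'homepage-hero', // placeholder
    variant: 'B'
  }
});
```

As far as I understand, with `contexts: { webPage: true }` set when the tracker is initialized, the web_page context gets attached to every event, so this event would still carry the page view id we need for the join.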