Hi @orcuna, this is a great topic, thanks for raising it.
I think it’s helpful first to define what we mean by multi-tenancy (“MT”), given that it’s quite an overloaded term. I like this answer to a question about MT:
Following the definitions in that answer, you are suggesting the second type of MT:
2. A shared database, separate schema.
First off: end-to-end multi-tenancy of Snowplow was never a design goal, although you are right that a lot of users have used the `app_id` to achieve the third type of MT:
3. A shared database, shared schema.
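In practice the shared-database, shared-schema approach just means filtering every tenant-facing query on `app_id`. A minimal sketch, with SQLite standing in for the warehouse (the table layout and values here are made up for illustration):

```python
import sqlite3

# Type-3 multi-tenancy: one shared table, tenants separated
# only by the app_id column (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (app_id TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("client_a", "page_view"), ("client_a", "link_click"),
     ("client_b", "page_view")],
)

# Every query a tenant sees must filter on app_id.
rows = conn.execute(
    "SELECT event FROM events WHERE app_id = ? ORDER BY event",
    ("client_a",),
).fetchall()
print(rows)  # [('link_click',), ('page_view',)]
```

The weakness is that the isolation lives entirely in that `WHERE` clause, which is why the limitations below matter.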
Even if you are happy with this approach, there are at least two areas where Snowplow makes end-to-end MT difficult:
- **Enrichments.** There is no concept in Snowplow of "only apply this enrichment configuration to events with this `app_id`". To put it another way, each enrichment is a singleton, with a single configuration that has to make sense for potentially any event in that Snowplow pipeline.
- **Schemas.** You can add an arbitrary number of Iglu schema registries into a Snowplow pipeline for schema resolution, but again you can't sandbox these to an individual `app_id`, and performance would degrade if you had to check 100 or 1,000 registries for any given schema.
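For reference, registries are wired in through the Iglu resolver configuration, which is a flat, pipeline-wide list with no per-`app_id` scoping. A trimmed example (the second registry name and URI are placeholders):

```json
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": ["com.snowplowanalytics"],
        "connection": { "http": { "uri": "http://iglucentral.com" } }
      },
      {
        "name": "Company registry",
        "priority": 1,
        "vendorPrefixes": ["com.example"],
        "connection": { "http": { "uri": "http://iglu.example.com" } }
      }
    ]
  }
}
```

Every registry in this list is a candidate for every schema lookup in the pipeline, regardless of which `app_id` the event came from.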
But of course it all depends on what you want to do. If you have a narrow scope for your project and want to use the same set of schemas and enrichments for N companies, the above won’t be issues.
Back to your suggestion:
> A solution seems to be using different "Iglu schemas" for each client with same structure but different vendor (so they will create different tables). But this comes with the redundancy of managing that "Iglu schemas".
I think this would be a lot of work to maintain, and it doesn't solve how to handle a) the `atomic.events` table and b) all the various schemas that are generated by Snowplow trackers and enrichments.
The other challenge with shared database-separate schema is that you are going to make the load transaction extremely long in duration, because you are loading:
`average number of tables loaded per customer * number of customers`
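To put numbers on it (the figures below are purely illustrative, not measurements):

```python
# Illustrative figures: 20 self-describing event/context tables
# per customer, 500 customers sharing one database.
tables_per_customer = 20
customers = 500

# One load transaction has to touch every table for every customer.
tables_per_load = tables_per_customer * customers
print(tables_per_load)  # 10000
```

A transaction spanning thousands of tables is going to hold locks for a very long time, which is the core problem with shared database, separate schema here.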
I think it’s worth looking at what you are trying to achieve:
- If you need customer-specific schemas or enrichments, and/or the ability for individual customers to do ad hoc analysis of their own data, then set up an individual end-to-end Snowplow pipeline for each customer. The easiest way of managing all this is to become a Snowplow Managed Service reseller - get in touch if you are interested.
- If you have shared schemas/enrichments across all customers, and some generalised aggregations that you want to apply to all customers, consider writing a multi-tenanted aggregation process in Spark and then loading the outputs for individual customers into either individual schemas in Postgres, or a shared schema. Our Data Engineering Services team is writing processes like these for some of our customers right now.
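To give a feel for the second option, here is a sketch of the aggregation shape, with plain Python standing in for the Spark job; the event layout and the count metric are made up for illustration:

```python
from collections import Counter

# Hypothetical enriched events; in a real pipeline these would be
# rows in a Spark DataFrame, not a Python list.
events = [
    {"app_id": "client_a", "event": "page_view"},
    {"app_id": "client_a", "event": "page_view"},
    {"app_id": "client_b", "event": "page_view"},
]

# Multi-tenanted aggregation: one pass over all events, keyed by
# app_id - the same shape as df.groupBy("app_id", "event").count()
# in Spark.
counts = Counter((e["app_id"], e["event"]) for e in events)

# Each tenant's slice can then be loaded into its own Postgres
# schema, or into a shared schema keyed on app_id.
print(counts[("client_a", "page_view")])  # 2
```

The point is that the aggregation runs once over all tenants, and only the load step fans out per customer.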
Anyway - a great topic and lots of food for thought. Let us know what you end up doing!