True multi-tenancy support in Snowplow


#1

We are considering using Snowplow for a multi-tenancy SaaS project. So far, app_id seems to be the property for differentiating events between clients. But app_id is just a field, and querying for a single client would not be efficient, since such a query requires scanning all rows.

Let’s say we have 100 customers: 99% of the rows are irrelevant to any one client’s query.
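A toy model of the concern (tenant names and volumes are illustrative): with a single shared table, a per-tenant filter still has to scan every row unless the table is organized by tenant.

```python
# Toy model of a shared events table: 100 tenants with equal event volume.
rows = [{"app_id": f"tenant_{i % 100}", "event": "page_view"} for i in range(10_000)]

# A per-tenant query scans all rows, but only 1% of them are relevant.
tenant_rows = [r for r in rows if r["app_id"] == "tenant_42"]

print(len(tenant_rows) / len(rows))  # 0.01 — the fraction of scanned rows actually needed
```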

One of the common practices for a multi-tenant service is to separate tenants into different “database schemas”, so that every table contains data belonging to a single client. A solution seems to be using different Iglu schemas for each client, with the same structure but a different vendor (so they will create different tables). But this comes with the redundancy of managing all those Iglu schemas. What do you think?
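To make the per-vendor idea concrete, here is a hedged sketch (the tenant names and the `checkout_step` schema are hypothetical, not Snowplow conventions): only the vendor segment of the self-describing schema key changes, yet Iglu treats the results as entirely distinct schemas, which is exactly the source of the maintenance redundancy.

```python
# Hypothetical tenants sharing one event structure; only the vendor differs,
# so Iglu sees distinct schemas and the loader creates distinct tables.
tenants = ["acme", "globex"]

def schema_uri(vendor: str) -> str:
    # Self-describing schema key format: iglu:vendor/name/format/version
    return f"iglu:com.{vendor}/checkout_step/jsonschema/1-0-0"

uris = [schema_uri(t) for t in tenants]
for uri in uris:
    print(uri)
```

Every structural change now has to be replicated across one schema per tenant.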


#2

I am dealing with the same decision myself. One way I see it: keep the main events table, and after the storage loader has run, sync the events table into tenant-specific events tables. That way queries work the same across all clients, but each one only hits the data for that client. In addition, I think this allows for easier customization down the road if you need to join in specific tables for only some clients.
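A minimal sketch of that post-load sync step, assuming tenant-named schemas and Snowplow's atomic.events layout (the schema names and the incremental predicate are my assumptions, and a real job should use safely quoted identifiers rather than string interpolation):

```python
# Generate the per-tenant sync statement: copy each tenant's new events from
# the shared atomic.events table into that tenant's own schema.
def sync_statement(tenant: str) -> str:
    return (
        f"INSERT INTO {tenant}.events "
        f"SELECT * FROM atomic.events "
        f"WHERE app_id = '{tenant}' "
        f"AND collector_tstamp > "
        f"(SELECT COALESCE(MAX(collector_tstamp), '1970-01-01') FROM {tenant}.events);"
    )

print(sync_statement("tenant_a"))
```

Running one such statement per tenant after each load keeps the tenant tables incrementally up to date with the shared table.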


#3

Hi @orcuna, this is a great topic, thanks for raising.

I think it’s helpful first to define what we mean by multi-tenancy (“MT”), given that it’s quite an overloaded term. I like this answer to a question about MT:

Following the definitions in that answer, you are suggesting the second type of MT:

2. A shared database, separate schema.

First off: end-to-end multi-tenancy of Snowplow was never a design goal, although you are right that a lot of users have used the app_id to achieve the third type of MT:

3. A shared database, shared schema.

Even if you are happy with this approach, there are at least two areas where Snowplow makes end-to-end MT difficult:

  1. Enrichments. There is no concept in Snowplow of "only apply this enrichment configuration to events with this app_id". To put it another way, each enrichment is a singleton, with a single configuration that has to make sense for potentially any event in that Snowplow pipeline.
  2. Schemas. You can add an arbitrary number of Iglu schema registries to a Snowplow pipeline for schema resolution, but again you can’t sandbox these to an individual app_id, and performance would degrade if every lookup had to check 100 or 1,000 registries for any given schema.
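The registry-scaling point can be sketched as a toy resolver (the registry structure here is a simplification of mine, not Iglu's actual resolution logic): registries are consulted in priority order, so lookup cost grows with every registry that doesn't host the schema.

```python
# Toy resolver: 100 registries checked in priority order; the schema we want
# lives only in the last one, so the lookup consults every registry.
registries = [{"name": f"registry_{i}", "schemas": set()} for i in range(100)]
registries[-1]["schemas"].add("iglu:com.acme/checkout_step/jsonschema/1-0-0")

def resolve(key: str):
    checked = 0
    for registry in registries:
        checked += 1
        if key in registry["schemas"]:
            return registry["name"], checked
    return None, checked

name, checked = resolve("iglu:com.acme/checkout_step/jsonschema/1-0-0")
print(name, checked)  # worst case: all 100 registries consulted for one schema
```

With one registry per tenant, every cache miss pays this linear cost.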

But of course it all depends on what you want to do. If you have a narrow scope for your project and want to use the same set of schemas and enrichments for N companies, the above won’t be issues.

Back to your suggestion:

A solution seems to be using different Iglu schemas for each client, with the same structure but a different vendor (so they will create different tables). But this comes with the redundancy of managing all those Iglu schemas.

I think this would be a lot of work to maintain, and it doesn’t solve how to handle a) the atomic.events table and b) all the various schemas that are generated by Snowplow trackers and enrichment.

The other challenge with the shared database / separate schema approach is that the load transaction becomes extremely long-running, because you are loading:

average number of tables loaded per customer * number of customers
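To put a rough number on it (the per-customer table count is an assumption for illustration; it depends on how many event and context tables your schemas generate):

```python
# Back-of-envelope size of one load transaction under shared-database /
# separate-schema multi-tenancy.
avg_tables_per_customer = 12   # assumed: atomic.events plus shredded tables
customers = 100

tables_per_load = avg_tables_per_customer * customers
print(tables_per_load)  # 1200 table loads inside a single transaction
```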

I think it’s worth stepping back and looking at what you are trying to achieve:

If you need customer-specific schemas or enrichments, and/or the ability for individual customers to do ad hoc analysis of their data individually, then set up an individual end-to-end Snowplow pipeline for each customer. The easiest way of managing all this is to become a Snowplow Managed Service reseller - get in touch if you are interested.

If you have shared schemas/enrichments across all customers, and some generalised aggregations that you want to apply to all customers, consider writing a multi-tenanted aggregation process in Spark and then loading the outputs for individual customers into either individual schemas in Postgres, or a shared schema. Our Data Engineering Services team is writing processes like these for some of our customers right now.
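The shape of such a multi-tenanted aggregation can be sketched in plain Python (a stand-in for the Spark job, not actual Spark code; in Spark this would be roughly a `groupBy("app_id")` aggregation, with each tenant's slice written to its own output location):

```python
from collections import defaultdict

# Aggregate a shared event stream by tenant; each tenant's result would then
# be loaded into its own Postgres schema (or a shared one) downstream.
events = [
    {"app_id": "tenant_a", "event": "page_view"},
    {"app_id": "tenant_a", "event": "page_view"},
    {"app_id": "tenant_b", "event": "page_view"},
]

counts = defaultdict(int)
for e in events:
    counts[e["app_id"]] += 1

for tenant, n in sorted(counts.items()):
    print(tenant, n)
```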

Anyway - a great topic and lots of food for thought. Let us know what you end up doing!