Can we use a Snowplow pipeline without a collector?

Hi,

Is there a way to send events from a tracker directly to a pubsub topic without setting up a collector?

Thanks

Hi @Hanumanth,

No, the collector is a required piece of infrastructure for a Snowplow pipeline. The trackers are designed to send data to collectors over HTTP - you couldn’t just point them at a pubsub topic, you would need to use a pubsub API to insert data to pubsub.
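
To make the HTTP point concrete, here is a minimal sketch of the kind of request a tracker sends to a collector. The hostname, app ID and tracker version are made up for illustration, and the `tp2` path, field names and `payload_data` schema version reflect my reading of the Snowplow tracker protocol, so check them against the protocol documentation rather than treating this as authoritative. The request is only built, not sent.

```python
import json
import urllib.request

# Hypothetical collector hostname, for illustration only.
COLLECTOR_URL = "https://collector.example.com/com.snowplowanalytics.snowplow/tp2"


def build_tracker_request(page_url: str, app_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST shaped like a tracker's page-view request."""
    payload = {
        # Assumed schema version; verify against the tracker protocol docs.
        "schema": "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",
        "data": [
            {
                "e": "pv",               # event type: page view
                "url": page_url,         # page being viewed
                "aid": app_id,           # application ID
                "p": "web",              # platform
                "tv": "example-0.0.1",   # tracker version (made up here)
            }
        ],
    }
    return urllib.request.Request(
        COLLECTOR_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


request = build_tracker_request("https://shop.example.com/", "my-app")
print(request.get_method(), request.full_url)
```

The point is that the tracker speaks plain HTTP to a web endpoint it can reach from a browser or device. A pubsub topic is not that kind of endpoint.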

The collector does a few things that make it necessary to the system. It sets cookies and handles the HTTP protocol used by the clients. It provides the publicly available endpoint needed to receive data from user devices (the data has to be sent over the public internet, and I’m unsure what the security implications would be of making a pubsub topic available like this). It filters out some malformed events: when the format isn’t intelligible to the collector, the event is dumped to bad data. It also handles some nuanced things which are increasingly important for browsers, like CORS headers and domains (look into the recent browser privacy restrictions and first- vs third-party data collection for more detail on these issues).

Above all, the collector is the single most important piece of infrastructure in a pipeline from the point of view of availability and avoiding data loss. Generally a collector sits behind a load balancer and consists of multiple instances across multiple availability zones. It needs to be able to handle traffic spikes, network issues, cloud provider issues, and a host of other potential problems which might lead to data loss. In brief, without a high-availability collector in front of the pubsub topic, there is no assurance that you’ll actually receive the data.

If those things aren’t important considerations for the use case, then, instead of asking ‘can I use Snowplow without a collector?’, the question I would ask is ‘should I be using Snowplow for this use case?’.

I hope that all makes sense! Let us know if you have more questions. :)

Hi @Colm ,

Thank you for the detailed information. Could you please help me with the point below?
“you couldn’t just point them at a pubsub topic, you would need to use a pubsub API to insert data to pubsub”

So, are you saying that we can use a pubsub API to send events directly to a pubsub topic? If yes, can you help me with that?

Thanks

I think what I was trying to say there may have been unclear, so let me try to clarify:

Snowplow is not designed to enable sending data directly to pubsub from a client. I’ve outlined the reasons for that in my previous message. It is simply a use case that is outside the scope of what Snowplow does - so there’s not really much help I can offer in doing it.

In general, if you need to get data directly into pubsub, there are pubsub APIs that are for that exact purpose - obviously using those is a matter for Google/their documentation.
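
Purely for illustration, a publish call against the Google Cloud Pub/Sub REST API looks roughly like the sketch below. The project, topic and token values are placeholders, and `build_publish_request` is a hypothetical helper name, not something from Snowplow or Google. It only builds the request and does not send it. Google’s documentation remains the reference for the real thing.

```python
import base64
import json
import urllib.request


def build_publish_request(
    project: str, topic: str, event: dict, access_token: str
) -> urllib.request.Request:
    """Build (but do not send) a Pub/Sub REST publish request for one message."""
    url = f"https://pubsub.googleapis.com/v1/projects/{project}/topics/{topic}:publish"
    # The REST API expects the message payload as a base64-encoded string.
    encoded = base64.b64encode(json.dumps(event).encode("utf-8")).decode("ascii")
    body = {"messages": [{"data": encoded}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # The caller must present Google credentials (placeholder token here).
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


event = {"event": "page_view", "url": "https://shop.example.com/"}
publish_request = build_publish_request("my-project", "my-topic", event, "PLACEHOLDER_TOKEN")
print(publish_request.get_method(), publish_request.full_url)
```

Note that the request has to carry Google credentials, which ties back to the security question I raised earlier about exposing a topic to user devices.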

If what you’re asking is ‘how do I use a Snowplow tracker, but use it to send data directly to pubsub?’ then the answer is that you can’t. If you absolutely must do things that way, then all the relevant code is open-source, so you’re welcome to fork it and try to build what you need. However, that is not really something we can invest time in supporting or offering help with.

That said, perhaps we can still be helpful. I don’t understand why you’re asking about this very specific approach. From where I stand, the approach you’re focused on sounds like a lot of work and a lot of maintenance problems (all in the name of disimproving the system as a whole). Perhaps those problems can be avoided and the goal still achieved.

I think I’ve given as good an account as I can of why the collector is important and what it does - perhaps you can explain to me why those considerations don’t matter, and what the actual requirement here is? Why do you want to forgo the collector? What end goal are you trying to achieve?