Snowplow Ruby Tracker with Async Emitter


#1

Hi,

I’m Lucas Souza, CTO of GetNinjas.

Currently, we are using a complex architecture in order to avoid using the Snowplow Ruby Tracker with the async emitter. So why do we do that?

Basically, because we have a Rails application running on a Unicorn web server. Since Unicorn runs multiple worker processes, and those can be killed when they hit a given timeout, we are afraid of losing events.

Our current architecture writes events to a file; Fluentd reads that file and sends the events to SQS; finally, a multi-threaded application written in Ruby pulls them from SQS and sends them to Snowplow.

It’s important to say that real-time events are a requirement for us. We are not achieving that with this architecture, and I think it simply has too many steps.

We would like to know whether any of you has a better idea for solving this problem. Have you tried the Ruby async emitter with a Rails application running under Unicorn? Or do you have another architecture in mind for sending real-time events from backend applications?

Best,


#2

Hi Lucas,

That does sound like a complicated pipeline! We have been mulling a couple of alternative collection architectures recently - primarily for the PHP Tracker but it sounds like they could work well for Rails/Unicorn.

Option 1: Socket collector

  • Adding a socket emitter to a given tracker
  • Writing a socket collector (probably in Golang or Rust) which listens on the socket and writes the events to Kinesis/Kafka/maybe S3, in our standard format
  • Obviously the socket collector stays behind your firewall

Option 2: Golang tracking daemon

  • Adding a socket emitter to a given tracker
  • Writing a Golang daemon that runs on each box
  • The Golang daemon embeds our Snowplow Golang Tracker
  • The Golang daemon will cache events in e.g. RocksDB
  • The Golang daemon will then send the events out to the regular HTTP collector

Do either of these sound interesting - does the community have some other ideas?


#3

Hi Alex,

Thank you for your fast reply,

I liked the second option more than the first one. But I have some questions:

  • Which socket emitter do you have in mind? I saw some implementations using Redis behind the scenes.
  • Does the Snowplow Golang Tracker not have a batch option?

Best,


#4

Another option is to keep using Fluentd to read the tracking files and to write a Fluentd output plugin that sends the events to Snowplow.

What do you think?


#5

If you are already invested in Fluentd, then yes that option could work too; I don’t think you’d want to introduce Fluentd just for this use case though.

On the socket emitter - I just mean writing low-level TCP socket code to emit the events.

The Golang Tracker would be extended and then embedded in a long-running daemon which would handle the batching, storing and sending of Snowplow events. Some ASCII art:

Rails process + Snowplow Ruby Tracker ---socket--> Golang daemon + Golang Tracker ---http--> Snowplow collector
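
To make the left-hand side of that diagram a bit more concrete, here is a minimal sketch of what a socket emitter could look like. It is not part of the Snowplow Ruby Tracker: the class name `SocketEmitter`, the `input` method, the newline-delimited JSON framing, and the default host and port are all assumptions made for illustration.

```ruby
require 'socket'
require 'json'

# Hypothetical emitter (not part of the Snowplow Ruby Tracker): writes each
# event payload as one JSON line to a daemon listening on a local TCP port.
class SocketEmitter
  def initialize(host: '127.0.0.1', port: 9090)
    @host = host
    @port = port
    @socket = nil          # opened lazily, on the first event
    @lock = Mutex.new      # one writer at a time per process
  end

  # Send one payload (a Hash of event fields).
  # Returns true on success, false if the daemon cannot be reached.
  def input(payload)
    line = JSON.generate(payload) + "\n"
    @lock.synchronize do
      attempts = 0
      begin
        attempts += 1
        @socket ||= TCPSocket.new(@host, @port)
        @socket.write(line)
        true
      rescue SystemCallError, IOError
        # Drop the broken connection and retry once with a fresh one.
        close_socket
        retry if attempts < 2
        false
      end
    end
  end

  def close
    @lock.synchronize { close_socket }
  end

  private

  def close_socket
    @socket&.close
  rescue StandardError
    nil
  ensure
    @socket = nil
  end
end
```

Because the connection is opened lazily on the first event, each Unicorn worker would end up with its own socket after forking rather than sharing one inherited from the master. What to do when `input` returns false (drop, log, or fall back to a file) is left open here.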

#6

Thanks for clarifying everything Alex,

Don’t you think that implementing a socket emitter inside the Snowplow Ruby Tracker is a kind of overhead? To me it looks easier to just send events over a TCP connection to a Golang daemon listening on it, which formats those messages in the Snowplow format, stores them in RocksDB and, finally, sends them to the HTTP collector:

Rails + TCP Socket (logstash, for example) --> Golang Daemon + Golang Tracker --> Snowplow HTTP Collector

What do you think?


#7

Are you talking performance overhead or cognitive overhead? I think the cognitive overhead of adding socket support to the Ruby Tracker is lower, because it means your client code is instrumented in the same way - using the standard Ruby Tracker API - whether you use a socket emitter or an HTTP emitter.

Performance-wise, I don’t see why there’d be any impact if the socket emitter was bundled as part of the Ruby Tracker versus being hand-rolled…

Maybe I’m misunderstanding your point?


#8

You got it. I was talking about cognitive overhead.

I agree with you when you say the client code is instrumented in the same way.

But my point is: does it make sense to have a socket that transfers data in Snowplow format and a Golang daemon that listens and transforms it before sending it to the HTTP collector?

Best,


#9

I guess we are saying that there are two options:

  1. Transfer the data in some kind of raw format to the Golang daemon, and the Golang daemon turns this format into the Snowplow Tracker Protocol
  2. Transfer the data, already in Snowplow Tracker Protocol format, to the Golang daemon, and the Golang daemon passes through the payloads as-is

Option 1 is slightly more efficient payload-wise, but it means that we have to create, document and maintain YAPF (yet another payload format), which isn’t something we have bandwidth to do. So I’d vote for option 2.
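
A rough illustration of why option 2 keeps the daemon simple: if every line arriving on the socket is already a complete payload, the daemon only has to split the stream on newlines and group the lines into batches, without parsing any fields. The sketch below is in Ruby purely for illustration (the daemon discussed here would be written in Go); `PassThroughBuffer`, the newline framing and the batch size are assumptions, not an existing API.

```ruby
# Hypothetical pass-through buffer: collects newline-terminated payloads
# from arbitrary socket chunks and hands them back in batches, never
# inspecting or rewriting the payload contents.
class PassThroughBuffer
  def initialize(batch_size: 3)
    @batch_size = batch_size
    @partial = +''   # bytes of a line that has not ended yet
    @pending = []    # complete payloads waiting for a full batch
  end

  # Feed a raw chunk read from the socket.
  # Returns an array of full batches (possibly empty).
  def feed(chunk)
    @partial << chunk
    batches = []
    while (idx = @partial.index("\n"))
      line = @partial.slice!(0..idx).chomp
      next if line.empty?
      @pending << line
      batches << @pending.shift(@batch_size) if @pending.size >= @batch_size
    end
    batches
  end

  # Return whatever is left, e.g. on a timer or at shutdown.
  def drain
    @pending.shift(@pending.size)
  end
end
```

Each batch is just an array of the original strings, so whatever the tracker serialized is what would be forwarded to the collector. Under option 1, the body of `feed` is where a raw-format parser and a conversion step would have to live.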


#10

Agree with you.

By the way, I’m trying the Fluentd option. I’ll report results on it today.

But if this doesn’t scale, your idea definitely looks very promising.


#11

Cool, keep us posted on how you get on!