Collector configurations


#1

Hi Folks,

We are planning to put a load balancer in front of two Scala collectors.
Our question is how to configure the load balancer.

  1. If we configure a round-robin pattern, requests from the same client can be distributed across both collectors. However, in this case both collectors will be writing to the same Kinesis stream. Won't the events go out of sequence? How can the order of events be maintained? Or doesn't it matter?

  2. Or do we have to ensure that requests from the same client go to the same collector, so they arrive in the order they were sent,
    while requests from another client can be routed to the second collector?


#2

Snowplow currently provides no ordering guarantees. Kinesis itself supports ordering within a shard of a Kinesis stream, but not across shards within the same stream.

The ordering guarantees don’t matter too much if you’re loading to a target like Redshift (where you can sort data) or BigQuery (where you can partition data by date/time), but they may affect you if you’re planning on doing some kind of stream processing that requires events to be in order, e.g. some sort of real-time sessionisation.
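To illustrate the per-shard ordering point: Kinesis maps each record's partition key to a shard by taking the MD5 hash of the key as a 128-bit integer and finding the shard whose hash key range contains it. The sketch below is only a rough model of that mapping. It assumes the shards split the hash space evenly (which may not hold after resharding), and `shard_for_key` and the client-id keys are hypothetical names used for illustration. Which partition key the collector actually uses is not covered in this thread.

```python
import hashlib

# Kinesis hash keys live in the range 0 .. 2**128 - 1.
HASH_SPACE = 2 ** 128


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard that would receive this partition key.

    Assumption: the stream's shards divide the hash key space evenly.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp to the last shard in case the space does not divide evenly.
    return min(hash_value // range_size, shard_count - 1)


# Records sharing a partition key always map to the same shard,
# regardless of which collector wrote them.
for key in ["client-a", "client-a", "client-b"]:
    print(key, "-> shard", shard_for_key(key, shard_count=2))
```

Records with different keys may land on different shards, and there is no ordering between shards, which is why consumers cannot rely on stream order across the whole stream.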


#3

To add to Mike’s answer: you’ll have dvce_created_tstamp and derived_tstamp in the data, both of which reflect when the event was created at the tracker level (sessionisation is generally done at the tracker level too, although you may have a use case for doing it manually).

These two timestamps preserve the order in which the events were created, so most use cases where event order matters are covered without needing the events to be processed in order. The tracker itself doesn’t care about the order in which it sends events, so even if you’ve set up the pipeline to preserve order, connectivity issues can cause events to be sent late or out of order (events are cached if the tracker can’t contact the collector).
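As a rough sketch of what that looks like downstream, events that arrive out of order can be re-sorted by these timestamps. Only the column names `derived_tstamp` and `dvce_created_tstamp` come from the data. The dict layout, the `name` field, the timestamp format, and the `order_events` helper are assumptions made for this example.

```python
from datetime import datetime


def order_events(events):
    """Sort events by derived_tstamp, using dvce_created_tstamp to break ties."""
    def sort_key(event):
        return (
            datetime.fromisoformat(event["derived_tstamp"]),
            datetime.fromisoformat(event["dvce_created_tstamp"]),
        )
    return sorted(events, key=sort_key)


# Hypothetical events listed in arrival order, which differs from creation order.
events = [
    {"name": "link_click", "dvce_created_tstamp": "2023-01-01 10:00:05.000",
     "derived_tstamp": "2023-01-01 10:00:05.200"},
    {"name": "form_submit", "dvce_created_tstamp": "2023-01-01 10:00:09.000",
     "derived_tstamp": "2023-01-01 10:00:09.200"},
    {"name": "page_view", "dvce_created_tstamp": "2023-01-01 10:00:00.000",
     "derived_tstamp": "2023-01-01 10:00:00.200"},
]

for event in order_events(events):
    print(event["derived_tstamp"], event["name"])
```

In a warehouse the same idea is just an ORDER BY on the timestamp column, so the load balancer strategy does not need to preserve order for this to work.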

Best,


#4

Thank you guys for the quick clarification.