Setting callbacks for session initialization

Question
Is there any way to hook (set callback function) for session_id initialization?

Background
We want to send a field describing the client’s status from clients to our collector endpoint only when a session is initialized, to minimize the amount of data transferred, because the field is large (about 2 KB).

Although it is possible to send an event whenever the field is updated (for example, a ‘the_field_updated’ event), we think having the field on the first event of a session would let us add (enrich) the field to all events within the session via the enricher.

In this case, we are planning to create a custom enricher. The enricher will store a session_id and the field in a key-value store when it first receives the field for a session_id. It can then use the key-value store to enrich subsequent events by looking up the field’s value using the session_id as a key.
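Roughly, the enrichment logic we have in mind is sketched below. This is only illustrative: the in-process dict stands in for a real key-value store (Redis, DynamoDB, etc.), and the event shape and names are placeholders, not a real Snowplow enrichment API.

```python
# Illustrative sketch only: an in-process dict stands in for an external
# key-value store, and the event is a plain dict rather than a real
# Snowplow enriched event.
session_status: dict[str, str] = {}

def enrich(event: dict) -> dict:
    session_id = event.get("session_id")
    if session_id is None:
        return event
    if "client_status" in event:
        # First time we see the field for this session: remember it.
        session_status[session_id] = event["client_status"]
    else:
        # Subsequent event in the session: look the field up and attach it.
        status = session_status.get(session_id)
        if status is not None:
            event["client_status"] = status
    return event
```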

So the session ID will be established on the creation of the first event in the session. I don’t see a straightforward way to instrument a callback for that, but it’s probably within the art of the possible.

However, there’s another pitfall here, which is worth considering. The order in which the events are processed is not guaranteed, and is in fact quite likely to be out of order.

The events may not be sent from the client to the collector in order (when there’s a spotty connection, for example). After they reach the collector, they are processed by several multi-threaded, asynchronous systems: Kinesis streams with multiple shards, and collector and enrich apps with multiple instances. Even within a single instance, each component is multi-threaded, and there are no guarantees around event sequencing.

For these reasons, I don’t believe the enricher is the best place for this logic. The only way I can see to ensure the data is joined up when events arrive out of order is either to send the field with every event, or to somehow wait for the value when it isn’t there, which would have detrimental effects on the pipeline.

Generally when I’ve needed logic where events depend on other events, I have found that post-enrichment is the solution.

A simple example of this is to send the client status data as its own custom event, and join it to the other events in the session using SQL once it’s in the warehouse. If it needs to be real-time then it’s more complicated, but an app that consumes the enriched stream is at liberty to implement this more complex logic without affecting the other services in the pipeline.
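To illustrate the warehouse approach, the join could look something like the sketch below. I’m assuming the status event lands in its own table; the table name and status column are placeholders, while domain_sessionid is Snowplow’s session identifier in atomic.events.

```python
# Sketch of the post-load join: attach the client status (sent once per
# session as its own event) to every event in that session. Table and
# status column names are illustrative.
SESSION_STATUS_JOIN = """
SELECT
    e.*,
    s.client_status
FROM atomic.events AS e
LEFT JOIN atomic.client_status AS s
    ON e.domain_sessionid = s.domain_sessionid;
"""
```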

I hope that’s helpful.

Hi @shimpeko
Welcome to the Snowplow community.
Kind regards,
Eddie

@Colm Hello! Thank you very much for the quick and detailed reply! The comment about ordering guarantees is particularly helpful.

May I ask some questions about the ordering guarantee?

Question 1: I see an option named “useIpAddressAsPartitionKey” for the collector + Kinesis, which seems to guarantee the order of events with the same IP. If we set this option to true, can we rely on the order of events from the same IP address? Is my understanding correct? And is it reasonable, in the real world, to assume that a client keeps the same IP address for the duration of a session?

Question 2: Does the Snowplow Collector provide a similar option for Kafka?

Thanks,
Shimpeko

Neither of these is my area, so I’m not entirely certain of the specifics, however:

Question 1: I see an option named “useIpAddressAsPartitionKey” for the collector + Kinesis, which seems to guarantee the order of events with the same IP. If we set this option to true, can we rely on the order of events from the same IP address? Is my understanding correct? And is it reasonable, in the real world, to assume that a client keeps the same IP address for the duration of a session?

I don’t believe this provides anything like a guarantee of the order of events. The partition key just determines which shard an event is allocated to, so all this would do is ensure that this allocation is consistent for events with the same partition key. (Note, however, that how data is allocated to shards is down to AWS internals, and since shard counts can change, I don’t know what guarantees exist there.)

None of that says anything about the order of events.
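To make that concrete, here is roughly what a write to the raw stream amounts to (a sketch using boto3; the stream name and values are placeholders). The partition key is hashed to choose a shard, and that is all it does:

```python
import boto3

# Sketch of a Kinesis write. The partition key is hashed to pick a shard,
# so records with the same key land on the same shard. That is a routing
# property, not an end-to-end ordering guarantee.
kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="raw-events",        # placeholder stream name
    Data=b"...serialized event...",
    PartitionKey="203.0.113.7",     # e.g. the client IP when the option is enabled
)
```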

Question 2: Does the Snowplow Collector provide a similar option for Kafka?

I’m not sure of this.

However, on both counts: even if assurances of ordering from the collector to the raw stream did exist, you still wouldn’t have any assurance that the data would be in order in the way your solution presupposes. There is no way to ensure that the data arrives at the collector in order, so before you even reach those settings, your data will already be out of order.

It is quite common for a client connection to be unavailable for the first few events in a session, which then get forwarded to the collector later, when the connection is restored.

The best advice I can give on this topic is to reconsider your proposed architecture. The enricher is not suited to cross-event operations, but I haven’t yet come across a cross-event requirement that can’t be met by some post-enrich process.

I hope that’s helpful.


Sorry I haven’t replied for a while, and thank you again for the detailed and quick response.

I agree that it does not seem possible to guarantee the order of events, and that we should avoid implementing enrichments that rely on event order in our data pipeline.

I don’t believe this provides anything like a guarantee of the order of events. The partition key just determines which shard an event is allocated to, so all this would do is ensure that this allocation is consistent for events with the same partition key.

I’m still not 100% sure, but I think Kinesis does guarantee the order of events at the shard level, though not at the stream level: Amazon Kinesis and guaranteed ordering - Stack Overflow
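For reference, the per-shard guarantee shows up in the Kinesis API as SequenceNumberForOrdering, which orders a write after a previous write on the same shard. A boto3 sketch (stream and key names are placeholders; I don’t know whether the Snowplow collector actually uses this):

```python
import boto3

# Per-shard ordering sketch: chaining the second put after the first via
# SequenceNumberForOrdering. This only helps when a single producer writes
# the related records, which is why multiple collector instances break it.
kinesis = boto3.client("kinesis")
first = kinesis.put_record(
    StreamName="raw-events", Data=b"event-1", PartitionKey="session-123"
)
kinesis.put_record(
    StreamName="raw-events",
    Data=b"event-2",
    PartitionKey="session-123",
    SequenceNumberForOrdering=first["SequenceNumber"],
)
```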

But even if a shard guarantees the order of events, I think scaling the collector out to two or more instances will break the guarantee, as we cannot ensure the order of writes across collectors. Also, as you pointed out, operations like changing the number of shards seem to break the guarantee.

Thank you very much.
shimpeko


Thanks for linking that, good to know! I’m very used to designing for unordered data, so I’m not so familiar with these features.

Glad I could be helpful, hope you find a workable solution. 🙂