Partition Key for Kinesis


#1

Hey guys, i saw that as of R78 - Kinesis partition keys were set to random (unless specified for using IP address in config). I was wondering if you guys ever thought about using one of the ID’s instead, to guarantee event order by user.

I understand the argument for flooding by one user/IP/etc. packing a shard full, but seems like a high price to pay by randomly assigning and losing guaranteed order by user.

We were having a discussion similar to Uber/Lyft and taxi rides by user, and if events got out of sync, it would def. muddy up the waters on who rode in what cab to where and which driver.

I saw @alex had a thought of allowing a configurable partition key using Kinesis Tee - have you thought about or can you provide reasoning behind not having that option in stream enrich and sink?

Ref:


https://github.com/snowplow/kinesis-tee/issues/2


Help using value in enriched data as a key in kafka message
#2

Hi @13scoobie - making the partition key fully user-configurable in Stream Enrich sounds like a good idea - I’ve updated this ticket to factor in your suggestion:

https://github.com/snowplow/snowplow/issues/1924

Obviously caveat utilitor around hot spotting issues if you pick a bad partition key.

With the Scala Stream Collector - I’m not sure if there’s much value in making the partition key user configurable - there aren’t many options in the Thrift collector payload. Let me know if you disagree.

I was wondering if you guys ever thought about using one of the ID’s instead, to guarantee event order by user.

This doesn’t strictly guarantee event order - it just guarantees the order of events by when they arrived at the collector. The derived_tstamp is what really tells you the event order for a given user. You’d need some stateful stream processing with a time window to re-sort the events by derived_tstamp if sequence is super-important…


#3

Love this conversation! Sharing a common event ordering on the stream to ensure that we can run multiple systems and achieve the same result is a different problem from re-sorting by timestamp. In this case we want all activity from a specific user to end up in a single kinesis shard so that we can avoid conflict and manage the user’s activity synchronously when necessary. If the partition key for events is randomized, we will introduce race conditions and increase the complexity of the code.

If order can be guaranteed within a chosen partition key (like userID) then we can apply consistent business logic across applications to ensure that old timestamped events are processed identically and reach eventual consistency in our distributed systems.

Thanks!


#4

Thanks for clarifying that Rob, makes total sense…


#5

…and yes we hear you on the potential for hot shards, DDoS, and misbehaving users causing issues.


#6

I’m not sure order can be really guaranteed, as kinesis offers the “eventual consistency of at least once”. Meaning: the enricher can potentially process the message twice at different time and screw up the order no matter what. If you really need to make sure the order of events per user is correct, you would need something like apache flink behind the enricher, that has deduping and reordering capabilities at stream level.