Filtered topic streams


#1

Hello there

We have begun implementing Snowplow as our main consumer events pipeline at work.
Really happy with how far you can go with out-of-the-box configurations and guides. Kudos to the team and everyone involved.

One of the requirements we are fielding from the business right now is this: individual teams want a filtered view of the events they care about - in real time, of course.

The context is that we are using a ‘fat’ pipeline to process all the consumer events. Everything gets sunk into destinations like S3 and Google BigQuery, so filtering the batched data is a solved problem.
We would love to have a similar solution for a “filtered realtime view” of events flowing through.

I’d love to get your thoughts/ experiences around this topic.


#2

If you only need a subset of events (and you don’t want every team reading off the main stream), one option is to use Kinesis Tee or a similar Lambda processor to split the fat stream into individual topics, each filtered according to a condition.
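For a sense of what that splitter could look like, here is a minimal Python sketch of such a Lambda. The filtered stream name and the match condition are illustrative, not anything Snowplow ships; the only Snowplow-specific assumption is that enriched events arrive as tab-separated values with app_id in the first field.

```python
import base64
import boto3

kinesis = boto3.client("kinesis")

# Illustrative target - one filtered stream per team/condition.
FILTERED_STREAM = "team-a-filtered-events"


def matches(event_tsv):
    # Snowplow enriched events are tab-separated; app_id is the first
    # field. The condition itself is just an example.
    return event_tsv.split("\t")[0] == "team-a-app"


def handler(event, context):
    # Invoked via a Kinesis event source mapping on the fat stream.
    out = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        if matches(payload.decode("utf-8")):
            out.append({
                "Data": payload,
                "PartitionKey": record["kinesis"]["partitionKey"],
            })
    # PutRecords accepts at most 500 records per call.
    for i in range(0, len(out), 500):
        kinesis.put_records(StreamName=FILTERED_STREAM,
                            Records=out[i:i + 500])
```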

If you’re using Lambda, one thing you’ll have to be wary of is scaling the output streams to match the read/write requirements of each one (it’s easier to just set a conservatively high number of shards).


#3

Thanks @mike for sharing your thoughts.
We landed on ideas and conclusions similar to what you describe above.

I’ll keep you posted on how we progress.


#4

One of the ideas we are bouncing around in the team is to consume from the fat stream into multiple Kinesis Firehose delivery streams, each with a ‘processor’ Lambda that decides which records to ‘keep’ and delivers them to S3 in micro-batches. Individual teams could then listen to S3 notifications and consume the data as they see fit.

Does anyone see any immediate/obvious pitfalls with this approach?

The reason we are thinking about Firehose is to ensure there’s no back-pressure on the rest of the pipeline.
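For concreteness, the ‘processor’ would follow the standard Firehose data-transformation contract: every incoming record is returned with a result of Ok, Dropped or ProcessingFailed. A minimal sketch, where the keep() condition is just a stand-in for whatever each team’s actual filter would be:

```python
import base64


def keep(event_tsv):
    # Stand-in filter: keep events from one app_id (the first field of
    # a Snowplow enriched event TSV row).
    return event_tsv.split("\t")[0] == "checkout-app"


def handler(event, context):
    # Firehose transformation Lambda: each record in event["records"]
    # must come back with its recordId and a result.
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        if keep(payload):
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": record["data"],  # deliver unchanged
            })
        else:
            output.append({
                "recordId": record["recordId"],
                "result": "Dropped",
            })
    return {"records": output}
```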