Scala Stream Collector - scaling

kjcsb · August 17, 2016, 10:00am

Just getting started with Snowplow - congratulations, it looks amazing.

What are the recommendations for running the Scala Stream Collector at scale? I see the Clojure Collector has a recipe for Elastic Beanstalk. Does the same approach apply to the Scala Collector?

alex · August 17, 2016, 10:24am

Hi @kjcsb - it’s a good question.

With the AWS real-time pipeline, you have lots of workers running all the time - not just the collectors but also Stream Enrich, ES Sink, S3 Sink etc. We have found that Elastic Load Balancers plus Auto-Scaling Groups have been a good fit for these - using these directly has all the upside of Elastic Beanstalk but with less magic to go wrong.

Shin · August 17, 2016, 10:50am

I’ve been meaning to ask the same question for a while but in terms of Enrich and the other Kinesis apps.

Autoscaling the collectors makes sense to me because they all write to the same stream (so I just need to make sure there’s enough capacity). But how does running multiple workers of Enrich work?

Is it as simple as making sure I shard the Kinesis streams, run multiple workers and let the KCL do the rest?

alex · August 17, 2016, 11:43am

Hi @Shin,

Pretty much this - though substitute “servers” for “workers.” You run one KCL instance (i.e. one Stream Enrich or similar) per server, but you may have more than one worker inside each KCL instance. You can have more workers than shards, but each shard is processed by at most one worker at a time, so any workers beyond the shard count sit idle.
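To make the shard/worker relationship concrete, here is a toy sketch in Python. It is not the KCL's actual lease-balancing logic (the real library coordinates leases through a DynamoDB table); the function names and the server names are made up for illustration. It only shows the invariant that every shard has exactly one owner, and that surplus workers end up with nothing to do.

```python
def assign_shards(shard_ids, worker_ids):
    """Toy round-robin assignment: each shard gets exactly one owner.

    Assumption: this is a simplified stand-in, NOT the KCL's real
    lease-balancing algorithm. It only illustrates that a shard is
    owned by a single worker at any one time.
    """
    if not worker_ids:
        raise ValueError("need at least one worker")
    owners = {}
    for i, shard in enumerate(sorted(shard_ids)):
        owners[shard] = worker_ids[i % len(worker_ids)]
    return owners


def shards_per_worker(owners, worker_ids):
    """Invert the mapping; workers without a shard keep an empty list."""
    load = {w: [] for w in worker_ids}
    for shard, worker in owners.items():
        load[worker].append(shard)
    return load


# Hypothetical setup: a 2-shard stream and 3 Stream Enrich servers.
shards = ["shard-0", "shard-1"]
workers = ["enrich-a", "enrich-b", "enrich-c"]

owners = assign_shards(shards, workers)
load = shards_per_worker(owners, workers)
print(load)
```

With two shards and three servers, the third server holds no shard. It is useful as a standby if another server dies, but it adds no throughput until the stream is resharded.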

The whole thing is a bit more complicated than it should be - we have developed an in-house scaling and monitoring platform for real-time called Tupilak, which we hope to open-source later this year. We’ll do a preview post on this new tech (it’s pretty exciting) in a month or so…

kjcsb · August 18, 2016, 6:52pm

Thanks, that clarifies it.

spatialy · January 25, 2017, 7:51pm

Hi Alex

Any updates on the Tupilak release?

We are running tests with Snowplow, and we are sure that in production we will need a similar solution to manage scaling.

Best

alex · January 25, 2017, 10:37pm

Hi @spatialy,

We’ve been using Tupilak in production with our Managed Service RT customers since last year - it’s been working well. You can find out more about Tupilak here:

Tupilak is one of the core components of the Managed Service RT so it’s unclear to us at this point if/when we’ll open-source it.

spatialy · January 25, 2017, 10:38pm

Hi @alex

Thanks for the info
