Scaling Kafka enricher

Hello everyone,

I am running a Snowplow pipeline with a Kafka enricher on a local instance, performing one API enrichment and one SQL enrichment. Is there a way to scale up a single enricher instead of running the process multiple times? I am trying to figure out how the enricher uses threads and whether this is configurable.

Thank you! :slight_smile:

Hi @bambachas79,

You can allocate more CPUs (= more threads) and more memory to your enricher, but bear in mind that with Kafka only one thread within a consumer group can consume a given partition of a topic. So even if you allocate 10 CPUs to your enricher, if the topic has only 2 partitions, only 2 threads will be consuming from Kafka. Scaling therefore depends not only on the resources allocated to the enricher but also on the number of partitions in your topic.
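To make that arithmetic concrete, here is a minimal Python sketch. The function name `active_consumers` is made up for illustration; it is not part of Snowplow or Kafka, and it only models the "one consumer per partition within a group" rule described above.

```python
def active_consumers(partitions: int, threads: int) -> int:
    """Number of threads that can consume at the same time.

    Within one consumer group a partition is read by at most one
    consumer, so any threads beyond the partition count stay idle.
    """
    return min(partitions, threads)


print(active_consumers(partitions=2, threads=10))    # 2
print(active_consumers(partitions=100, threads=10))  # 10
```

With 2 partitions, 8 of the 10 threads have nothing to read; with 100 partitions, the thread count becomes the limit instead.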

Please do not hesitate to ask if you need more help.

Thanks for your quick response @BenB!

In Kafka I am running 100 partitions. How can I allocate more CPUs to my enricher?

Having more than 100 CPUs on one machine seems like a lot, and it would make us lose the fault tolerance that we gain with Kafka and several instances consuming. Is there any particular reason why you want to run the enricher on only one machine instead of several?

How do you start your enricher?

I am running the command:

`java -jar snowplow-stream-enrich-kafka-1.0.0.jar --config kafka_enrich.conf --resolver file:resolver.json --enrichments file:custom_enrichments` 

To run the enricher with 10 cores, do I have to run the command 10 times? I want to run 10 enrichers on 10 machines.

When you use the java command, your Java application (stream-enrich) will automatically use all the CPU cores available on the machine to run its threads. So if your machine has 10 cores and doesn't do much other than run your app, all 10 cores will be used automatically.

So you just need to run java once on a machine that has 10 cores.

You need to run the java command once per machine.
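As a rough illustration of what that looks like with 100 partitions and 10 machines, here is a small Python sketch. It assumes all instances share the same Kafka consumer group and uses a simplified round-robin split, which is not Kafka's actual assignor; the `enricher-N` host names and the `split_partitions` helper are hypothetical.

```python
def split_partitions(num_partitions: int, instances: list) -> dict:
    """Spread partition ids over instances in round-robin order (illustrative only)."""
    assignment = {name: [] for name in instances}
    for partition in range(num_partitions):
        owner = instances[partition % len(instances)]
        assignment[owner].append(partition)
    return assignment


machines = [f"enricher-{i}" for i in range(10)]  # hypothetical host names
plan = split_partitions(100, machines)

# Each machine runs the java command once and ends up with 10 partitions.
for name, parts in plan.items():
    print(name, len(parts))
```

If one machine goes down, its partitions can be picked up by the remaining instances, which is the fault tolerance mentioned earlier in the thread.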

And in the case of 1 core? When I ran the java command twice on a single-core instance, I got twice as many messages as with a single java command. Why is that? I also gave the JVM the maximum heap size. Neither CPU nor RAM usage was high.