We want to know how Kinesis Stream Enrich works internally. Does it work differently from EmrEtlRunner? With EMR we can scale by adding instances, but if Kinesis Stream Enrich performs the enrichments, how do we scale it?
Spark Enrich and Scala Stream Enrich share some code (scala-common-enrich), but they scale in fundamentally different ways. For Kinesis Stream Enrich you’ll want to scale up the number of shards, which increases both your write and read throughput. You’ll typically want to match the number of KCL workers (Kinesis Client Library, which Stream Enrich uses under the hood) to the number of shards you have.
There’s a nice example of the relationship between record processors, workers, instances and shards in the AWS documentation here.
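As a rough illustration of sizing the shard count for write throughput, here is a minimal sketch. It assumes the commonly documented per-shard write limits of about 1 MB/s and 1,000 records/s (treated here as 1,000,000 bytes, which is an approximation), and the function name and inputs are hypothetical, not part of Stream Enrich or the KCL.

```python
import math

# Assumed per-shard write limits for Kinesis Data Streams.
# Check the current AWS quotas before relying on these numbers.
SHARD_WRITE_BYTES_PER_SEC = 1_000_000
SHARD_WRITE_RECORDS_PER_SEC = 1_000


def shards_needed(records_per_sec: int, avg_record_bytes: int) -> int:
    """Estimate the minimum shard count for a given write load.

    Hypothetical helper: takes the larger of the record-count bound
    and the byte-rate bound, and never returns fewer than one shard.
    """
    by_count = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    by_bytes = math.ceil(
        records_per_sec * avg_record_bytes / SHARD_WRITE_BYTES_PER_SEC
    )
    return max(1, by_count, by_bytes)


if __name__ == "__main__":
    # Small events: the records/s limit is the binding constraint.
    print(shards_needed(5_000, 500))
    # Larger events: the bytes/s limit is the binding constraint.
    print(shards_needed(2_000, 2_000))
```

This only gives a lower bound from the write side. In practice you would leave headroom for traffic spikes, and enriched events are usually larger than raw ones, so the enriched stream may need more shards than the raw stream.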
So, if we have 200 shards on the Kinesis stream that the enrichers read from, do I need 200 enricher instances for best performance (1 enricher instance == 1 KCL worker)? Is there a way to increase the number of KCL workers per enricher instance on an EC2 instance?
One worker can process more than one shard, so you generally want fewer instances than shards. Exactly how many will depend on throughput, message size, desired latency, etc.
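To make the shard-to-worker relationship concrete, here is a small sketch of an idealised even split of shards across workers. It is an assumption-laden model: the KCL balances shard leases only approximately and over time, so real assignments can differ, and the function name is hypothetical.

```python
def ideal_shard_split(num_shards: int, num_workers: int) -> list[int]:
    """Return how many shards each worker would hold under a perfectly
    even split (hypothetical model, not the KCL's actual algorithm).

    Each shard is handled by one worker at a time, so any workers
    beyond the shard count end up with zero shards.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    base, extra = divmod(num_shards, num_workers)
    # The first `extra` workers take one additional shard each.
    return [base + 1 if i < extra else base for i in range(num_workers)]


if __name__ == "__main__":
    # 200 shards over 25 instances: 8 shards per worker.
    print(ideal_shard_split(200, 25))
    # More workers than shards: the surplus workers sit idle.
    print(ideal_shard_split(4, 6))
```

Under this model, 200 shards on 25 instances means each worker handles 8 shards. Whether one instance can keep up with 8 shards depends on the factors above (throughput, message size, desired latency), so it is worth measuring before settling on a ratio.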