When implementing a real-time pipeline that handles a lot of data, I ended up with this architecture:
AWS: [Step 1] -> Load balancer -> [Step 2] -> 3 collector instances -> [Step 3] -> Kinesis [6 shards for collector output] -> [Step 4] -> 3 enrichment instances -> [Step 5] -> Kinesis [6 shards for enrichment output] -> [Step 6] -> sink instance, 3 processes sinking into -> Elasticsearch [single node]
But during a debugging session to find out where I was “losing data”, I realized that I could send the stdout of [Step 2] directly to the enrichment process on the same instance, cutting [Step 2], the Kinesis stream with 6 shards at [Step 3], and the 3 instances for [Step 4].
The output of the enrichment process is still sent to Kinesis, only because I can't send data directly to Elasticsearch when my input comes from stdin.
Does this make sense? What are the downsides of this decision?
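To make the stdout-to-stdin wiring concrete, here is a minimal sketch of the general idea, not the actual setup. The function name `pipe_processes` and the two stand-in commands are hypothetical; in practice they would be replaced by the real collector and enrichment commands, whose exact flags are not shown here.

```python
import subprocess
import sys


def pipe_processes(producer_cmd, consumer_cmd):
    """Run producer_cmd and feed its stdout into consumer_cmd's stdin.

    Returns (producer_exit_code, consumer_exit_code, consumer_stdout_bytes).
    """
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        consumer_cmd, stdin=producer.stdout, stdout=subprocess.PIPE
    )
    # Close our copy of the pipe so the producer sees a broken pipe
    # if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return producer.returncode, consumer.returncode, output


# Hypothetical stand-ins for the real collector and enrichment processes.
FAKE_COLLECTOR = [sys.executable, "-c", "print('event-1'); print('event-2')"]
FAKE_ENRICH = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: sys.stdout.write('enriched:' + line)",
]

if __name__ == "__main__":
    print(pipe_processes(FAKE_COLLECTOR, FAKE_ENRICH))
```

One thing this sketch makes visible: an OS pipe only holds a small in-memory buffer, so the producer blocks when the consumer is slow, and anything in flight is gone if either process dies.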
Thanks in advance.