Cutting one step from the real-time pipeline: stream-collector > kinesis > elasticsearch

aikeda · April 12, 2016, 1:27pm

Hi guys,

When implementing the real time pipeline, handling a lot of data, I ended up with this architecture:

AWS:

[Step 1] 
-> Load balancer ->

[Step 2] 
-> 3 collector instances ->

[Step 3] 
->  kinesis [6 shards for collectors output] ->

[Step 4]
-> 3 enrichment instances ->

[Step 5]
-> kinesis [6 shards for enrichment output] ->

[Step 6]
-> sink instance running 3 sink processes -> Elasticsearch [single node]

But during a debug session to identify where I was “losing data”, I realized that I could pipe the stdout of the collectors in [Step 2] directly into an enrichment process on the same instance. That would eliminate the Kinesis stream with 6 shards at [Step 3] and the 3 separate instances at [Step 4].

The output of the enrichment process is still sent to Kinesis, only because I can't send data directly to Elasticsearch when my input comes from stdin.
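To make the idea concrete, here is a minimal sketch of two processes joined by an OS pipe on one machine, so that the first stage's stdout becomes the second stage's stdin. The two commands are stand-ins written for this illustration (tiny Python one-liners), not the real collector or enrichment binaries, and `run_piped` is a hypothetical helper name.

```python
import subprocess
import sys

# Stand-in "collector" (assumption): prints three raw events, one per line.
COLLECTOR_CMD = [sys.executable, "-c",
                 "for i in range(3): print(f'event-{i}')"]

# Stand-in "enricher" (assumption): reads events from stdin and tags each one.
ENRICH_CMD = [sys.executable, "-c",
              "import sys\n"
              "for line in sys.stdin:\n"
              "    print('enriched:' + line.strip())"]


def run_piped(producer_cmd, consumer_cmd):
    """Run producer | consumer and return the consumer's output lines."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout,
                                stdout=subprocess.PIPE, text=True)
    # Close our copy of the pipe so the producer gets SIGPIPE
    # if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output.splitlines()


if __name__ == "__main__":
    for line in run_piped(COLLECTOR_CMD, ENRICH_CMD):
        print(line)
```

Note that nothing durable sits between the two stages here: the only buffer is the operating system's pipe, so if the second stage slows down or dies, the first stage is affected immediately.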

Does this make sense? What are the cons of this decision?

Thanks in advance,
André

josh · April 12, 2016, 2:01pm

Hi @aikeda,

There is nothing wrong with this approach if you are expecting fairly low event volumes through your pipeline. This is exactly the setup we use for Snowplow Mini, where we also pipe from Stream Enrich directly into the Elasticsearch Sink (this ability was added in r78). No Kinesis streams are used at all, and everything is contained on a single instance.

However, when you have sudden spikes of events, or just generally large volumes, the ability to distribute and scale the distinct applications in the pipeline independently is quite important. Piping makes sense at very low volumes, but in any other situation you run the risk of your stack failing because of back pressure on the other apps. The Kinesis stream lets you queue very large numbers of events, so a slow downstream app does not put back pressure on the one upstream of it.
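As a rough illustration of the back pressure point, here is a toy tick-by-tick model of a producer feeding a slower consumer through a buffer. It is only a sketch under stated assumptions: the function name `simulate`, the arrival numbers, and the buffer sizes are made up for the example, the small buffer stands in for a direct pipe, and the unbounded one stands in for a queue with plenty of headroom. It does not measure Kinesis or any Snowplow app.

```python
def simulate(arrivals, service_rate, buffer_capacity=None):
    """Toy model of a buffer between a producer and a consumer.

    arrivals        -- events produced on each tick
    service_rate    -- max events the consumer can process per tick
    buffer_capacity -- max events the buffer holds (None = unbounded)

    Returns (peak_backlog, overflow). `overflow` counts events that found
    the buffer full when they were produced; in a real pipeline these
    would stall the upstream app, or be lost if it cannot wait.
    """
    backlog = 0
    peak = 0
    overflow = 0
    for produced in arrivals:
        # Accept as many new events as the buffer has room for.
        if buffer_capacity is None:
            accepted = produced
        else:
            accepted = min(produced, buffer_capacity - backlog)
        overflow += produced - accepted
        backlog += accepted
        peak = max(peak, backlog)
        # The consumer then drains up to its per-tick limit.
        backlog -= min(backlog, service_rate)
    return peak, overflow


if __name__ == "__main__":
    spike = [10, 10, 100, 100, 10, 10]  # hypothetical traffic with a burst
    print("small buffer:", simulate(spike, service_rate=20, buffer_capacity=50))
    print("large queue: ", simulate(spike, service_rate=20))
```

With these made-up numbers the small buffer fills during the burst and 130 events find it full, while the unbounded queue peaks at a backlog of 180 events and turns nothing away. The consumer simply catches up later.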

If you would like to retain this ability to scale, I would suggest using something like Snowplow Mini for your testing/debugging and going back to your original setup for a production environment.

Hope that helps!

aikeda · April 12, 2016, 9:14pm

Thanks for your quick reply, @josh!
