Configuring Batch + Real-time Pipelines in Parallel


#1

My goal is to set up the batch + real-time pipelines in parallel, so that we can access the real-time event data in Elasticsearch while also loading it via batches into Redshift.

From my research (across Discourse and the legacy Google Group), it appears that this is possible by doing the following:

  • Set up the Scala Stream Collector, writing to a Kinesis stream (see the sketch after this list)
  • [batch] Set up the Kinesis LZO S3 sink to write the raw events to S3
  • [batch] Configure EmrEtlRunner / StorageLoader to enrich + load the Thrift records from S3
  • [real-time] Set up a parallel stream enrichment process
  • [real-time] Set up the Kinesis Elasticsearch Sink
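
For context, here is a minimal sketch of the collector configuration I have in mind (HOCON). The stream names are placeholders and the exact key names vary between collector versions, so please treat this as illustrative only:

    collector {
      interface = "0.0.0.0"
      port = 8080

      sink {
        enabled = "kinesis"

        kinesis {
          region = "us-east-1"

          # This single raw stream would feed both the batch S3 sink
          # and the real-time enrichment process
          stream {
            good = "snowplow-raw-events"
            bad  = "snowplow-raw-bad"
          }
        }
      }
    }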

Is this correct?


#2

Hi @tfinkel,

Exactly right you are!

Bear in mind that this scenario implies essentially two enrichment processes (one for each pipeline). We are working on “unifying” this step so that it is common to both pipelines running in parallel.

Regards,
Ihor


#3

Thanks @ihor, I’m glad to hear it.

Is there an example of the configuration for this process? Essentially, I’m trying to understand how to feed each enrichment process (batch + real-time) from the same collector output.

Thanks again!


#4

@tfinkel,

Hopefully, the “workflow diagram” below, showing one of the typical scenarios, will clarify it.

Stream Collector
     |-> Kinesis Raw Events Stream
          |-> [BATCH]: Kinesis S3 -> Amazon S3 (raw) -> EMR -> Amazon S3 (shredded) -> Redshift
          |                                \ Orchestrated by EmrEtlRunner & StorageLoader /
          |-> [REAL-TIME]: Kinesis Enrich 
                               |-> Bad Events Stream -> Kinesis Elasticsearch Sink -> Elasticsearch
                               |-> Enriched Events Stream -> Kinesis Elasticsearch Sink -> Elasticsearch
                                       |-> (optional) Custom stream job (ex. AWS Lambda)

In short,

  • The raw events stream is common to both pipelines
  • Batch: the standard batch pipeline, with the raw:in bucket (in config.yml) pointing to Amazon S3 (raw) in the diagram; see the first sketch below
  • Real-time: you need at least 2 streams (for enriched and bad events); see the second sketch below
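
On the batch side, a trimmed sketch of the relevant config.yml section (bucket paths are placeholders; most other settings are omitted, and depending on your release raw:in may be a single path rather than a list):

    aws:
      s3:
        region: us-east-1
        buckets:
          raw:
            # The bucket the Kinesis LZO S3 sink writes to,
            # i.e. "Amazon S3 (raw)" in the diagram above
            in:
              - s3://my-snowplow-raw
            processing: s3://my-snowplow-raw/processing
            archive: s3://my-snowplow-archive/raw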
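
On the real-time side, a sketch of the Stream Enrich stream settings (HOCON; stream names are placeholders and exact keys vary by version):

    enrich {
      streams {
        in {
          # The same raw stream the Kinesis LZO S3 sink consumes
          raw = "snowplow-raw-events"
        }

        out {
          # Each of these streams is read by its own
          # Kinesis Elasticsearch Sink instance
          enriched = "snowplow-enriched-events"
          bad      = "snowplow-bad-events"
        }
      }
    }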

–Ihor


#5

Got it – we can run both the batch and real-time processes on all of our data from a single Kinesis raw events stream.

Thanks for the clarification!