Configuring Batch + Real-time Pipelines in Parallel


#1

My goal is to set up the batch + real-time pipelines in parallel, so that we can access the real-time event data in Elasticsearch while also loading it via batches into Redshift.

From my research (across Discourse and the legacy Google Group), it appears that this is possible by doing the following:

  • Set up the Scala Stream Collector, writing to a Kinesis stream (see the sketch after this list)
  • [batch] Set up the Kinesis LZO S3 sink to write the raw events to S3
  • [batch] Configure EmrEtlRunner / StorageLoader to enrich + load the Thrift records from S3
  • [real-time] Set up a parallel stream enrichment process
  • [real-time] Set up the Kinesis Elasticsearch Sink
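
For context, here is a minimal sketch of the collector configuration I have in mind (HOCON). The stream names are placeholders and the exact key names vary between collector versions, so please treat this as illustrative only:

    collector {
      interface = "0.0.0.0"
      port = 8080

      sink {
        enabled = "kinesis"

        kinesis {
          region = "us-east-1"

          # This single raw stream would feed both the batch S3 sink
          # and the real-time enrichment process
          stream {
            good = "snowplow-raw-events"
            bad  = "snowplow-raw-bad"
          }
        }
      }
    }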

Is this correct?


#2

Hi @tfinkel,

Exactly right you are!

Bear in mind that this scenario implies essentially two enrichment processes (one for each pipeline). We are working on “unifying” this step so that it is common to both pipelines running in parallel.

Regards,
Ihor


#3

Thanks @ihor, I’m glad to hear it.

Is there an example of the configuration for this process? Essentially, I’m trying to understand how to feed each enrichment process (batch + real-time) from the same collector output.

Thanks again!


#4

@tfinkel,

Hopefully, the “workflow diagram” below, showing one of the typical scenarios, will clarify it.

Stream Collector
     |-> Kinesis Raw Events Stream
          |-> [BATCH]: Kinesis S3 -> Amazon S3 (raw) -> EMR -> Amazon S3 (shredded) -> Redshift
          |                                \ Orchestrated by EmrEtlRunner & StorageLoader /
          |-> [REAL-TIME]: Kinesis Enrich 
                               |-> Bad Events Stream -> Kinesis Elasticsearch Sink -> Elasticsearch
                               |-> Enriched Events Stream -> Kinesis Elasticsearch Sink -> Elasticsearch
                                       |-> (optional) Custom stream job (ex. AWS Lambda)

In short,

  • The raw events stream is common to both pipelines
  • Batch: the standard batch pipeline, with the raw:in bucket (in config.yml) pointing to Amazon S3 (raw) in the diagram; see the first sketch below
  • Real-time: you need at least 2 streams (for enriched and bad events); see the second sketch below
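
On the batch side, a trimmed sketch of the relevant config.yml section (bucket paths are placeholders; most other settings are omitted, and depending on your release raw:in may be a single path rather than a list):

    aws:
      s3:
        region: us-east-1
        buckets:
          raw:
            # The bucket the Kinesis LZO S3 sink writes to,
            # i.e. "Amazon S3 (raw)" in the diagram above
            in:
              - s3://my-snowplow-raw
            processing: s3://my-snowplow-raw/processing
            archive: s3://my-snowplow-archive/raw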
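
On the real-time side, a sketch of the Stream Enrich stream settings (HOCON; stream names are placeholders and exact keys vary by version):

    enrich {
      streams {
        in {
          # The same raw stream the Kinesis LZO S3 sink consumes
          raw = "snowplow-raw-events"
        }

        out {
          # Each of these streams is read by its own
          # Kinesis Elasticsearch Sink instance
          enriched = "snowplow-enriched-events"
          bad      = "snowplow-bad-events"
        }
      }
    }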

–Ihor


#5

Got it – we can run both the batch and real-time processes on all of our data from a single Kinesis raw events stream.

Thanks for the clarification!