My goal is to set up the batch and real-time pipelines in parallel, so that we can access real-time event data in Elasticsearch while also batch-loading events into Redshift.
From my research (across Discourse and the legacy Google Group), it appears this is possible by doing the following:
- Set up the Scala Stream Collector to write raw events to a Kinesis stream (first sanity-check sketch after this list)
- [batch] Set up the Kinesis LZO S3 Sink to write the raw events to S3
- [batch] Configure the EMR ETL Runner / StorageLoader to enrich the Thrift records from S3 and load them into Redshift
- [real-time] Set up a parallel Stream Enrich process that reads the raw Kinesis stream and writes enriched events to a second stream
- [real-time] Set up the Kinesis Elasticsearch Sink to load that enriched stream into Elasticsearch (second sketch after this list)
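To sanity-check the collector before wiring up the rest, I was planning to pull a few records straight off the raw stream with boto3. This is only a sketch: `snowplow-raw-good` is a placeholder for whatever stream name the collector config actually points at, and the payloads will be Thrift-serialized rather than human-readable, so it just prints sizes.

```python
# Sketch: confirm the Scala Stream Collector is writing to the raw stream.
# Assumes AWS credentials are already configured; "snowplow-raw-good" is a
# placeholder stream name.
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Read from the first shard only; fine for a smoke test on a fresh stream.
shard_id = kinesis.describe_stream(StreamName="snowplow-raw-good")[
    "StreamDescription"]["Shards"][0]["ShardId"]

iterator = kinesis.get_shard_iterator(
    StreamName="snowplow-raw-good",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest record
)["ShardIterator"]

records = kinesis.get_records(ShardIterator=iterator, Limit=5)["Records"]
print(f"{len(records)} record(s) on the raw stream")
for r in records:
    # Raw collector payloads are Thrift-serialized, so just show sizes.
    print(r["SequenceNumber"], len(r["Data"]), "bytes")
```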
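And on the real-time side, once the Elasticsearch Sink is running, a quick count against the enriched-events index should show documents arriving. Again a sketch: `snowplow` is a placeholder index name, and the host/port would need to match the actual cluster.

```python
# Sketch: confirm the Kinesis Elasticsearch Sink is indexing enriched events.
# "snowplow" is a placeholder index name; adjust host/port to the cluster.
import json
import urllib.request

ES_COUNT_URL = "http://localhost:9200/snowplow/_count"

with urllib.request.urlopen(ES_COUNT_URL) as resp:
    count = json.load(resp)["count"]
print(f"{count} enriched event(s) indexed")
```

If both checks pass, I'd expect the batch and real-time branches to run fully decoupled, since each consumes from Kinesis independently.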
Is this correct?