Our setup is: StreamCollector > Raw Stream > Kinesis LZO S3 Sink > EmrEtlRunner > StorageLoader > Redshift.
The collector format is “thrift”.
We’re getting a large volume of data into our collector, which then gets pushed into the “raw/in” S3 folder. When we run EmrEtlRunner, it cannot move data from “raw/in” into “raw/processing” faster than new data arrives in “raw/in” from the collector. As a result, it gets stuck in the “staging” step and never progresses to the EMR stage of the batch pipeline.
- Is this setup correct?
- The documentation on “2-Using-EmrEtlRunner” states that it can run in timespan mode only for the “cloudfront” collector format, which doesn’t work for us (since we’re using “thrift”).
- Can timespan mode be made to support thrift format?
- Can the timespan be made more granular, to include hours and minutes?
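
To make the last two questions concrete, here is a rough sketch of the behaviour we have in mind: staging only the files that arrived before a cutoff given down to the minute, so the staging step works on a bounded set instead of chasing new arrivals. This is not how EmrEtlRunner works today; the function name `select_keys_before`, the file names, and the idea of filtering on last-modified time are all our own assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Tuple


def select_keys_before(
    objects: Iterable[Tuple[str, datetime]], cutoff: datetime
) -> List[str]:
    """Return the keys last modified strictly before `cutoff`, sorted by name.

    `objects` is a hypothetical listing of (key, last_modified) pairs for
    the "raw/in" folder; a real run would get it from an S3 listing.
    """
    return sorted(key for key, modified in objects if modified < cutoff)


# Hypothetical listing of "raw/in"; the file names are made up.
listing = [
    ("raw/in/file-a.lzo", datetime(2016, 1, 1, 9, 15, tzinfo=timezone.utc)),
    ("raw/in/file-b.lzo", datetime(2016, 1, 1, 9, 45, tzinfo=timezone.utc)),
    ("raw/in/file-c.lzo", datetime(2016, 1, 1, 10, 5, tzinfo=timezone.utc)),
]

# Stage only what arrived before 10:00 UTC; later files wait for the next run.
cutoff = datetime(2016, 1, 1, 10, 0, tzinfo=timezone.utc)
to_stage = select_keys_before(listing, cutoff)
print(to_stage)
```

With a cutoff like this, anything the collector writes after the cutoff would simply be picked up by the next run.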