We are currently running EmrEtlRunner (version 104) with the CloudFront collector.
We would like to move toward near-real-time event processing step by step, starting by replacing the CloudFront collector with the Scala Stream Collector.
I've set up the collector, and it writes to a Kinesis stream.
That stream is consumed by Kinesis Firehose, which saves the data to S3.
So far, everything is working.
Then I found out that the two collectors produce different record formats, and that I need to use the Kinesis LZO S3 Sink to consume the data from Kinesis Firehose and save it to S3 in a format that EmrEtlRunner can process.
I looked for the sink's documentation, but the repository no longer seems to exist.
Does anyone know whether there is a new project for this, or any other solution?
- We are trying to run everything in Docker containers, so a solution built with that in mind would be highly appreciated.
Sorry, I found the repository.
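
For anyone landing here later, a rough sketch of one part of the format problem mentioned above. This is not the sink's code and not its actual on-disk format. As far as I understand, Firehose concatenates incoming records into S3 objects without adding a delimiter, so binary records (such as the Thrift payloads the Scala Stream Collector emits) cannot be reliably split apart again unless something frames them. The example below uses a plain 4-byte length prefix purely for illustration; the names frame and unframe are made up for this example.

```python
import struct


def frame(records):
    """Prefix each record with its length as a 4-byte big-endian integer."""
    return b"".join(struct.pack(">I", len(r)) + r for r in records)


def unframe(blob):
    """Split a length-prefixed blob back into the original records."""
    records = []
    pos = 0
    while pos < len(blob):
        (size,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        records.append(blob[pos:pos + size])
        pos += size
    return records


# Two made-up binary records; note that one contains a newline byte,
# so a newline cannot serve as a safe record separator.
records = [b"\x0b\x00\x01event-one", b"\x0b\x00\x02event\ntwo"]

# Plain concatenation: the boundary between the records is lost.
concatenated = b"".join(records)

# Length-prefixed framing: the boundaries can be recovered exactly.
framed = frame(records)
assert unframe(framed) == records
```

The point is only that some framing layer has to sit between the stream and S3 when the records are binary; a dedicated sink provides one, whereas plain concatenation does not.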