We have created a Firehose delivery stream that reads from the collector good stream and stores the data in S3. We store the data for two reasons:
a. To help us replay the data to EMR or enrich if changes to the downstream logic cause issues.
b. To use the data in the non-prod environment to build and test changes to the pipeline.
Currently Firehose writes the data as gzip files. Firehose does not have an option to convert the files to LZO and index them.
I downloaded the lzop utility and installed lzo-indexer with pip, then uncompressed the gzip file and ran it through the lzop and lzo-indexer utilities. When I tried to process the resulting data with the Snowplow EMR job, it did not produce any good data in either shredded or enriched.
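
The recompression step described above can be sketched roughly as follows in Python. This is a minimal sketch, not the exact commands that were run: the function names are hypothetical, it assumes the `lzop` binary is on the PATH and that the input file name ends in `.gz`, and it leaves out the indexing step because the lzo-indexer invocation is not given here.

```python
import gzip
import shutil
import subprocess
from pathlib import Path


def gunzip(src: Path) -> Path:
    """Decompress a .gz file next to itself and return the new path."""
    dst = src.with_suffix("")  # "events.gz" -> "events" (assumes a .gz suffix)
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def lzop_command(raw: Path) -> list:
    """Build the lzop call that writes <raw>.lzo beside the input file."""
    # -f overwrites an existing output file, -o names the output file.
    return ["lzop", "-f", "-o", str(raw) + ".lzo", str(raw)]


def recompress(src: Path) -> Path:
    """gzip -> raw -> lzo. Indexing is intentionally not covered here."""
    raw = gunzip(src)
    subprocess.run(lzop_command(raw), check=True)  # requires lzop installed
    return Path(str(raw) + ".lzo")
```

Note that this only changes the compression of the file. It does not change how the records inside the file are laid out, so it cannot by itself confirm that the EMR job will be able to read them.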