Having trouble with the EMR loader consuming in stream mode


#1

This follows on from the comments I made in my thread "Confused about Stream Enrich -> S3Loader Step".

The S3 loader data is clearly arriving in the specified bucket, and my S3 loader is compressing with LZO (is this not supported?). However, running EmrEtlRunner reports that no run folders were found:

uri:classloader:/gems/avro-1.8.1/lib/avro/schema.rb:350: warning: constant ::Fixnum is deprecated
D, [2018-09-26T21:33:10.928287 #4550] DEBUG -- : Initializing EMR jobflow
E, [2018-09-26T21:33:13.490600 #4550] ERROR -- : No run folders in [s3://piv-stream-data-prod-bucket/] found
F, [2018-09-26T21:33:13.499486 #4550] FATAL -- :

Snowplow::EmrEtlRunner::UnexpectedStateError (No run folders in [s3://piv-stream-data-prod-bucket/] found):

I am using R109 of EmrEtlRunner, because I noticed there were bugs with stream mode between R102 and R104. Is there something I have wired up incorrectly?


#2

Nope, LZO is not supported there. EmrEtlRunner only eats GZIP files.


#3

@dbuscaglia The confusion is around which type of data EmrEtlRunner is processing. If you are running from the “raw” output (i.e. the direct output of the stream collector), you need to use LZO. If you are running from the “enriched” output (the Stream Enrich output), you need to use GZIP.
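
As a rough illustration of that rule, here is a minimal Python sketch. It is not part of Snowplow: the function names, the "raw"/"enriched" labels, and the assumption that LZO files end in `.lzo` and GZIP files end in `.gz` are all hypothetical and only meant to show how you could sanity-check the keys in your bucket against the format EmrEtlRunner expects.

```python
# Hypothetical helper, not a Snowplow API: encodes the rule of thumb
# "raw collector output -> LZO, Stream Enrich output -> GZIP".
REQUIRED_COMPRESSION = {
    "raw": "lzo",
    "enriched": "gzip",
}

# Assumed file suffixes for each compression format (an assumption,
# check what your S3 loader actually writes).
SUFFIXES = {
    "lzo": ".lzo",
    "gzip": ".gz",
}


def required_compression(source: str) -> str:
    """Return the compression expected for a given input type."""
    try:
        return REQUIRED_COMPRESSION[source]
    except KeyError:
        raise ValueError(f"unknown source {source!r}, expected 'raw' or 'enriched'")


def mismatched_keys(source: str, keys: list[str]) -> list[str]:
    """List the object keys whose suffix does not match the expected format."""
    suffix = SUFFIXES[required_compression(source)]
    return [key for key in keys if not key.endswith(suffix)]


if __name__ == "__main__":
    # Example: an enriched bucket that contains LZO output from the S3 loader.
    keys = ["2018-09-26-part-0001.lzo", "2018-09-26-part-0002.gz"]
    print(required_compression("enriched"))
    print(mismatched_keys("enriched", keys))
```

If the second call returns any keys for an "enriched" bucket, the S3 loader is probably writing a format that EmrEtlRunner will not pick up in stream enrich mode.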