UnmatchedLzoFilesError after staging events


#1

I’ve had the batch pipeline running for over a week now, and the other day EmrEtlRunner began to fail with the following message:

    D, [2016-06-18T00:05:07.233000 #8] DEBUG -- : Waiting a minute to allow S3 to settle (eventual consistency)
    D, [2016-06-18T00:06:07.237000 #8] DEBUG -- : Initializing EMR jobflow
    F, [2016-06-18T00:06:09.814000 #8] FATAL -- :

    Snowplow::EmrEtlRunner::UnmatchedLzoFilesError (Processing bucket contains 4775 .lzo and .lzo.index files, expected an even number):
    /usr/local/bin/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:99:in `initialize'
    /usr/local/bin/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
    /usr/local/bin/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
    /usr/local/bin/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `common_method_added'
    /usr/local/bin/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:67:in `run'
    /usr/local/bin/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
    /usr/local/bin/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
    /usr/local/bin/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `common_method_added'
    file:/usr/local/bin/snowplow-emr-etl-runner!/emr-etl-runner/bin/snowplow-emr-etl-runner:39:in `(root)'
    org/jruby/RubyKernel.java:1091:in `load'
    file:/usr/local/bin/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    org/jruby/RubyKernel.java:1072:in `require'
    file:/usr/local/bin/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    /tmp/jruby5459262531968505237extract/jruby-stdlib-1.7.20.1.jar!/META-INF/jruby.home/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1:in `(root)'

    Error running EmrEtlRunner, exiting with return code 1. StorageLoader not run

If I clear out the events staged for processing, it runs normally again for a while, but the error always comes back eventually. I don’t want to lose a day’s worth of data each time this happens. What is causing this and how can I fix it?


#2

Hi @sphoid - never delete your raw event files just because you have some kind of load failure! Always try to figure out the underlying problem and fix it.

If you delete your raw event files from S3 then you can never recover those events in the future.

In this case, the error is caused by EmrEtlRunner checking that an even number of .lzo and .lzo.index files has been copied over to staging - the implication being that if the number is odd, the move to staging has detached at least one .lzo file from its .lzo.index file.
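To make the check concrete, here is a minimal Python sketch of the parity test described above, plus a helper that lists which files are actually unpaired. This is not the EmrEtlRunner code (which is Ruby); the function names are hypothetical, and it assumes you have already obtained the list of object keys under the processing location yourself.

```python
def split_lzo_keys(keys):
    """Separate .lzo data files from .lzo.index files (hypothetical helper)."""
    index_files = {k for k in keys if k.endswith(".lzo.index")}
    data_files = {k for k in keys if k.endswith(".lzo")}
    return data_files, index_files


def passes_parity_check(keys):
    """Mimic the check described in the thread: the combined count must be even."""
    data_files, index_files = split_lzo_keys(keys)
    return (len(data_files) + len(index_files)) % 2 == 0


def find_unmatched(keys):
    """Return (.lzo files with no index, .lzo.index files with no data file)."""
    data_files, index_files = split_lzo_keys(keys)
    data_without_index = sorted(
        d for d in data_files if d + ".index" not in index_files
    )
    index_without_data = sorted(
        i for i in index_files if i[: -len(".index")] not in data_files
    )
    return data_without_index, index_without_data
```

Note that a parity check like this cannot detect every mismatch: two unpaired files still add up to an even count.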

When we wrote this functionality we didn’t know that Hadoop can actually process an .lzo file even if its .lzo.index file is missing, and will equally ignore a .lzo.index file that has no accompanying .lzo file. As such, the UnmatchedLzoFilesError in EmrEtlRunner makes the process more fragile without making it safer.

We will remove this functionality in the next EmrEtlRunner release: https://github.com/snowplow/snowplow/issues/2740

In the meantime, when this exception is thrown, a workaround is to:

  1. Delete a single .lzo.index file from the staging bucket (it doesn’t matter which index file you delete - all .lzo files will still be read by Hadoop)
  2. Restart the pipeline with --skip staging
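For step 1, a small Python sketch of how one might pick the index file to remove. The function name is hypothetical and the key list is assumed to come from your own listing of the bucket. Any index file works according to the advice above; preferring an index file that has no matching .lzo file is purely a choice made in this sketch.

```python
def pick_index_to_delete(keys):
    """Suggest one .lzo.index key to delete so the combined count becomes even.

    Returns None if the count is already even or there is no index file.
    """
    index_files = sorted(k for k in keys if k.endswith(".lzo.index"))
    data_files = {k for k in keys if k.endswith(".lzo")}

    if (len(index_files) + len(data_files)) % 2 == 0:
        return None  # the check would already pass
    if not index_files:
        return None  # nothing suitable to delete

    # Assumption of this sketch: prefer an index file without a data file.
    orphans = [i for i in index_files if i[: -len(".index")] not in data_files]
    return orphans[0] if orphans else index_files[0]
```

The sketch only returns a key and deletes nothing itself, so the result can be checked by hand before removing the file and restarting with --skip staging.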

#3

Ah, good stuff to know. Thanks, I’ll try the workaround.