Hadoop dependencies in LZO S3 Sink

Hi All,

I am new to Snowplow and I am following https://github.com/snowplow/snowplow/wiki/Kinesis-LZO-S3-Sink-Setup to build the Kinesis LZO S3 Sink. However, when I run sbt I get the following error:


```
[specs2.fixed.env150564385-1] DEBUG com.hadoop.compression.lzo.GPLNativeCodeLoader - location: /native/Linux-amd64-64/lib
9 [specs2.fixed.env150564385-1] DEBUG com.hadoop.compression.lzo.GPLNativeCodeLoader - temporary unpacked path: /tmp/unpacked-875910850732014887-libgplcompression.so
14 [specs2.fixed.env150564385-1] INFO com.hadoop.compression.lzo.GPLNativeCodeLoader - Loaded native gpl library from the embedded binaries
73 [specs2.fixed.env150564385-1] INFO com.hadoop.compression.lzo.LzoCodec - Successfully loaded & initialized native-lzo library [hadoop-lzo rev 52decc77982b58949890770d22720a91adce0c3f]
298 [specs2.fixed.env150564385-1] INFO org.apache.hadoop.conf.Configuration.deprecation - hadoop.native.lib is deprecated. Instead, use io.native.lib.available
306 [specs2.fixed.env150564385-1] DEBUG org.apache.hadoop.util.Shell - Failed to detect a valid hadoop home directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
	at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:326)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:351)
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
	at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1437)
	at com.hadoop.compression.lzo.LzoCodec.isNativeLzoLoaded(LzoCodec.java:94)
	at com.hadoop.compression.lzo.LzoCodec.getCompressorType(LzoCodec.java:154)
	at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150)
	at com.hadoop.compression.lzo.LzopCodec.getCompressor(LzopCodec.java:171)
	at com.hadoop.compression.lzo.LzopCodec.createIndexedOutputStream(LzopCodec.java:82)
	at com.snowplowanalytics.snowplow.storage.kinesis.s3.serializers.LzoSerializer$.serialize(LzoSerializer.scala:68)
	at com.snowplowanalytics.snowplow.storage.kinesis.s3.serializers.LzoSerializerSpec$$anonfun$1$$anonfun$apply$1.apply(LzoSerializerSpec.scala:84)
	at com.snowplowanalytics.snowplow.storage.kinesis.s3.serializers.LzoSerializerSpec$$anonfun$1$$anonfun$apply$1.apply(LzoSerializerSpec.scala:65)
	at org.specs2.execute.AsResult$$anon$2$$anonfun$asResult$1.apply(AsResult.scala:21)
	at org.specs2.execute.AsResult$$anon$2$$anonfun$asResult$1.apply(AsResult.scala:21)
	at org.specs2.execute.ResultExecution$class.execute(ResultExecution.scala:23)
	at org.specs2.execute.ResultExecution$.execute(ResultExecution.scala:118)
	at org.specs2.execute.AsResult$$anon$2.asResult(AsResult.scala:21)
	at org.specs2.execute.AsResult$.apply(AsResult.scala:25)
	at org.specs2.specification.core.AsExecution$$anon$1$$anonfun$execute$1.apply(AsExecution.scala:15)
	at org.specs2.specification.core.AsExecution$$anon$1$$anonfun$execute$1.apply(AsExecution.scala:15)
	at org.specs2.execute.ResultExecution$class.execute(ResultExecution.scala:23)
	at org.specs2.execute.ResultExecution$.execute(ResultExecution.scala:118)
	at org.specs2.execute.Result$$anon$10.asResult(Result.scala:229)
	at org.specs2.execute.AsResult$.apply(AsResult.scala:25)
	at org.specs2.specification.core.Execution$$anonfun$result$2.apply(Execution.scala:193)
	at org.specs2.specification.core.Execution$$anonfun$result$2.apply(Execution.scala:193)
	at org.specs2.specification.core.Execution$$anonfun$withEnv$1$$anonfun$apply$5$$anonfun$apply$6.apply(Execution.scala:196)
	at org.specs2.execute.ResultExecution$class.execute(ResultExecution.scala:23)
	at org.specs2.execute.ResultExecution$.execute(ResultExecution.scala:118)
	at org.specs2.execute.Result$$anon$10.asResult(Result.scala:229)
	at org.specs2.execute.AsResult$.apply(AsResult.scala:25)
	at org.specs2.specification.core.Execution$$anonfun$withEnv$1$$anonfun$apply$5.apply(Execution.scala:196)
	at org.specs2.specification.core.Execution$$anonfun$withEnv$1$$anonfun$apply$5.apply(Execution.scala:196)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
330 [specs2.fixed.env150564385-1] DEBUG org.apache.hadoop.util.Shell - setsid exited with exit code 0
```


I checked build.sbt and found that it includes the following dependencies:
Dependencies.Libraries.hadoop
Dependencies.Libraries.hadoopLZO
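
As far as I can tell, these point at the standard Hadoop artifacts, something like the following (a sketch from memory, not copied from the repo; the versions are my guess):

```scala
// project/Dependencies.scala (sketch -- coordinates are the usual ones, versions illustrative)
val hadoop    = "org.apache.hadoop"         % "hadoop-common" % "2.7.3"
val hadoopLZO = "com.hadoop.gplcompression" % "hadoop-lzo"    % "0.4.20"
```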

The machine that I am using to build the application is not part of any Hadoop cluster and Hadoop is not installed, which might be the reason for the error. Please note that after this exception the jar is still built and I can execute it.

My question is this: I understand that the LZO S3 Sink runs on a machine, reads data from Kinesis, and stores it on S3. What is the purpose of the Hadoop dependency here? Is there a use case that I am failing to understand?

Is it safe to comment out those dependencies?

Hello Rajan,

I can’t say that I have encountered this issue before, even when building on my own machine, which doesn’t have any kind of Hadoop installation.

However, those dependencies are needed to write LZO files to Hadoop-compatible file systems (such as S3).
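
To make that concrete, here is a minimal sketch of what the sink does with them (my own illustration, not the sink’s actual code; the real logic lives in LzoSerializer.scala, which you can see in your stack trace). It assumes the native liblzo2 from `apt-get install lzop liblzo2-dev` is available:

```scala
import java.io.ByteArrayOutputStream
import org.apache.hadoop.conf.Configuration
import com.hadoop.compression.lzo.LzopCodec

object LzoSketch extends App {
  // hadoop-lzo's LzopCodec is configured through a Hadoop Configuration --
  // which is why hadoop-common has to be on the classpath as well.
  val codec = new LzopCodec()
  codec.setConf(new Configuration())

  // Compress an in-memory buffer into an .lzo stream. The HADOOP_HOME probe
  // in your stack trace fires the first time the codec reads its Configuration.
  val raw = new ByteArrayOutputStream()
  val lzo = codec.createOutputStream(raw)
  lzo.write("serialized event data".getBytes("UTF-8"))
  lzo.close()

  // raw.toByteArray now holds LZO-compressed bytes, ready to upload to S3.
  println(s"compressed to ${raw.size} bytes")
}
```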

Hi @rajan.patki - this is how we install the Hadoop dependencies for compiling:

I was following https://github.com/snowplow/snowplow/wiki/Kinesis-LZO-S3-Sink-Setup and it has instructions to install LZO:


```
$ sudo apt-get install lzop liblzo2-dev
```


Isn’t that enough?

Hi Ben,

I had already installed LZO with `sudo apt-get install lzop liblzo2-dev`.

If you look at the error above, it says `HADOOP_HOME or hadoop.home.dir are not set.`

It is looking for the HADOOP_HOME environment variable (or the hadoop.home.dir system property). How do we manage this when we don’t have a Hadoop installation?

Thanks,
Rajan

Since this is a debug log message, I think you can safely dismiss it as not being terribly important.
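
If you want to make the stack trace go away anyway, a minimal sketch of a workaround (my own suggestion, not something from the Snowplow wiki) is to point hadoop.home.dir at any existing directory when the tests run; nothing is actually read from it:

```scala
// build.sbt -- javaOptions are only applied to forked JVMs, hence fork := true
fork in Test := true
javaOptions in Test += "-Dhadoop.home.dir=/tmp" // any existing absolute path satisfies Shell.checkHadoopHome
```

Exporting HADOOP_HOME to an existing directory before running sbt should work just as well.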