Emr-etl runner error


#1

Hello

I’ve been trying to set up Snowplow and I’m having issues :frowning:
I followed the guide and got Elastic Beanstalk set up, and I was able to get the EmrEtlRunner config.yml working with version r90. The EMR cluster starts setting up, but the run errors out when trying to download the RDB Loader logs:

root@snowplow:~/emr-etl# ./snowplow-emr-etl-runner --skip staging,archive_raw --config config/config.yml --targets config/targets/ --resolver resolver.json --enrichments file:./enrichments
D, [2017-08-16T14:01:22.054000 #6510] DEBUG -- : Initializing EMR jobflow
D, [2017-08-16T14:01:27.468000 #6510] DEBUG -- : EMR jobflow j-1X4LYCUAUP5XQ started, waiting for jobflow to complete...
I, [2017-08-16T14:01:27.481000 #6510]  INFO -- : SnowplowTracker::Emitter initialized with endpoint http://collector.namastetech.me:80/i
I, [2017-08-16T14:01:27.826000 #6510]  INFO -- : Attempting to send 1 request
I, [2017-08-16T14:01:27.842000 #6510]  INFO -- : Sending GET request to http://collector.namastetech.me:80/i...
I, [2017-08-16T14:01:27.896000 #6510]  INFO -- : GET request to http://collector.namastetech.me:80/i finished with status code 200
I, [2017-08-16T14:11:29.152000 #6510]  INFO -- : RDB Loader logs
D, [2017-08-16T14:11:29.635000 #6510] DEBUG -- : Downloading s3://ntech-snoplow-data/snowplow-log/rdb-loader/2017-08-16-14-01-22/3465bd73-393e-40a4-a622-6e4106b658af to /root/emr-etl/rdbloader20170816-6510-1qjquf2
E, [2017-08-16T14:11:31.563000 #6510] ERROR -- : Error while downloading RDB log s3://ntech-snoplow-data/snowplow-log/rdb-loader/2017-08-16-14-01-22/3465bd73-393e-40a4-a622-6e4106b658af
E, [2017-08-16T14:11:31.595000 #6510] ERROR -- : undefined method `body' for nil:NilClass
I, [2017-08-16T14:11:31.899000 #6510]  INFO -- : Attempting to send 1 request
I, [2017-08-16T14:11:31.909000 #6510]  INFO -- : Sending GET request to http://collector.namastetech.me:80/i...
I, [2017-08-16T14:11:31.991000 #6510]  INFO -- : GET request to http://collector.namastetech.me:80/i finished with status code 200
F, [2017-08-16T14:11:32.347000 #6510] FATAL -- :

In config.yml I have log: s3://ntech-snoplow-data/snowplow-log

Any idea what could be wrong? Did I miss a step or something?

Thanks in advance


#2

Hello @jpdc,

Is this your first EmrEtlRunner run? I’m puzzled about why you’ve skipped the staging step, as it is usually skipped when a previous run failed. So far I have a feeling that the enrich and shred jobs were just implicitly skipped because there’s no data.

If you’re aware of the recovery process and this is not your first run, could you please share the following (with credentials removed):

  1. Your config/config.yml
  2. Log file at s3://ntech-snoplow-data/snowplow-log/rdb-loader/2017-08-16-14-01-22/3465bd73-393e-40a4-a622-6e4106b658af
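
If it helps, here is a minimal sketch of one way to prepare both items. It assumes the credentials in config.yml sit under keys named access_key_id and secret_access_key (as in the sample EmrEtlRunner config), and that the AWS CLI is installed and configured for the log download. The function name redact_config and the output file names are only illustrative; passwords in target files are not covered and would need the same treatment.

```shell
# Mask credential values in a YAML config read from stdin.
# Assumption: the keys are named access_key_id and secret_access_key;
# adjust the pattern if your config uses different key names.
redact_config() {
  sed -E 's/^([[:space:]]*(access_key_id|secret_access_key):).*/\1 REDACTED/'
}

# Usage (not executed here):
#   redact_config < config/config.yml > config-redacted.yml
#   aws s3 cp s3://ntech-snoplow-data/snowplow-log/rdb-loader/2017-08-16-14-01-22/3465bd73-393e-40a4-a622-6e4106b658af ./rdb-loader.log
```

It is still worth reading through the redacted file before posting it, since the pattern only catches the two key names listed above.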

#3

@anton

Thanks for the reply.

Yes, it was the first run, and it was not working. I was just following the guide. I’m new to this, so I’m still trying to figure it out.

Removing the --skip option did not give an error:

root@snowplow:~/emr-etl# ./snowplow-emr-etl-runner --config config/config.yml --targets config/targets/ --resolver resolver.json --enrichments file:./enrichments
D, [2017-08-16T16:34:51.415000 #7057] DEBUG -- : Staging raw logs...
  moving files from s3://ntech-snoplow-data/old-data/ to s3://ntech-snoplow-data/processing/
  moving files from s3://ntech-snoplow-data/ to s3://ntech-snoplow-data/processing/ 

What does --skip staging do? Also, why did it not start the EMR cluster? According to the documentation:

Invoking EmrEtlRunner with just the --config option puts it into rolling mode, processing all the raw Snowplow event logs it can find in your In Bucket:


#4

Never mind, I figured it out.

One question, though: I have had this set up since yesterday, and there are barely any logs from the Elastic Beanstalk bucket. I see the cluster is stuck in:

Elasticity Spark Step: Enrich Raw Events Running 2017-08-16 12:08 (UTC-5) 57 minutes

Is it normal that it is taking so long? The master and core instances are m1.medium, which should be enough for testing. It worries me what will happen when this is live, lol.