Problem downloading MaxMind data from S3 using Scala Stream Enrich 0.9.0


#1

Hi all, first of all, this is a copy of an issue I created on GitHub.
Stream Enrich is failing to download the MaxMind data sets from S3.
I’ve set up my ip_lookups.json as per this wiki page, like this:

{
    "schema": "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/1-0-0",
    "data": {
        "name": "ip_lookups",
        "vendor": "com.snowplowanalytics.snowplow",
        "enabled": true,
        "parameters": {
            "geo": {
                "database": "GeoIPCity.dat",
                "uri": "s3://my-private-bucket.s3.amazonaws.com"
            },
            "isp": {
                "database": "GeoIPISP.dat",
                "uri": "s3://my-private-bucket.s3.amazonaws.com"
            },
            "organization": {
                "database": "GeoIPOrg.dat",
                "uri": "s3://my-private-bucket.s3.amazonaws.com"
            }
        }
    }
}

And Stream Enrich throws this error:

[main] ERROR com.snowplowanalytics.snowplow.enrich.kinesis.KinesisEnrichApp$ - Error downloading s3:/my-private-bucket.s3.amazonaws.com/GeoIPCity.dat: java.lang.IllegalArgumentException: The bucket name parameter must be specified when requesting an object
Exception in thread "main" java.lang.RuntimeException: Attempt to download s3:/my-private-bucket.s3.amazonaws.com/GeoIPCity.dat to ./ip_geo failed
    at com.snowplowanalytics.snowplow.enrich.kinesis.KinesisEnrichApp$$anonfun$12.apply(KinesisEnrichApp.scala:172)
    at com.snowplowanalytics.snowplow.enrich.kinesis.KinesisEnrichApp$$anonfun$12.apply(KinesisEnrichApp.scala:154)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at com.snowplowanalytics.snowplow.enrich.kinesis.KinesisEnrichApp$delayedInit$body.apply(KinesisEnrichApp.scala:154)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:71)
    at scala.App$$anonfun$main$1.apply(App.scala:71)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
    at scala.App$class.main(App.scala:71)
    at com.snowplowanalytics.snowplow.enrich.kinesis.KinesisEnrichApp$.main(KinesisEnrichApp.scala:71)
    at com.snowplowanalytics.snowplow.enrich.kinesis.KinesisEnrichApp.main(KinesisEnrichApp.scala)
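
Editor's note: the log prints the URI as `s3:/...` with a single slash, which suggests the authority part (where the bucket name would live) was lost before the bucket was read. That is an inference from the log, not confirmed from the Stream Enrich source. The Python sketch below only illustrates the general effect with the standard library; `bucket_and_key` is a hypothetical helper, not part of Stream Enrich or the AWS SDK.

```python
from urllib.parse import urlsplit


def bucket_and_key(uri):
    """Split an S3-style URI into (bucket, key).

    Hypothetical helper: treats the URI authority as the bucket name
    and the path (minus its leading slash) as the object key.
    """
    parts = urlsplit(uri)
    return parts.netloc, parts.path.lstrip("/")


# Well-formed URI: the bucket is found in the authority part.
good = bucket_and_key("s3://my-private-bucket/GeoIPCity.dat")

# Same URI with one slash collapsed, as printed in the error log:
# there is no authority, so the bucket comes back empty.
bad = bucket_and_key("s3:/my-private-bucket/GeoIPCity.dat")

print(good)
print(bad)
```

With an empty bucket name, an S3 client has nothing to request against, which would be consistent with the "bucket name parameter must be specified" message.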

I was told to fix the URIs here like this:

            "geo": {
                "database": "GeoIPCity.dat",
                "uri": "s3://my-private-bucket"
            },
            "isp": {
                "database": "GeoIPISP.dat",
                "uri": "s3://my-private-bucket"
            },
            "organization": {
                "database": "GeoIPOrg.dat",
                "uri": "s3://my-private-bucket"
            }

And the process is returning the same error. Any pointers?
Note: I’ve used EC2 roles (on AWS) to grant that instance full access to S3.

Thanks!


#2

Hi @juanstiza, I have re-opened your GitHub ticket. There is a regression in how we parse S3 URIs. I have posted a workaround in the ticket here.

Could you please give that a try and let us know if that resolves the issue for the moment?


#3

Still no go with auth… The EC2 instance has an IAM role attached with the AmazonS3FullAccess managed policy, and I used the AWS CLI to download the files with no issues (which might be the solution for now). Does the app read the access keys from the metadata? Should I attach an IAM user to the S3 URI?


#4

Hi @juanstiza, that’s really odd. The application creates an S3 client based on the credentials supplied by its config. If these are set as iam in the config, then it should use the same credentials as the AWS CLI, assuming you ran the AWS CLI from the same instance?

Could you try putting access and secret keys into the Stream Enrich config to see whether it is then able to download?


#5

@josh As we are testing our own storage process, we are using pipes, and we did not set the AWS configuration in the file as Kinesis wasn’t needed. Still, I find it strange that the client is not able to use the permissions granted by the IAM role, as this doc states (point 4).
Anyhow, we might just create access keys and use them. I’ll test that later today.
Thanks!


#6

It can use IAM, ENV or hardcoded values. In your config for Stream Enrich you will need to tell the application which ones to look for in the chain.
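
Editor's note: to make the three options concrete, here is a minimal Python sketch of how a config value could select a credential source. It is an illustration under assumptions, not the Stream Enrich implementation: `credential_source` is a hypothetical function, the `"iam"` keyword comes from this thread, the `"env"` keyword is assumed by analogy, and `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` are the conventional AWS environment variable names.

```python
import os


def credential_source(access_key, secret_key, environ=None):
    """Decide where credentials come from, based on the configured values.

    Hypothetical sketch: returns (source, access_key, secret_key).
    """
    environ = os.environ if environ is None else environ

    if access_key == "iam" and secret_key == "iam":
        # Keys are fetched later from the instance's IAM role,
        # so nothing is resolved here.
        return ("iam", None, None)

    if access_key == "env" and secret_key == "env":
        # Assumed keyword: read the conventional AWS variables.
        return (
            "env",
            environ.get("AWS_ACCESS_KEY_ID"),
            environ.get("AWS_SECRET_ACCESS_KEY"),
        )

    if access_key in ("iam", "env") or secret_key in ("iam", "env"):
        # Mixing a keyword with a literal key is ambiguous; refuse it.
        raise ValueError("access key and secret key must use the same source")

    # Anything else is treated as hardcoded key material.
    return ("hardcoded", access_key, secret_key)


fake_env = {"AWS_ACCESS_KEY_ID": "AKIAEXAMPLE", "AWS_SECRET_ACCESS_KEY": "example-secret"}
print(credential_source("iam", "iam"))
print(credential_source("env", "env", fake_env))
print(credential_source("my-access-key", "my-secret-key"))
```

The point of the sketch is that an IAM role attached to the instance is only consulted if the config asks for it; leaving the AWS section unset does not imply that the role will be picked up.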


#7

Sorry, now I understand (too early here heh). I’ll test today, but it will most likely work. Thanks again!