Storage loader fails with S3 API errors when using IAM credentials


#1

I’d prefer to use IAM-based credentials if at all possible for all stages of the Snowplow pipeline, to avoid having a configuration file with hardcoded, plaintext key/secret settings.

I’ve been trying to run the storage loader with the key and secret specified as “iam”, after looking through the source code to verify that’s the most likely way to signal that to the code. But after extensive debugging, including an AWS ticket to review the S3 logs from the errors, I’m starting to think that’s not supported.

While I can see the S3 API calls being sent, including the correct IAM/instance access key and some sort of secret, the calls fail with the following error and the program exits.

— cut here —

<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidAccessKeyId</Code><Message>The AWS Access Key Id you provided does not exist in our records.</Message><AWSAccessKeyId>ASIAJWUD6PFHBCXXXXXX</AWSAccessKeyId>etc...

— cut here —

Is there a way to get the storage loader to leverage IAM roles assigned to the instance the code is running on? Or do I have to revert to putting the key/secret for an IAM user into the config file?


#2

Hi @cnamejj - the S3 API errors you mention are probably coming from the archive step that runs post-load. That step doesn’t support IAM roles because it relies on a library, Sluice, which itself lacks IAM role support: https://github.com/snowplow/sluice/issues/31

We don’t have a timeline on adding IAM to Sluice because we instead plan on moving all S3 file operations to S3DistCp, which can leverage the IAM credentials on the EMR cluster itself.

In the meantime, if you want you can disable the archive step (--skip archive_enriched) and replace it in your job DAG with a few lines of Boto, which can of course leverage IAM roles.
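To illustrate what that replacement could look like, here’s a minimal sketch using boto3 (the modern successor to Boto), which picks up the instance’s IAM role credentials automatically. The bucket and prefix names are placeholders, not the values from any real Snowplow config — substitute your own:

```python
def archive_key(key: str, src_prefix: str, dst_prefix: str) -> str:
    """Map a source object key to its destination key under the archive prefix."""
    assert key.startswith(src_prefix)
    return dst_prefix + key[len(src_prefix):]

def archive_enriched(bucket: str, src_prefix: str, dst_prefix: str) -> None:
    """Copy every object under src_prefix to dst_prefix, then delete the original.

    S3 has no server-side "move", so archiving is copy-then-delete.
    """
    import boto3  # credentials are resolved from the instance's IAM role
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            s3.copy_object(
                Bucket=bucket,
                Key=archive_key(key, src_prefix, dst_prefix),
                CopySource={"Bucket": bucket, "Key": key},
            )
            s3.delete_object(Bucket=bucket, Key=key)
```

You’d call it after the load succeeds, e.g. `archive_enriched("my-snowplow-bucket", "enriched/good/", "archive/enriched/good/")` — again, placeholder names.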


#3

Thanks, trying that out next.


#4

I re-ran skipping everything except “load” and still got the same error, so I’m going to assume that step also uses Sluice. I have other issues to resolve and need to get the software working ASAP to unblock developers, so I’ll leave the key/secret in the config for now and maybe revisit later.

— cut here —
$ snowplow-storage-loader --config ./redshift.conf --skip archive_enriched,analyze,shred,delete
Loading Snowplow events and shredded types into My Redshift database (Redshift cluster)…
Unexpected error: Expected(200) <=> Actual(403 Forbidden)
excon.error.response
:body => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>InvalidAccessKeyId</Code><Message>The AWS Access Key Id you provided does not exist in our records.</Message><AWSAccessKeyId>ASIAJNBLPANIURxxxxxx</AWSAccessKeyId>etc…


#5

Ah - yes sorry, the StorageLoader also uses Sluice to determine what shredded JSONs in S3 need loading into Redshift.

Best to wait on our rewrite of StorageLoader into Scala to resolve all this.