[SOLVED] S3DistCp is not deleting files in the first ETL step


#1

We are currently running the Snowplow ETL runner at version 106.
In the first EMR step it runs S3DistCp to copy the source files from S3 to the etl-processing S3 folder, which is in a different account. The command looks like this:

/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar --src [source_bucket] --dest [dest_bucket] --s3Endpoint s3-eu-west-1.amazonaws.com --srcPattern .*localhost\_access\_log.*\.txt.* --deleteOnSuccess --groupBy .*/_*(.+)

The command includes the --deleteOnSuccess parameter, but when the S3 source bucket is in a different AWS account, the files are copied but not deleted.
When I run the same process with an S3 source bucket in the same account as the EMR job, it works fine and the files are deleted after the copy.

The EC2 role has permission to read and delete files in the source bucket, and the bucket policy also grants the EC2 role permission to read and delete files.
I reviewed the bucket permissions, and I can delete the files manually using the AWS CLI.
The permissions on the bucket are:
{
    "Sid": "",
    "Effect": "Allow",
    "Principal": {
        "AWS": [
            [AWS EC2 role]
        ]
    },
    "Action": [
        "s3:ListBucketVersions",
        "s3:ListBucket",
        "s3:GetObjectVersion",
        "s3:GetObject",
        "s3:DeleteObject"
    ],
    "Resource": [
        [AWS Bucket],
        [AWS Bucket folders]
    ]
},
The role permissions are:
{
    "Sid": "",
    "Effect": "Allow",
    "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
    ],
    "Resource": [
        [AWS bucket],
        [AWS bucket folder]
    ]
}

Does anyone know why this is happening?
Do I need to add different permissions, or is S3DistCp unable to delete files stored in a different AWS account?

Thank you in advance
Rafael Bottega


#2

Hi Rafael,

Is the owner of those objects on S3 in a different account than the bucket and/or the EMR role that is trying to delete them?


#3

Hi knservis,
We have one AWS account where we capture some data using Elastic Beanstalk and store it on S3; the files are owned by this account.
In another account we run the Snowplow EMR job, which copies those files from the first account to an S3 bucket in the second account.
The files are copied, but S3DistCp does not delete them from the source at the end.

I hope that answers your question.


#4

Hey @boittega

Sorry for the late reply. If you log on to an EMR machine with the same configuration, and in the same account, as the one you use to run S3DistCp, can you manually delete one of the files you are trying to delete (e.g. using hdfs dfs -rm s3://blah)?
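
For anyone who would rather script that check, here is a minimal Python sketch of the same idea. It assumes boto3 is installed on the node and picks up the instance role's credentials. The function name try_delete and the bucket/key values in the usage comment are hypothetical. This goes through the S3 API directly rather than through EMRFS as hdfs dfs -rm does, so it is only a rough equivalent. Note that it really deletes the object, so point it at a throwaway file.

```python
def try_delete(s3_client, bucket, key):
    """Attempt to delete one object and report whether the call was accepted.

    s3_client is expected to behave like boto3.client("s3").
    Returns (True, None) on success, or (False, error_text) if the call raised
    (for example an AccessDenied error from S3).
    """
    try:
        s3_client.delete_object(Bucket=bucket, Key=key)
    except Exception as exc:  # boto3 surfaces S3 errors as exceptions
        return False, str(exc)
    return True, None


# Usage on the EMR node (bucket and key are placeholders, not real names):
#   import boto3
#   ok, err = try_delete(boto3.client("s3"), "example-source-bucket", "path/to/test-file.txt")
#   print("deleted" if ok else "delete failed: " + err)
```

S3 accepts a delete request for a key that does not exist, so a success here only shows that the role is allowed to delete, not that the file was there.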


#5

Hi @knservis,
Thank you for taking the time to help me. I discovered that it is an AWS “bug” related to the S3DistCp code managed by Amazon: I needed to add the s3:PutObject permission between the role and the bucket.
I raised this with AWS support, and I hope they will update the documentation or the code.
It is solved now.
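
For later readers, a minimal sketch of that change, written in Python so the statement can be generated and checked before pasting it into a policy. It assumes the statements have the shape posted in #1 and that "between the role and the bucket" means adding the action to both the role policy and the bucket policy. The helper name with_action and the example bucket ARNs are hypothetical placeholders, not the real values.

```python
import copy
import json


def with_action(statement, action):
    """Return a copy of a policy statement with `action` present in its Action list."""
    updated = copy.deepcopy(statement)
    actions = updated.get("Action", [])
    if isinstance(actions, str):
        # IAM allows a single action to be given as a plain string
        actions = [actions]
    if action not in actions:
        actions.append(action)
    updated["Action"] = actions
    return updated


# Role statement from #1, with placeholder ARNs standing in for the elided resources
role_statement = {
    "Sid": "",
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket", "s3:DeleteObject"],
    "Resource": [
        "arn:aws:s3:::example-source-bucket",
        "arn:aws:s3:::example-source-bucket/*",
    ],
}

fixed_statement = with_action(role_statement, "s3:PutObject")
print(json.dumps(fixed_statement, indent=2))
```

The same helper can be applied to the bucket policy statement, which additionally carries the Principal block.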